apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

[Bug] failed to get tablet. reason=tablet does not exist #4856

Open ccoffline opened 3 years ago

ccoffline commented 3 years ago

Describe the bug

errCode = 2, detailMessage = failed to get tablet. tablet_id=18444553, with schema_hash=2071938847, reason=tablet does not exist

To Reproduce

When I delete any file in be/storage/data and run a query, I get a different error:

errCode = 2, detailMessage = failed to initialize storage reader. tablet=10051.2107108059.d1497902f3afdc62-b351656515255ca8, res=-402, backend=****

which comes from https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/exec/olap_scanner.cpp#L117-L133

So I cannot reproduce the original error directly.

Troubleshoot

First, _tablet is nullptr, which produces this error: https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/exec/olap_scanner.cpp#L73-L83

Here, tablet = _get_tablet_unlocked(tablet_id, schema_hash); must have returned nullptr, and nothing later replaced it. https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/olap/tablet_manager.cpp#L582-L610

Here, the lookup in tablet_map fails to find the tablet: https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/olap/tablet_manager.cpp#L1409-L1428

I have no idea when this _tablet_map_array is updated or why the BE can't find the tablet. https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/olap/tablet_manager.h#L199-L214
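
The lookup chain traced above can be pictured with a small sketch. It is not Doris code: the names `TabletManagerSketch`, `get_tablet`, and `open_scanner`, and the modulo sharding, are assumptions made for illustration, and the real logic lives in the linked C++ files. It only shows how a missing map entry turns into the "tablet does not exist" message; it says nothing about when the real map is updated.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Tablet:
    tablet_id: int
    schema_hash: int


class TabletManagerSketch:
    """Hypothetical stand-in for a sharded tablet map; not the Doris implementation."""

    def __init__(self, shard_count: int = 4) -> None:
        # One dict per shard, mapping tablet_id -> tablets registered under that id.
        self._shards: List[Dict[int, List[Tablet]]] = [{} for _ in range(shard_count)]

    def _shard_for(self, tablet_id: int) -> Dict[int, List[Tablet]]:
        # Assumption: the shard is chosen from the tablet id alone.
        return self._shards[tablet_id % len(self._shards)]

    def add_tablet(self, tablet: Tablet) -> None:
        self._shard_for(tablet.tablet_id).setdefault(tablet.tablet_id, []).append(tablet)

    def get_tablet(self, tablet_id: int, schema_hash: int) -> Optional[Tablet]:
        # Both the id and the schema hash must match; otherwise the caller gets None.
        for tablet in self._shard_for(tablet_id).get(tablet_id, []):
            if tablet.schema_hash == schema_hash:
                return tablet
        return None


def open_scanner(manager: TabletManagerSketch, tablet_id: int, schema_hash: int) -> str:
    # Mirrors the shape of the reported message when the lookup comes back empty.
    tablet = manager.get_tablet(tablet_id, schema_hash)
    if tablet is None:
        return (f"failed to get tablet. tablet_id={tablet_id}, "
                f"with schema_hash={schema_hash}, reason=tablet does not exist")
    return "OK"
```

In this sketch, any tablet that was never registered on the node, or was registered under a different schema hash, yields the same message, so the message alone cannot tell those cases apart.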

morningman commented 3 years ago

Is it related to #4841?

EmmyMiao87 commented 3 years ago

First, you need to check whether the tablet is really missing from the BE node.

ccoffline commented 3 years ago

Is it related to #4841?

It is the same error, but without a colocate join. And it was resolved a few hours later. It blocked the query at the time; shouldn't FE retry on another BE and continue the query?

morningman commented 3 years ago

shouldn't FE retry on another BE and continue the query?

It depends on the query. Currently, FE retries a query only on an RPC exception, and it will not retry if some data has already been sent to the client.
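
As a rough illustration of the rule described above, here is a minimal sketch. The exception classes, `should_retry`, and `run_query` are hypothetical names invented for this example and do not correspond to FE code; the sketch only encodes the two stated conditions (RPC exception, nothing sent to the client yet).

```python
from typing import Callable, List


class RpcError(Exception):
    """Hypothetical: transport-level failure while talking to a BE."""


class QueryError(Exception):
    """Hypothetical: error reported back by a BE, e.g. 'tablet does not exist'."""


def should_retry(error: Exception, rows_sent: int) -> bool:
    # Retry only on an RPC failure, and only if the client has received nothing.
    return isinstance(error, RpcError) and rows_sent == 0


def run_query(execute: Callable[[int, List[str]], None], max_attempts: int = 3) -> List[str]:
    # `sent` stands in for rows already streamed to the client.
    sent: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            execute(attempt, sent)
            return sent
        except Exception as error:
            # Give up on the last attempt or when the retry rule does not apply.
            if attempt == max_attempts or not should_retry(error, len(sent)):
                raise
    return sent
```

Under these assumptions, an error that a BE reports in its response, rather than a failed RPC, is raised to the client on the first attempt.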

ccoffline commented 3 years ago

First, you need to check whether the tablet is really missing from the BE node.

The meta_tool is too hard to use because it only takes one root_path and we have several. I was planning to work on this. Also, I cannot reproduce the error, so is there any advice on how to trigger it, such as deleting some files?

morningman commented 3 years ago

The meta_tool is too hard to use because it only takes one root_path and we have several. I was planning to work on this.

You can simply run show tablet 18444553; and then execute "show proc" to get the meta URL and check it.

ccoffline commented 3 years ago

You can simply run show tablet 18444553; and then execute "show proc" to get the meta URL and check it.

The query may succeed at that time. And when it failed, the message didn't show which BE had the error. https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/exec/olap_scanner.cpp#L73-L83

It was killing me to find out which BE it was. I will submit a PR later to add the BE host to this message.

ccoffline commented 3 years ago

I wonder how you reproduced this error when debugging #4841 @morningman

morningman commented 3 years ago

It was killing me to find out which BE it was.

That is an issue; it would be better to add the BE IP info to the error log.

I wonder how you reproduced this error when debugging #4841 @morningman

1 FE and 4 BEs. Create a table with 1 bucket and 3 replicas, insert some data, and query:

select * from (a join a) union all (a join a);

ccoffline commented 3 years ago

@morningman Could you please be more specific? I have 1 FE and 5 BEs and used the SQL below:

CREATE DATABASE IF NOT EXISTS `test`; USE `test`;
DROP TABLE IF EXISTS `test_t`;
CREATE TABLE `test_t` (
  `name` VARCHAR(1000) NULL COMMENT "string"
) ENGINE=OLAP
DUPLICATE KEY(`name`)
DISTRIBUTED BY HASH(`name`) BUCKETS 1
PROPERTIES (
"replication_num" = "3",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);

INSERT INTO test_t (`name`)
VALUES ('aaaaa'), ('ccccc')
;

select * from test_t a join test_t b union all select * from test_t c join test_t d;

I tried many times, but this SQL never triggers the error.