Open ccoffline opened 4 years ago
Is it related to #4841?
First, you need to check whether the tablet is really missing from the BE node.
Is it related to #4841?
It is the same error, but not with a colocate join, and it was resolved a few hours later. It blocked the query at the time. Shouldn't FE retry on another BE and continue the query?
Shouldn't FE retry on another BE and continue the query?
It depends on the query. Currently, FE retries a query only on an RPC exception, and it will not retry if some data has already been sent to the client.
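The retry rule described above can be written down as a small predicate. This is only a sketch of the rule as stated in this thread, not the FE implementation (FE is written in Java); the names `FailureKind` and `should_retry` are hypothetical, and treating a missing tablet as a non-RPC error is an assumption based on the behaviour reported here.

```cpp
#include <cassert>

// Hypothetical failure categories, for illustration only.
// These names are not taken from the Doris code base.
enum class FailureKind {
    RpcException,   // the request to the BE failed at the RPC layer
    ExecutionError  // the BE answered but reported an error (assumed to cover a missing tablet)
};

// Returns true when a retry is allowed under the rule described above:
// only RPC exceptions are retried, and never after data reached the client.
bool should_retry(FailureKind kind, bool data_sent_to_client) {
    if (data_sent_to_client) {
        return false;  // partial results were already delivered
    }
    return kind == FailureKind::RpcException;
}
```

Under this reading, an error reported by the BE itself is never retried on another replica, which would match the blocked query described above.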
First, you need to check whether the tablet is really missing from the BE node.
The meta_tool is too hard to use because it only takes one root_path and we have multiple ones. I was planning to work on this.
I also cannot reproduce it. Any advice on how I can trigger this, for example by deleting some files?
The meta_tool is too hard to use because it only takes one root_path and we have multiple ones. I was planning to work on this.
You can simply run show tablet 18444553; and then execute "show proc" to get the meta URL, and check it.
You can simply run show tablet 18444553; and then execute "show proc" to get the meta URL, and check it.
The query may succeed at that point. And when it failed, the error didn't show which BE had the problem. https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/exec/olap_scanner.cpp#L73-L83
It was killing me to find out which BE. I will submit a PR later to add the BE host to this error.
I wonder how you reproduced this error when debugging #4841 @morningman
It was killing me to find out which BE.
That is an issue. It would be better to add the BE IP info to the error log.
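As a rough illustration of that suggestion, the message could carry the backend address next to the tablet identifiers. The function name, the message wording, and the `backend_host` parameter are all hypothetical; how the BE would look up its own address is not covered in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Builds a "tablet not found" message that also names the backend.
// `backend_host` is a hypothetical parameter supplied by the caller;
// the message text is illustrative, not the actual Doris error string.
std::string tablet_not_found_message(int64_t tablet_id, int32_t schema_hash,
                                     const std::string& backend_host) {
    std::ostringstream ss;
    ss << "failed to get tablet. tablet_id=" << tablet_id
       << ", schema_hash=" << schema_hash
       << ", backend=" << backend_host;
    return ss.str();
}
```

With the host in the message, a failed query would point directly at the replica to inspect instead of requiring a search across every BE.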
I wonder how you reproduced this error when debugging #4841 @morningman
1 FE and 4 BEs. Create a table with 1 bucket and 3 replicas, insert some data, and query:
select * from (a join a) union all (a join a);
@morningman Could you please be more specific? I have 1 FE and 5 BEs, with the SQL below:
CREATE DATABASE IF NOT EXISTS `test`; USE `test`;
DROP TABLE IF EXISTS `test_t`;
CREATE TABLE `test_t` (
`name` VARCHAR(1000) NULL COMMENT "string"
) ENGINE=OLAP
DUPLICATE KEY(`name`)
DISTRIBUTED BY HASH(`name`) BUCKETS 1
PROPERTIES (
"replication_num" = "3",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);
INSERT INTO test_t (`name`)
VALUES ('aaaaa'), ('ccccc')
;
select * from test_t a join test_t b union all select * from test_t c join test_t d;
I tried many times, but this SQL never triggers the error.
Describe the bug
To Reproduce
When I delete any file in be/storage/data and launch a query, I get a different error, which comes from https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/exec/olap_scanner.cpp#L117-L133
So I cannot directly reproduce it.
Troubleshoot
First, _tablet is nullptr, which produces this error. https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/exec/olap_scanner.cpp#L73-L83
Here, tablet = _get_tablet_unlocked(tablet_id, schema_hash); must be nullptr, and it didn't get corrected later. https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/olap/tablet_manager.cpp#L582-L610
Here, tablet_map cannot find the tablet. https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/olap/tablet_manager.cpp#L1409-L1428
I have no idea when this _tablet_map_array is updated or why BE can't find the tablet. https://github.com/apache/incubator-doris/blob/f40868a4805f0ba503bcca9a9c04f704b61c121b/be/src/olap/tablet_manager.h#L199-L214
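To make the lookup path easier to reason about, here is a minimal sketch of a sharded tablet map. It assumes that _tablet_map_array is an array of per-shard maps selected by the tablet id; that is an assumption about the design, not something confirmed in this thread. `ShardedTabletMap`, `Tablet`, and all method names are hypothetical, and locking is left out. It does not explain when the real array is updated. It only shows that a lookup returns nullptr whenever the entry was never added to, or was already dropped from, the in-memory map, regardless of what is on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tablet object.
struct Tablet {
    int64_t tablet_id;
};

// Hypothetical sharded map: one hash map per shard, chosen by tablet id.
class ShardedTabletMap {
public:
    // shard_count must be a power of two so the bit mask selects a valid shard.
    explicit ShardedTabletMap(size_t shard_count)
            : _shards(shard_count), _mask(shard_count - 1) {
        assert(shard_count > 0 && (shard_count & (shard_count - 1)) == 0);
    }

    // Registers a tablet in the shard that owns its id.
    void add(std::shared_ptr<Tablet> tablet) {
        const int64_t id = tablet->tablet_id;
        _shard_for(id)[id] = std::move(tablet);
    }

    // Removes a tablet from the in-memory map only.
    void drop(int64_t tablet_id) { _shard_for(tablet_id).erase(tablet_id); }

    // Returns nullptr when the id is not present in its shard.
    std::shared_ptr<Tablet> get(int64_t tablet_id) {
        auto& shard = _shard_for(tablet_id);
        auto it = shard.find(tablet_id);
        if (it == shard.end()) {
            return nullptr;
        }
        return it->second;
    }

private:
    using TabletMap = std::unordered_map<int64_t, std::shared_ptr<Tablet>>;

    TabletMap& _shard_for(int64_t tablet_id) {
        return _shards[static_cast<size_t>(tablet_id) & _mask];
    }

    std::vector<TabletMap> _shards;
    size_t _mask;
};
```

In this model, the places worth checking are every code path that adds to or drops from the map, since a scanner that asks in between sees nullptr.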