GreptimeTeam / greptimedb

An open-source, cloud-native, unified time series database for metrics, logs and events with SQL/PromQL supported. Available on GreptimeCloud.
https://greptime.com/
Apache License 2.0
4.2k stars 301 forks source link

Failed to open region: Empty region directory #4127

Open evenyag opened 3 months ago

evenyag commented 3 months ago

What type of bug is this?

Crash

What subsystems are affected?

Distributed Cluster

Minimal reproduce step

  1. Create a table with multiple regions.
  2. Drop the table
  3. While dropping the table, restart the cluster

What did you expect to see?

The cluster is restarted

What did you see instead?

Datanode crashes.

What operating system did you use?

Unrelated

What version of GreptimeDB did you use?

0.8.2

Relevant log output and stack trace

2024-06-10T13:01:56+08:00   2024-06-10T05:01:56.804391Z  INFO RegionServerRequestHandler::handle:RegionServer::handle_region_request{region_id="924401516150784(215229, 0)"}:handle_request{request_type="Create"}: datanode::region_server: Region 924401516150784(215229, 0) is registered to engine mito
2024-06-10T13:01:56+08:00   2024-06-10T05:01:56.816112Z  INFO RegionServerRequestHandler::handle:RegionServer::handle_region_request{region_id="924401516150785(215229, 1)"}:handle_request{request_type="Create"}: datanode::region_server: Region 924401516150785(215229, 1) is registered to engine mito
2024-06-10T13:01:56+08:00   2024-06-10T05:01:56.843830Z  INFO RegionServerRequestHandler::handle:RegionServer::handle_region_request{region_id="924401516150787(215229, 3)"}:handle_request{request_type="Create"}: datanode::region_server: Region 924401516150787(215229, 3) is registered to engine mito
2024-06-10T13:01:56+08:00   2024-06-10T05:01:56.866768Z  INFO RegionServerRequestHandler::handle:RegionServer::handle_region_request{region_id="924401516150786(215229, 2)"}:handle_request{request_type="Create"}: datanode::region_server: Region 924401516150786(215229, 2) is registered to engine mito

2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.394926Z  INFO mito2::worker::handle_drop: Try to drop region: 924401516150787(215229, 3)
2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.395009Z  INFO mito2::worker::handle_drop: Try to drop region: 924401516150785(215229, 1)
2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.394999Z  INFO mito2::worker::handle_drop: Try to drop region: 924401516150784(215229, 0)

2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.412498Z  INFO mito2::region: Stopped region manifest manager, region_id: 924401516150784(215229, 0)
2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.412554Z  INFO mito2::worker::handle_drop: Region 924401516150784(215229, 0) is dropped logically, but some files are not deleted yet
2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.412729Z  INFO RegionServerRequestHandler::handle:RegionServer::handle_region_request{region_id="924401516150784(215229, 0)"}:handle_request{request_type="Drop"}: datanode::region_server: Region 924401516150784(215229, 0) is deregistered from engine mito
2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.412822Z  INFO datanode::alive_keeper: Deregister alive countdown for region 924401516150784(215229, 0)
2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.418047Z  INFO mito2::region: Stopped region manifest manager, region_id: 924401516150787(215229, 3)
2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.418183Z  INFO mito2::worker::handle_drop: Region 924401516150787(215229, 3) is dropped logically, but some files are not deleted yet
2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.418302Z  INFO RegionServerRequestHandler::handle:RegionServer::handle_region_request{region_id="924401516150787(215229, 3)"}:handle_request{request_type="Drop"}: datanode::region_server: Region 924401516150787(215229, 3) is deregistered from engine mito
2024-06-10T13:01:57+08:00   2024-06-10T05:01:57.418456Z  INFO datanode::alive_keeper: Deregister alive countdown for region 924401516150787(215229, 3)
2024-06-10T13:03:19+08:00   2024-06-10T05:03:19.440529Z  INFO mito2::worker::handle_open: Region 924401516150785(215229, 1) is dropped, result: Ok(true)

2024-06-10T13:03:19+08:00   2024-06-10T05:03:19.452182Z  INFO mito2::worker::handle_open: Try to open region 924401516150786(215229, 2)
2024-06-10T13:03:41+08:00   2024-06-10T05:03:41.399459Z  INFO mito2::worker::handle_open: Try to open region 924401516150785(215229, 1)
2024-06-10T13:03:41+08:00   1: Failed to handle request for region 924401516150785(215229, 1), at /home/runner/work/greptimedb/greptimedb/greptimedb/src/datanode/src/region_server.rs:497:14

Error: 0: Failed to start datanode, at /home/runner/work/greptimedb/greptimedb/greptimedb/src/cmd/src/datanode.rs:277:14
1: Failed to handle request for region 924401516150785(215229, 1), at /home/runner/work/greptimedb/greptimedb/greptimedb/src/datanode/src/region_server.rs:497:14
2: Failed to open region, at /home/runner/work/greptimedb/greptimedb/greptimedb/src/mito2/src/worker/handle_open.rs:133:54
3: Empty region directory, region_id: 924401516150785(215229, 1), region_dir: data/xxxxxx/public/215229/215229_0000000001/, at /home/runner/work/greptimedb/greptimedb/greptimedb/src/mito2/src/region/opener.rs:256:14
evenyag commented 1 month ago

The procedure drops the table and removes some of the region directories. But while restarting the cluster, the procedure may roll back and restore the metadata. Then the datanode will open the dropped directory and unable to start.

sunng87 commented 1 month ago

We can log this error and let the server starts. It would be nice to have a background job to cleanup the state.