Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

Alluxio TableMaster fails to recognize `remove_table` entry during journal replaying #12627

Closed hycsam closed 3 years ago

hycsam commented 3 years ago

Alluxio Setup:
Version: Local build from latest master branch, 2.5.0-SNAPSHOT
Cluster mode: Single master, with Alluxio workers colocated with Presto worker nodes

Describe the bug
When the Alluxio master is restarted, it fails to recognize the `remove_table` journal entry during journal replay and stays stuck there indefinitely. Here is the relevant log segment:

2020-12-10 23:07:20,291 ERROR UfsJournalCheckpointThread - Journal replay error: TableMaster: Unrecognized journal entry: sequence_number: 14814
remove_table {
  db_name: "DB_NAME"
  table_name: "TABLE_NAME"
  version: 1
}

2020-12-10 23:07:20,293 INFO  AbstractMaster - TableMaster: Stopped secondary master.
2020-12-10 23:07:20,293 INFO  AbstractMaster - MetaMaster: Stopped secondary master.
2020-12-10 23:07:20,293 INFO  AbstractMaster - FileSystemMaster: Stopped secondary master.
2020-12-10 23:07:20,293 INFO  AbstractMaster - BlockMaster: Stopped secondary master.
2020-12-10 23:07:20,293 INFO  AbstractMaster - MetricsMaster: Stopped secondary master.
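To locate the offending entry quickly in a long master log, something like the following could help. This is only a minimal sketch in Python, not part of Alluxio: the function name `find_unrecognized_entries` is hypothetical, and the regular expression assumes the error line and the entry body look exactly like the segment above.

```python
import re

# Sample text copied from the log segment in this report.
LOG = '''2020-12-10 23:07:20,291 ERROR UfsJournalCheckpointThread - Journal replay error: TableMaster: Unrecognized journal entry: sequence_number: 14814
remove_table {
  db_name: "DB_NAME"
  table_name: "TABLE_NAME"
  version: 1
}
'''

# Assumed layout: the error line ends with the sequence number, and the
# next line starts with the entry type followed by " {".
PATTERN = re.compile(
    r"Journal replay error: (?P<master>\w+): Unrecognized journal entry: "
    r"sequence_number: (?P<seq>\d+)\s*\n(?P<entry>\w+) \{"
)


def find_unrecognized_entries(log_text):
    """Return (master, sequence_number, entry_type) for each replay error."""
    return [
        (m.group("master"), int(m.group("seq")), m.group("entry"))
        for m in PATTERN.finditer(log_text)
    ]


if __name__ == "__main__":
    for master, seq, entry in find_unrecognized_entries(LOG):
        print(f"{master} could not replay '{entry}' at sequence number {seq}")
```

Run against the segment above, it reports the `remove_table` entry at sequence number 14814 for TableMaster.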

To Reproduce
Restart the Alluxio master after a table delete operation. In this case it happened while syncing with Hive, after some Hive tables had been dropped.

Expected behavior
TableMaster should recognize and handle the `remove_table` entry so that journal replay finishes successfully.
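To illustrate the failure mode and the expected behavior, here is a simplified model in Python. It is not Alluxio's Java implementation: `TableMasterModel`, `replay`, and the dictionary-based journal entries are hypothetical, and the `add_table` entry type is used only for illustration. The assumption is that replay asks the master to process each entry and fails when the master does not recognize the entry type.

```python
class UnrecognizedEntryError(Exception):
    """Raised when no handler recognizes a journal entry during replay."""


class TableMasterModel:
    """Toy stand-in for a master that rebuilds its state from a journal."""

    def __init__(self, handle_remove_table):
        self.tables = set()
        # False models the reported bug; True models the expected behavior.
        self.handle_remove_table = handle_remove_table

    def process_entry(self, entry):
        """Apply one entry; return False if the entry type is unknown."""
        key = (entry["db_name"], entry["table_name"])
        if entry["type"] == "add_table":
            self.tables.add(key)
            return True
        if entry["type"] == "remove_table" and self.handle_remove_table:
            self.tables.discard(key)
            return True
        return False


def replay(master, journal):
    """Replay entries in order, failing on the first unrecognized one."""
    for sequence_number, entry in enumerate(journal):
        if not master.process_entry(entry):
            raise UnrecognizedEntryError(
                f"Unrecognized journal entry: sequence_number: "
                f"{sequence_number} type: {entry['type']}"
            )
    return master.tables


JOURNAL = [
    {"type": "add_table", "db_name": "DB_NAME", "table_name": "TABLE_NAME"},
    {"type": "remove_table", "db_name": "DB_NAME", "table_name": "TABLE_NAME"},
]

if __name__ == "__main__":
    print(replay(TableMasterModel(handle_remove_table=True), JOURNAL))
    try:
        replay(TableMasterModel(handle_remove_table=False), JOURNAL)
    except UnrecognizedEntryError as e:
        print("replay failed:", e)
```

With the handler enabled, replay completes and the table set ends up empty. With it disabled, replay stops at the `remove_table` entry, which mirrors the error in the log above.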

Urgency
Blocker. Without a fix, restarting the Alluxio master crashes the cluster and forces the journal to be formatted.

hycsam commented 3 years ago

/assign @hycsam