apache / incubator-pegasus

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store
https://pegasus.apache.org/
Apache License 2.0

Duplication gets stuck at status DS_APP when duplicating with checkpoints #2025

Closed: empiredan closed this issue 1 month ago

empiredan commented 1 month ago

First, create a table named test_dup1 on the source cluster of the duplication and write two records into it:

>>> use test_dup1
OK
>>> full_scan
partition: all
hash_key_filter_type: no_filter
sort_key_filter_type: no_filter
value_filter_type: no_filter
max_count: -1
timout_ms: 5000
detailed: false
no_value: false

"abc" : "def" => "ghi"
"1" : "2" => "3"

2 key-value pairs got.

Then, still on the source cluster, add a new duplication with checkpoint duplication enabled, specifying the remote table name and the remote replica count:

>>> add_dup test_dup1 target_cluster -s -a new1_test -r 3 
trying to add duplication [app_name: test_dup1, remote_cluster_name: target_cluster, is_duplicating_checkpoint: true, remote_app_name: new1_test, remote_replica_count: 3]
adding duplication succeed [app_name: test_dup1, remote_cluster_name: target_cluster, appid: 5, dupid: 1716542517, checkpoint: true, remote_app_name: new1_test, remote_replica_count: 3]

Even after a long time, the duplication remains at status DS_APP (see https://pegasus.apache.org/administration/duplication for details on duplication statuses):

>>> query_dup test_dup1 -d
duplications of app [test_dup1] in detail:
{"1":{"create_ts":"2024-05-24 17:21:57","dupid":1716542517,"fail_mode":"FAIL_SLOW","remote":"target_cluster","remote_app_name":"new1_test","remote_replica_count":3,"status":"DS_APP"},"appid":5}

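The meta server of the source cluster logs the following error. Note that it queries the follower app as target_cluster.test_dup1, i.e. under the source table name, rather than under the specified remote table name new1_test: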

E2024-05-24 17:38:41.263 (1716543521263163609 1dc5)   meta.meta_state0.0103000000000122: meta_duplication_service.cpp:579:operator()(): query follower app[target_cluster.test_dup1] replica configuration completed, result: duplication_status = DS_APP, query_err = ERR_OBJECT_NOT_FOUND, update_err = ERR_NO_NEED_OPERATE
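The log suggests that while a duplication sits at DS_APP, the meta server repeatedly queries the follower table's replica configuration and leaves the status unchanged when that query fails. The C++ sketch below is a hypothetical model of that check under stated assumptions. It is not Pegasus's code, nor the change made in the fixing PR. The names dup_status, dup_info, follower_app_name and check_follower are invented for illustration. A set of table names stands in for the follower cluster. The rule that finding the follower table advances the status straight to DS_LOG is also an assumption. The sketch resolves the follower name from remote_app_name and falls back to the source table name only when no remote name was given.

```cpp
#include <set>
#include <string>

// Hypothetical subset of duplication statuses relevant to checkpoint
// duplication; the real enum is defined in Pegasus's IDL and may differ.
enum class dup_status { DS_PREPARE, DS_APP, DS_LOG };

// Hypothetical duplication record holding fields printed by query_dup.
struct dup_info {
    std::string app_name;         // source table, e.g. "test_dup1"
    std::string remote_cluster;   // e.g. "target_cluster"
    std::string remote_app_name;  // e.g. "new1_test"; empty if not specified
    dup_status status;
};

// Name under which the follower table should be looked up: the remote table
// name passed to add_dup if any, otherwise the source table name.
std::string follower_app_name(const dup_info &dup) {
    return dup.remote_app_name.empty() ? dup.app_name : dup.remote_app_name;
}

// Hypothetical stand-in for the follower cluster: the set of its table names.
using follower_tables = std::set<std::string>;

// One round of the DS_APP check. If the follower table is missing (analogous
// to query_err = ERR_OBJECT_NOT_FOUND), the status stays at DS_APP; otherwise
// it advances to DS_LOG (an assumed simplification). Returns true only when
// the status changed.
bool check_follower(dup_info &dup, const follower_tables &tables) {
    if (dup.status != dup_status::DS_APP) {
        return false;  // only duplications at DS_APP are checked here
    }
    if (tables.count(follower_app_name(dup)) == 0) {
        return false;  // follower table not found: no status update
    }
    dup.status = dup_status::DS_LOG;
    return true;
}
```

In this model, looking up dup.app_name instead of follower_app_name(dup) reproduces the symptom of this issue: with the follower table created as new1_test, a lookup of test_dup1 fails on every round and the status never leaves DS_APP.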

empiredan commented 1 month ago

This issue has been fixed by https://github.com/apache/incubator-pegasus/pull/2026.