apache / incubator-pegasus

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store
https://pegasus.apache.org/
Apache License 2.0
1.96k stars, 310 forks

Bug(duplication): some nodes never start plog GC after computer room failure #2015

Open ninsmiracle opened 1 month ago

ninsmiracle commented 1 month ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? The computer room that hosts our duplication master cluster had an accident, and most of the nodes in that room shut down within a short time. After all the nodes were alive again, we found that some partitions of the duplication table never GC their private log (plog) anymore.

  2. What did you expect to see? All partitions should GC their plog correctly.

  3. What did you see instead? stdout (error log):

    // stdout
    90146:E2024-05-14 15:59:52.512 (1715673592512665104 67086) replica.default8.040005fe0319646c: nfs_server_impl.cpp:221:on_get_file_size(): {nfs_service} get stat of file /home/work/ssd2/pegasus/alsgsrv-monetization-master/replica/reps/8.53.pegasus/plog/log.18129.608864535790 failed, err = No such file or directory

    We can see that this replica is requesting an old plog file.

Because these partitions cannot clear their plog as normal, the disk keeps filling up, and we have to clear the plog ourselves from time to time.
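To find which partitions are affected, the replica server logs can be scanned for the error shown above. The following is only a minimal sketch: the regular expression is derived from the single log line quoted in this report, the directory name `8.53.pegasus` is assumed to mean app id 8 and partition index 53, and `parse_missing_plog` and `count_by_partition` are hypothetical helper names, not part of Pegasus.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# Assumed pattern, based only on the one error line quoted in this report.
# The replica directory is assumed to be named "<app_id>.<partition>.pegasus".
MISSING_PLOG = re.compile(
    r"get stat of file "
    r"(?P<path>\S+/reps/(?P<app_id>\d+)\.(?P<partition>\d+)\.pegasus"
    r"/plog/(?P<name>log\.\d+\.\d+))"
    r" failed, err = No such file or directory"
)


class MissingPlog(NamedTuple):
    app_id: int
    partition: int
    file_name: str
    path: str


def parse_missing_plog(line: str) -> Optional[MissingPlog]:
    """Return details of a missing-plog error, or None if the line is unrelated."""
    m = MISSING_PLOG.search(line)
    if m is None:
        return None
    return MissingPlog(
        app_id=int(m["app_id"]),
        partition=int(m["partition"]),
        file_name=m["name"],
        path=m["path"],
    )


def count_by_partition(lines: Iterable[str]) -> Counter:
    """Count missing-plog errors per (app_id, partition) pair."""
    counts: Counter = Counter()
    for line in lines:
        hit = parse_missing_plog(line)
        if hit is not None:
            counts[(hit.app_id, hit.partition)] += 1
    return counts
```

Feeding the lines of a replica server log file to `count_by_partition` gives a rough list of the partitions that keep asking for plog files that no longer exist.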

  4. What version of Pegasus are you using? Pegasus v2.4