share log calculated size is unreasonable or the shared log may be damaged

foreverneverer commented 4 years ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

What did you do?

One replica server went down, so we manually re-added it to the cluster.
We ran the following commands to add the node:
- remote-command -t meta-server meta.lb.only_move_primary true
- set_meta_level lively

What did you expect to see?

The node server restarts without any errors.

What did you see instead?

The perfcounter reported that the shared log was too large: 25853163 (MB) > 50000.

The log showed the following errors as soon as the node server restarted:

mutation_log.cpp:2057:read_next_log_block(): read data block body failed, size = 328 vs 676, err = ERR_HANDLE_EOF
replica_stub.cpp:552:initialize():some shared log state must be lost, smax(1301076891) vs pmax(1301079680)
replica_stub.cpp:565:initialize(): logs are not complete for some replicas, which means that shared log is truncated, mark all replicas as inactive
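
The first error line reads as though the log file ends partway through a block: fewer bytes were available (328) than the block body required (676). A minimal sketch of that kind of check follows. The layout (a 4-byte length header followed by the body), the `read_result` codes, and the helper names are assumptions made for illustration; they are not Pegasus's actual on-disk format or API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical result codes for this sketch (not Pegasus's error_code type).
enum class read_result { ok, eof_in_header, eof_in_body };

// Hypothetical layout: a 4-byte body length (host byte order) followed by the body.
// Serializes one block and appends it to `file`.
void append_block(std::vector<uint8_t>& file, const std::string& body)
{
    const uint32_t body_len = static_cast<uint32_t>(body.size());
    uint8_t header[sizeof(body_len)];
    std::memcpy(header, &body_len, sizeof(body_len));
    file.insert(file.end(), header, header + sizeof(body_len));
    file.insert(file.end(), body.begin(), body.end());
}

// Reads the block starting at `offset`. On success, stores the body in `body_out`
// and advances `offset`; on failure, leaves both untouched.
read_result read_block(const std::vector<uint8_t>& file,
                       std::size_t& offset,
                       std::string& body_out)
{
    uint32_t body_len = 0;
    if (offset > file.size() || file.size() - offset < sizeof(body_len))
        return read_result::eof_in_header;
    std::memcpy(&body_len, file.data() + offset, sizeof(body_len));

    const std::size_t body_start = offset + sizeof(body_len);
    if (file.size() - body_start < body_len)
        return read_result::eof_in_body; // the file ends before the promised body does

    body_out.assign(file.begin() + body_start, file.begin() + body_start + body_len);
    offset = body_start + body_len;
    return read_result::ok;
}
```

Writing a 676-byte block with `append_block`, truncating the buffer to the header plus 328 bytes, and calling `read_block` returns `read_result::eof_in_body`, which mirrors the shape of the reported failure.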

What version of Pegasus are you using?

pegasus-server-1.12.3-a948e89-glibc2.12-release.tar.gz

Suggestion

- Use dassert instead of derror if the shared log is found to be damaged when the node server restarts.
- Clean up the node and then restart it.

neverchanje commented 4 years ago

> What did you expect to see? The node server can be restarted and no any error.
>
> What did you see instead? the perfcounter report the shared log too large 25853163(MB) > 50000 the log show error as soon as when the node server restart:

So what was the consequence of the oversized shared log? Did it leave the cluster unable to serve requests, or was the replica server unable to restart?

neverchanje commented 4 years ago

I think we can consider mocking this case in the replica's unit tests by intentionally appending some mutations only to the plog, without appending them to the slog. Then restart the server and see what happens.
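
The scenario can be modelled outside the real code base as a rough sketch. The toy below assumes that smax and pmax in the error message denote the highest decree recorded in the shared log and the private log respectively, and that the restart check fails when smax is lower than pmax. `toy_replica`, `max_decree`, and `shared_log_complete` are hypothetical names, not Pegasus APIs.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy stand-in for a replica with two logs; each log is just a list of decrees.
struct toy_replica
{
    std::vector<int64_t> plog; // private log (hypothetical model)
    std::vector<int64_t> slog; // shared log (hypothetical model)

    // Normal path: a mutation is recorded in both logs.
    void append_both(int64_t decree)
    {
        plog.push_back(decree);
        slog.push_back(decree);
    }

    // Fault injection: the mutation reaches the private log only.
    void append_plog_only(int64_t decree) { plog.push_back(decree); }
};

// Highest decree in a log, or 0 for an empty log.
int64_t max_decree(const std::vector<int64_t>& log)
{
    return log.empty() ? 0 : *std::max_element(log.begin(), log.end());
}

// Assumed restart check: the shared log is complete only if it has seen
// every decree that the private log has (smax >= pmax).
bool shared_log_complete(const toy_replica& r)
{
    return max_decree(r.slog) >= max_decree(r.plog);
}
```

After `append_both` for decrees 1 to 3 the check passes; one further `append_plog_only(4)` makes `shared_log_complete` return false, which is the state a real unit test would need to reproduce before restarting the replica.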

apache / incubator-pegasus

share log calculated size is unreasonable or the shared log may be damaged #552
