apache / incubator-pegasus

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store
https://pegasus.apache.org/
Apache License 2.0

Shared log calculated size is unreasonable, or the shared log may be damaged #552

Open foreverneverer opened 4 years ago

foreverneverer commented 4 years ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?

  2. What did you expect to see? The node server can be restarted without any errors.

  3. What did you see instead?

    • The perfcounter reported that the shared log was too large: 25853163 (MB) > 50000.
    • The log showed errors as soon as the node server restarted:
      mutation_log.cpp:2057:read_next_log_block(): read data block body failed, size = 328 vs 676, err = ERR_HANDLE_EOF
      replica_stub.cpp:552:initialize():some shared log state must be lost, smax(1301076891) vs pmax(1301079680)
      replica_stub.cpp:565:initialize(): logs are not complete for some replicas, which means that shared log is truncated, mark all replicas as inactive

  4. What version of Pegasus are you using? pegasus-server-1.12.3-a948e89-glibc2.12-release.tar.gz

  5. Suggestion

    • Use dassert instead of derror if the shared log is found to be damaged when the node server restarts.
    • Clean up the node and then restart it.

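
The first error above ("size = 328 vs 676, err = ERR_HANDLE_EOF") reads like a block whose header declares more body bytes than the file still holds. The following is a minimal sketch of that kind of check, not the actual Pegasus code: the 4-byte length prefix, `read_result`, `read_block`, and `make_block` are all hypothetical names and layouts invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical result codes; the real server reports ERR_HANDLE_EOF.
enum class read_result { ok, eof };

// Assumed block layout (illustration only, not the real log format):
// a 4-byte body length in host byte order, followed by the body.
read_result read_block(const std::vector<uint8_t> &file,
                       size_t offset,
                       std::vector<uint8_t> &body)
{
    uint32_t expected = 0;
    // Not even a full header left in the file.
    if (offset > file.size() || file.size() - offset < sizeof(expected)) {
        return read_result::eof;
    }
    std::memcpy(&expected, file.data() + offset, sizeof(expected));

    // Fewer body bytes on disk than the header declares: a truncated tail,
    // analogous to "size = 328 vs 676" in the report.
    const size_t available = file.size() - offset - sizeof(expected);
    if (available < expected) {
        return read_result::eof;
    }

    const auto first =
        file.begin() + static_cast<std::ptrdiff_t>(offset + sizeof(expected));
    body.assign(first, first + static_cast<std::ptrdiff_t>(expected));
    return read_result::ok;
}

// Builds a one-block file whose header declares `declared` body bytes
// but which actually contains `actual` body bytes.
std::vector<uint8_t> make_block(uint32_t declared, size_t actual)
{
    std::vector<uint8_t> file(sizeof(declared) + actual, 0xAB);
    std::memcpy(file.data(), &declared, sizeof(declared));
    return file;
}
```

With these helpers, `read_block(make_block(676, 328), 0, body)` returns `read_result::eof`, while a block built with `make_block(676, 676)` is read in full.
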
neverchanje commented 4 years ago

What did you expect to see? The node server can be restarted without any errors.

What did you see instead? The perfcounter reported that the shared log was too large: 25853163 (MB) > 50000. The log showed errors as soon as the node server restarted.

So what was the consequence of the oversized shared log? Did it leave the cluster unable to serve requests? Or was the replica server unable to restart?

neverchanje commented 4 years ago

I think we can consider mocking this case in the replica's unit tests by intentionally appending some mutations only to the plog (private log), without appending them to the slog (shared log). Then let the server restart and see what happens.
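
A rough sketch of what such a test could model is shown here. It is a toy stand-in rather than the real replica or mutation_log classes: `toy_replica_logs` and its members are hypothetical, and the completeness rule (the shared log's max decree must not be behind the private log's) is only an assumption inferred from the "smax vs pmax" message in the report.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using decree = int64_t;

// Toy model of one replica's view of its two logs (hypothetical type).
struct toy_replica_logs
{
    std::vector<decree> plog; // decrees written to the private log
    std::vector<decree> slog; // decrees written to the shared log

    // Normal write path in this model: the mutation reaches both logs.
    void append_both(decree d)
    {
        plog.push_back(d);
        slog.push_back(d);
    }

    // Fault injection: the mutation reaches the private log only.
    void append_plog_only(decree d) { plog.push_back(d); }

    static decree max_decree(const std::vector<decree> &log)
    {
        return log.empty() ? 0 : *std::max_element(log.begin(), log.end());
    }

    // Assumed restart check: the shared log counts as complete only if
    // smax >= pmax.
    bool shared_log_complete_on_restart() const
    {
        return max_decree(slog) >= max_decree(plog);
    }

    // How far the shared log lags behind the private log, in decrees.
    decree missing_decrees() const
    {
        return std::max<decree>(0, max_decree(plog) - max_decree(slog));
    }
};
```

In this model, appending decrees 1 to 5 to both logs and then decrees 6 and 7 to the plog only makes `shared_log_complete_on_restart()` return false with a gap of 2. By the same arithmetic, the values in the report (smax 1301076891, pmax 1301079680) correspond to a gap of 2789 decrees.
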