WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0

Mute/unmute overhead interferes with checkpoint barriers #3120

Open slfritchie opened 4 years ago


Is this a bug, feature request, or feedback?

Bug

What is the current behavior?

Intermittent crash during checkpoint processing

What is the expected behavior?

No crash

What OS and version of Wallaroo are you using?

Ubuntu Bionic/18.04 LTS + Wallaroo @ commit 35d2038

Steps to reproduce?

See README.md in tarball at http://wallaroolabs-dev.s3.amazonaws.com/scott/count2.tar.gz. Instructions include options for building & running a demonstration test via a VM or Docker.

reset.sh
start-cluster.sh 4

Running these two commands occasionally yields a crash a few seconds after the start-cluster.sh script finishes. See full logs at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1583892856.tar.gz. On a 1 CPU/5GB RAM virtual machine, the crash seems to happen roughly 50% of the time.

The crash becomes more likely as the cluster size increases, and it always seems to occur during the 2nd checkpoint operation.

$ tail /tmp/wallaroo.2
1583892717.934412,Unmuting DataChannel
1583892717.934418,Unmuting DataChannel
1583892717.934425,Unmuting DataChannel
1583892717.934431,Unmuting DataChannel
1583892718.090417,Sent control message to initializer: EventLogAckCheckpointMsg
1583892718.091238,Sent control message to initializer: WorkerAckBarrierMsg
1583892718.102630,Sent control message to initializer: EventLogAckCheckpointIdWrittenMsg
1583892719.111070,ERROR,Step,Invariant violation: received barrier CheckpointBarrierToken(2) is greater than current barrier CheckpointBarrierToken(1) at Step 193591313640807744353045639962347611769

Invariant violated in /build2/.deps/wallaroolabs/wallaroo/lib/wallaroo/core/step/step_phase.pony at line 219
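
For anyone triaging several log files from repeated runs, a small helper like the one below might make it easier to pull out the barrier mismatch. This is only a sketch: it assumes every violation line has exactly the format of the ERROR line in the log excerpt above, and the names `parse_violation`, `find_violations` and `BarrierViolation` are made up for illustration and are not part of Wallaroo.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed format: taken from the single ERROR line in the log excerpt above.
# Other Wallaroo versions may word this message differently.
VIOLATION_RE = re.compile(
    r"^(?P<ts>\d+\.\d+),ERROR,Step,Invariant violation: "
    r"received barrier CheckpointBarrierToken\((?P<received>\d+)\) "
    r"is greater than current barrier CheckpointBarrierToken\((?P<current>\d+)\) "
    r"at Step (?P<step>\d+)$"
)


class BarrierViolation(NamedTuple):
    """Hypothetical record type for one parsed violation line."""
    timestamp: float
    received: int   # checkpoint id of the barrier that arrived
    current: int    # checkpoint id the step was still processing
    step_id: int


def parse_violation(line: str) -> Optional[BarrierViolation]:
    """Return the parsed violation, or None if the line does not match."""
    m = VIOLATION_RE.match(line.strip())
    if m is None:
        return None
    return BarrierViolation(
        timestamp=float(m["ts"]),
        received=int(m["received"]),
        current=int(m["current"]),
        step_id=int(m["step"]),
    )


def find_violations(lines: Iterable[str]) -> List[BarrierViolation]:
    """Collect every matching violation from an iterable of log lines."""
    return [v for v in map(parse_violation, lines) if v is not None]
```

Usage would be something like `find_violations(open("/tmp/wallaroo.2"))`, repeated for each worker's log, to check whether the received/current pair is always 2 and 1 as in the run above.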