WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0

Hang during multi-worker crash + recovery #3109

Open slfritchie opened 4 years ago

Is this a bug, feature request, or feedback?

Bug

What is the current behavior?

The cluster hangs during multi-worker crash + recovery. See the master-crasher.sh command under "Steps to reproduce?" below to trigger it.

Last messages from each worker that match the regexp --|~~|RECOVERY|Recovery|recovery:

initializer:
1580343036.556997,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!

worker1:
1580343036.555562,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!

worker2:
1580343036.768260,|~~ - Recovery initiated at worker5. Ceding control. - ~~|
1580343036.768268,_RecoveryPhase transition to _NotRecovering
1580343036.768335,Sent control message to worker5: AckRecoveryInitiatedMsg
1580343036.768343,Recovering worker: Skipping Phase III
1580343036.768347,|~~ INIT PHASE IV: Cluster is ready to work! ~~|

worker3:
1580343036.558269,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!

worker4:
1580343036.571309,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!

worker5:
1580343036.771454,Received msg on Control Channel: AckRecoveryInitiatedMsg

Full logs are available at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1580344129.tar.gz
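
For anyone digging through the full logs, the per-worker excerpts above can be reproduced with something like the sketch below. The function name `last_recovery_msgs` and the log file names in the usage line are hypothetical (the layout of the tarball is not described here); only the regexp is taken from this report.

```shell
# Hypothetical helper: for each log file given, print its name and the
# last N lines matching the regexp quoted in this report.
last_recovery_msgs() {
    n="$1"; shift
    for f in "$@"; do
        echo "$f:"
        # -e stops grep from parsing the pattern's leading "--" as an option
        grep -E -e '--|~~|RECOVERY|Recovery|recovery' "$f" | tail -n "$n"
        echo
    done
}
```

Usage, assuming one log file per worker with made-up names: `last_recovery_msgs 5 initializer.log worker1.log worker2.log`.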

What is the expected behavior?

The cluster recovers successfully after a multi-worker crash and restart.

What OS and version of Wallaroo are you using?

Ubuntu 16.04 LTS/Xenial and master branch @ commit master

Steps to reproduce?

cd /path/to/wallaroo/testing/correctness/scripts/effectively-once
make -C ../../../.. resilience=on PONYCFLAGS="--verbose=1 -d -Dtrace -Dcheckpoint_trace -Didentify_routing_ids" build-examples-pony-passthrough build-testing-tools-external_sender
./master-crasher.sh 6 crash5 crash{0,1,2,3,4}.slow crash-sink no-ack-progress

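I see this error roughly once every 1-3 hours, so it's rare. If you're short on disk space, I recommend running the following loop to limit the number of log files that consume space in /tmp. Every 15 seconds it prints the date and the free space in /tmp, then deletes all but the 20 newest /tmp/wallaroo.*gz files.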

while true; do echo -n "$(date) "; df -h /tmp; ls -t /tmp/wallaroo.*gz | sed 1,20d | xargs rm -f; sleep 15; done