Hang during multi-worker crash + recovery. See master-crasher.sh command below to run.
Last messages from each worker that match the regexp --|~~|RECOVERY|Recovery|recovery:
initializer:
1580343036.556997,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!
worker1:
1580343036.555562,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!
worker2:
1580343036.768260,|~~ - Recovery initiated at worker5. Ceding control. - ~~|
1580343036.768268,_RecoveryPhase transition to _NotRecovering
1580343036.768335,Sent control message to worker5: AckRecoveryInitiatedMsg
1580343036.768343,Recovering worker: Skipping Phase III
1580343036.768347,|~~ INIT PHASE IV: Cluster is ready to work! ~~|
worker3:
1580343036.558269,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!
worker4:
1580343036.571309,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!
worker5:
1580343036.771454,Received msg on Control Channel: AckRecoveryInitiatedMsg
Successful restart after multi-worker crash & restart
What OS and version of Wallaroo are you using?
Ubuntu 16.04 LTS/Xenial and master branch @ commit master
Steps to reproduce?
cd /path/to/wallaroo/testing/correctness/scripts/effectively-once
make -C ../../../.. resilience=on PONYCFLAGS="--verbose=1 -d -Dtrace -Dcheckpoint_trace -Didentify_routing_ids" build-examples-pony-passthrough build-testing-tools-external_sender
./master-crasher.sh 6 crash5 crash{0,1,2,3,4}.slow crash-sink no-ack-progress
I see this error happening every 1-3 hours, so it's rare. If you're short on disk space, then I recommend running the following to reduce the # of log files that consume space in /tmp.
while [ 1 ]; do echo -n `date` "" ; df -h /tmp; ls -t /tmp/wallaroo.*gz | sed 1,20d | xargs rm -f ; sleep 15; done
Is this a bug, feature request, or feedback?
Bug
What is the current behavior?
Hang during multi-worker crash + recovery. See
master-crasher.sh
command below to run.Last messages from each worker that match the regexp
--|~~|RECOVERY|Recovery|recovery
:Full logs are available at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1580344129.tar.gz
What is the expected behavior?
Successful restart after multi-worker crash & restart
What OS and version of Wallaroo are you using?
Ubuntu 16.04 LTS/Xenial and master branch @ commit master
Steps to reproduce?
I see this error happening every 1-3 hours, so it's rare. If you're short on disk space, then I recommend running the following to reduce the # of log files that consume space in /tmp.