WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0
1.48k stars 69 forks source link

Invalid call to checkpoint_complete on _CpRbOps state _CpRbWaitingForCheckpoint #3123

Open slfritchie opened 4 years ago

slfritchie commented 4 years ago

Is this a bug, feature request, or feedback?

Bug

What is the current behavior?

When Wallaroo is compiled with resilience=on and used with a connector sink, if Wallaroo is started before the sink is accepting connections, then Wallaroo will crash at its 2nd checkpoint.

1587415964.414633,DEBUG,ConnectorSink,ECTransition:: _ExtConnConnected -> _ExtConnTwoPCReady
1587415964.414637,DEBUG,2PC,_ExtConnTwoPCReady _advertise_status true
1587415964.414641,DEBUG,ConnectorSink,Send conn_ready to CpRb
1587415964.414646,DEBUG,ConnectorSink,_CpRbInit.conn_ready: acking CheckpointBarrierToken(2)
1587415964.414651,DEBUG,ConnectorSink,Send commit to barrier coordinator for CheckpointBarrierToken(2)
1587415964.414659,DEBUG,ConnectorSink,CpRbTransition:: _CpRbInit -> _CpRbWaitingForCheckpoint
1587415964.414663,DEBUG,ConnectorSink,open_valve
1587415964.414667,DEBUG,ConnectorSink,open_valve: already open
1587415964.414671,DEBUG,ConnectorSink,open_valve: size = 0
1587415964.414675,DEBUG,ConnectorSink,CpRbTransition:: _CpRbInit -> _CpRbWaitingForCheckpoint DONE
1587415964.414714,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(2) Complete
1587415964.414719,CheckpointInitiator: event_log_write_checkpoint_id()
1587415964.414793,EventLog: write_checkpoint_id 2!!
1587415964.414802,EventLog: Writing CheckpointId for CheckpointId 2
1587415964.414823,Transition to _WaitingForCheckpointInitiationEventLogPhase: checkpoint_id 3
1587415964.414840,_WaitingForEventLogIdWrittenPhase: event_log_id_written from initializer for 2
1587415964.415090,_WaitingForEventLogIdWrittenPhase: acked_workers: 1, workers: 1
1587415964.415100,CheckpointInitiator: Checkpoint 2 is complete!
1587415964.415104,Saving current CheckpointId 2 last complete CheckpointId 2 RollbackId 0
1587415964.415764,Fulfilling checkpoint promise
1587415964.415856,DEBUG,2PC,2PC: Checkpoint complete 2 _twopc.txn_id is 
1587415964.415864,CRITICAL,ConnectorSink,Invalid call to checkpoint_complete on _CpRbOps state _CpRbWaitingForCheckpoint
This should never happen: failure in /home/vagrant/wallaroo/lib/wallaroo/core/sink/connector_sink/connector_sink_checkpoint_rollback.pony at line 48

What is the expected behavior?

No crash

What OS and version of Wallaroo are you using?

Ubuntu 18.04/Bionic LTS + master branch @ commit 9b86161a2

Steps to reproduce?

cd testing/correctness/scripts/effectively-once
export WALLAROO_TOP=$HOME/wallaroo ; # Edit to fit your environment
export WALLAROO_BIN=$WALLAROO_TOP/examples/pony/passthrough/passthrough
export PATH=${PATH}:.
export PYTHONPATH=$WALLAROO_TOP/machida/lib
export WALLAROO_THRESHOLDS='*.8' # Turns on verbose logging @ debug level
make -C ../../../.. \
    PONYCFLAGS="--verbose=1 -d -Dresilience -Dtrace -Dcheckpoint_trace -Didentify_routing_ids" \
    build-examples-pony-passthrough build-testing-tools-external_sender \
    build-utils-cluster_shrinker build-utils-data_receiver \
    build-testing-tools-fixed_length_message_blaster

. ./sample-env-vars.sh
./reset.sh
./start-initializer.sh -n 1
sleep 3
$WALLAROO_TOP/testing/correctness/tests/aloc_sink/aloc_sink /tmp/sink-out.output /tmp/sink-out.abort 7200 > /tmp/sink-out.stdout-stderr 2>&1 &