WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0
1.48k stars 68 forks source link

Wallaroo rolls back to checkpoint previously aborted at ConnectorSink #3032

Closed slfritchie closed 5 years ago

slfritchie commented 5 years ago

Is this a bug, feature request, or feedback?

Bug

What is the current behavior?

Checkpoint #4 has been aborted at the ConnectorSink. Later, Wallaroo elects to roll back to checkpoint #4; the ConnectorSink panics when it receives an invalid message_id in a 2PC phase 1 message.

In the CRITICAL panic message below:

What is the expected behavior?

Proper coordination of committed & aborted ConnectorSink transactions with Wallaroo global state.

What OS and version of Wallaroo are you using?

Ubuntu Xenial/16.04 LTS + aloc-tee-source+scripts+autoscale-checkpoint+tcp-source-sink branch @ commit abb7b23f9

Steps to reproduce?

$ cd /top/of/wallaroo/repo
$ git checkout -f 7ab14750cd ; echo We need the passthrough app from another branch
$ mv examples/pony/passthrough 
$ git checkout -f COMMIT-ABOVE
$ mv passthrough examples/pony
$ dd if=$HOME/wallaroo/testing/data/market_spread/nbbo/r3k-symbols_nbbo-fixish.msg bs=1000000 count=1 | od -x | sed 's/^/T/' | sed -n '1,/T3641060/p' | perl -ne 'print "\0\0\0"; print "1"; print' > /tmp/input-file.txt
$ cd testing/correctness/scripts/effectively-once
$ make -C ../../../..     PONYCFLAGS="--verbose=1 -d -Dresilience -Dtrace -Dcheckpoint_trace -Didentify_routing_ids"     build-examples-pony-passthrough build-testing-tools-external_sender     build-utils-cluster_shrinker build-utils-data_receiver     build-testing-tools-fixed_length_message_blaster

$ ./master-crasher.sh 4 crash2 crash0
...
,,c0,r0,c0,r0c2,,r2,,c2r2,c0r0,c2,r2c0,r0,,,c2c0,,r2r0,,c0,r0,c2,r2,c0,c2r2r0,,
2019-10-10 16:34:34,147 CRITICAL [aloc_sink:434] 2PC: Phase 1 invalid end_por 115689 self._out_offset 50666 file size 50666 msg2 TwoPCPhase1(txn_id='rollback--Passthrough-w-worker3-id-247008000284870596570318168641095230061:c_id=6',where_list=[(1, 105693, 115689)])
SANITY
SANITY LOOP FAILURE: pause the world
slfritchie commented 5 years ago

See full logs at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.20191009.1570725284.tar.gz

slfritchie commented 5 years ago

Note: there are two checkpoint #4s in this history, fun.

$ zegrep 'CheckpointInitiator: Checkpoint [0-9]* is complete|Injecting' tmp/wallaroo.0* | sort -t : -k 2 -n
tmp/wallaroo.0.1570725249.gz:1570725246.226691,Injecting barrier CheckpointBarrierToken(1)
tmp/wallaroo.0.1570725249.gz:1570725246.878493,CheckpointInitiator: Checkpoint 1 is complete!
tmp/wallaroo.0.1570725249.gz:1570725247.880961,Injecting barrier CheckpointBarrierToken(2)
tmp/wallaroo.0.1570725249.gz:1570725248.555408,CheckpointInitiator: Checkpoint 2 is complete!
tmp/wallaroo.0.1570725251.gz:1570725249.890674,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 1, Checkpoint 2)
tmp/wallaroo.0.1570725251.gz:1570725250.627333,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 1, Checkpoint 2)
tmp/wallaroo.0.1570725251.gz:1570725250.663256,Injecting barrier CheckpointBarrierToken(3)
tmp/wallaroo.0.1570725257.gz:1570725252.307384,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 2, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725254.862264,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 2, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725254.898435,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 3, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725255.641009,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 3, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725255.660805,Injecting barrier CheckpointBarrierToken(3)
tmp/wallaroo.0.1570725257.gz:1570725256.390821,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 4, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725257.130985,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 4, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725257.145936,Injecting barrier CheckpointBarrierToken(4)
tmp/wallaroo.0.1570725260.gz:1570725257.678721,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 5, Checkpoint 2)
tmp/wallaroo.0.1570725260.gz:1570725258.404585,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 5, Checkpoint 2)
tmp/wallaroo.0.1570725260.gz:1570725258.438930,Injecting barrier CheckpointBarrierToken(3)
tmp/wallaroo.0.1570725260.gz:1570725260.118152,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 6, Checkpoint 2)
tmp/wallaroo.0.1570725264.gz:1570725261.287522,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 7, Checkpoint 2)
tmp/wallaroo.0.1570725264.gz:1570725262.019931,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 7, Checkpoint 2)
tmp/wallaroo.0.1570725264.gz:1570725262.049630,Injecting barrier CheckpointBarrierToken(3)
tmp/wallaroo.0.1570725264.gz:1570725262.531042,CheckpointInitiator: Checkpoint 3 is complete!
tmp/wallaroo.0.1570725264.gz:1570725263.535424,Injecting barrier CheckpointBarrierToken(4)
tmp/wallaroo.0.1570725268.gz:1570725266.692689,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 8, Checkpoint 3)
tmp/wallaroo.0.1570725268.gz:1570725267.420322,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 8, Checkpoint 3)
tmp/wallaroo.0.1570725268.gz:1570725267.458989,Injecting barrier CheckpointBarrierToken(4)
tmp/wallaroo.0.1570725268.gz:1570725267.940966,CheckpointInitiator: Checkpoint 4 is complete!
tmp/wallaroo.0.1570725272.gz:1570725269.570508,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 9, Checkpoint 4)
tmp/wallaroo.0.1570725272.gz:1570725270.302766,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 9, Checkpoint 4)
tmp/wallaroo.0.1570725272.gz:1570725270.330978,Injecting barrier CheckpointBarrierToken(5)
tmp/wallaroo.0.1570725272.gz:1570725271.146685,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 10, Checkpoint 4)
tmp/wallaroo.0.1570725272.gz:1570725271.884522,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 10, Checkpoint 4)
tmp/wallaroo.0.1570725272.gz:1570725271.922086,Injecting barrier CheckpointBarrierToken(6)
tmp/wallaroo.0.1570725272.gz:1570725272.409086,CheckpointInitiator: Checkpoint 6 is complete!
tmp/wallaroo.0:1570725273.421824,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 11, Checkpoint 6)
tmp/wallaroo.0:1570725274.149319,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 11, Checkpoint 6)
tmp/wallaroo.0:1570725274.185290,Injecting barrier CheckpointBarrierToken(7)

And they have different points_of_reference/offsets, ouch.

1570725263.580785,DEBUG,2PC,2PC: Checkpoint state 4 at ConnectorSink 247008000284870596570318168641095230061, txn-id  current_offset 105693 acked_point_of_ref 0 _pending_writev_total 0
1570725267.472202,DEBUG,2PC,2PC: Checkpoint state 4 at ConnectorSink 247008000284870596570318168641095230061, txn-id  current_offset 50666 acked_point_of_ref 0 _pending_writev_total 0

And FileBackend.start_rollback() fetches the first one that it finds in the eventlog file. The problem is that the first one may not have been committed. In this ticket's case, the first one as aborted.

slfritchie commented 5 years ago

Note also that checkpoint #3 appears 4 times in the event log: 1st-3rd are aborted by the global protocol, the 4th is globally committed. Later, we can see that there is eventually a roll back to a checkpoint #3.

1570725250.685479,DEBUG,2PC,2PC: Checkpoint state 3 at ConnectorSink 247008000284870596570318168641095230061, txn-id  current_offset 50666 acked_point_of_ref 0 _pending_writev_total 0
1570725255.672551,DEBUG,2PC,2PC: Checkpoint state 3 at ConnectorSink 247008000284870596570318168641095230061, txn-id  current_offset 50666 acked_point_of_ref 0 _pending_writev_total 0
1570725258.455899,DEBUG,2PC,2PC: Checkpoint state 3 at ConnectorSink 247008000284870596570318168641095230061, txn-id  current_offset 50666 acked_point_of_ref 0 _pending_writev_total 0
1570725262.069520,DEBUG,2PC,2PC: Checkpoint state 3 at ConnectorSink 247008000284870596570318168641095230061, txn-id  current_offset 50666 acked_point_of_ref 0 _pending_writev_total 0

Here's the rollback entry for CP 3, as logged by initializer. Note the timestamp: it is 4.63 seconds after the 4th checkpoint entry was written.

1570725266.692689,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 8, Checkpoint 3)
slfritchie commented 5 years ago

@jtfmumm: Possible fix @ commit 2739471f3, thoughts?

jtfmumm commented 5 years ago

@slfritchie I think the problem could be that we never rethought checkpoint id assignment in light of the addition of checkpoint abort support. It seems to me that the checkpoint id should be incremented both in the success and failure case, but we are only doing it in the success case. I think we should look at both the checkpoint ids as signifying rounds (as we do with the rollback ids). So what we are calling _last_complete_checkpoint_id in CheckpointInitiator should mean the last completed round, not necessarily the last successful checkpoint. An abort completes a round, even if it doesn’t involve committing a checkpoint.

jtfmumm commented 5 years ago

This means that we need to represent two separate concepts: (1) last successful checkpoint that we can roll back to and (2) current round id. So we'll need to have discussions around this before implementing a fix.

slfritchie commented 5 years ago

Seen again today on today's master branch at commit 4323b71 + connector-sink-fixes branch/PR #3042 using ./master-crasher.sh 2 crash0 crash1 crash-sink, see logs at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1573071315.tar.gz.

Noteworthy:

$ zgrep 'Checkpoint Barrier CheckpointBarrierToken.*Complete' /tmp/wallaroo.0.* /tmp/wallaroo.0
/tmp/wallaroo.0.1573071008.gz:1573071005.194525,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(63) Complete
/tmp/wallaroo.0.1573071008.gz:1573071006.715433,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(64) Complete
/tmp/wallaroo.0.1573071008.gz:1573071008.095921,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(66) Complete
/tmp/wallaroo.0.1573071013.gz:1573071011.160774,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(65) Complete
/tmp/wallaroo.0.1573071013.gz:1573071012.671681,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(66) Complete
/tmp/wallaroo.0              :1573071016.352944,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(67) Complete
/tmp/wallaroo.0              :1573071017.879442,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(68) Complete