Closed slfritchie closed 5 years ago
Note: there are two checkpoint #4s in this history, fun.
$ zegrep 'CheckpointInitiator: Checkpoint [0-9]* is complete|Injecting' tmp/wallaroo.0* | sort -t : -k 2 -n
tmp/wallaroo.0.1570725249.gz:1570725246.226691,Injecting barrier CheckpointBarrierToken(1)
tmp/wallaroo.0.1570725249.gz:1570725246.878493,CheckpointInitiator: Checkpoint 1 is complete!
tmp/wallaroo.0.1570725249.gz:1570725247.880961,Injecting barrier CheckpointBarrierToken(2)
tmp/wallaroo.0.1570725249.gz:1570725248.555408,CheckpointInitiator: Checkpoint 2 is complete!
tmp/wallaroo.0.1570725251.gz:1570725249.890674,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 1, Checkpoint 2)
tmp/wallaroo.0.1570725251.gz:1570725250.627333,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 1, Checkpoint 2)
tmp/wallaroo.0.1570725251.gz:1570725250.663256,Injecting barrier CheckpointBarrierToken(3)
tmp/wallaroo.0.1570725257.gz:1570725252.307384,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 2, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725254.862264,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 2, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725254.898435,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 3, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725255.641009,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 3, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725255.660805,Injecting barrier CheckpointBarrierToken(3)
tmp/wallaroo.0.1570725257.gz:1570725256.390821,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 4, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725257.130985,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 4, Checkpoint 2)
tmp/wallaroo.0.1570725257.gz:1570725257.145936,Injecting barrier CheckpointBarrierToken(4)
tmp/wallaroo.0.1570725260.gz:1570725257.678721,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 5, Checkpoint 2)
tmp/wallaroo.0.1570725260.gz:1570725258.404585,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 5, Checkpoint 2)
tmp/wallaroo.0.1570725260.gz:1570725258.438930,Injecting barrier CheckpointBarrierToken(3)
tmp/wallaroo.0.1570725260.gz:1570725260.118152,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 6, Checkpoint 2)
tmp/wallaroo.0.1570725264.gz:1570725261.287522,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 7, Checkpoint 2)
tmp/wallaroo.0.1570725264.gz:1570725262.019931,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 7, Checkpoint 2)
tmp/wallaroo.0.1570725264.gz:1570725262.049630,Injecting barrier CheckpointBarrierToken(3)
tmp/wallaroo.0.1570725264.gz:1570725262.531042,CheckpointInitiator: Checkpoint 3 is complete!
tmp/wallaroo.0.1570725264.gz:1570725263.535424,Injecting barrier CheckpointBarrierToken(4)
tmp/wallaroo.0.1570725268.gz:1570725266.692689,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 8, Checkpoint 3)
tmp/wallaroo.0.1570725268.gz:1570725267.420322,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 8, Checkpoint 3)
tmp/wallaroo.0.1570725268.gz:1570725267.458989,Injecting barrier CheckpointBarrierToken(4)
tmp/wallaroo.0.1570725268.gz:1570725267.940966,CheckpointInitiator: Checkpoint 4 is complete!
tmp/wallaroo.0.1570725272.gz:1570725269.570508,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 9, Checkpoint 4)
tmp/wallaroo.0.1570725272.gz:1570725270.302766,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 9, Checkpoint 4)
tmp/wallaroo.0.1570725272.gz:1570725270.330978,Injecting barrier CheckpointBarrierToken(5)
tmp/wallaroo.0.1570725272.gz:1570725271.146685,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 10, Checkpoint 4)
tmp/wallaroo.0.1570725272.gz:1570725271.884522,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 10, Checkpoint 4)
tmp/wallaroo.0.1570725272.gz:1570725271.922086,Injecting barrier CheckpointBarrierToken(6)
tmp/wallaroo.0.1570725272.gz:1570725272.409086,CheckpointInitiator: Checkpoint 6 is complete!
tmp/wallaroo.0:1570725273.421824,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 11, Checkpoint 6)
tmp/wallaroo.0:1570725274.149319,Injecting barrier CheckpointRollbackResumeBarrierToken(Rollback 11, Checkpoint 6)
tmp/wallaroo.0:1570725274.185290,Injecting barrier CheckpointBarrierToken(7)
And they have different points_of_reference/offsets, ouch.
1570725263.580785,DEBUG,2PC,2PC: Checkpoint state 4 at ConnectorSink 247008000284870596570318168641095230061, txn-id current_offset 105693 acked_point_of_ref 0 _pending_writev_total 0
1570725267.472202,DEBUG,2PC,2PC: Checkpoint state 4 at ConnectorSink 247008000284870596570318168641095230061, txn-id current_offset 50666 acked_point_of_ref 0 _pending_writev_total 0
And FileBackend.start_rollback()
fetches the first one that it finds in the eventlog file. The problem is that the first one may not have been committed. In this ticket's case, the first one as aborted.
Note also that checkpoint #3 appears 4 times in the event log: 1st-3rd are aborted by the global protocol, the 4th is globally committed. Later, we can see that there is eventually a roll back to a checkpoint #3.
1570725250.685479,DEBUG,2PC,2PC: Checkpoint state 3 at ConnectorSink 247008000284870596570318168641095230061, txn-id current_offset 50666 acked_point_of_ref 0 _pending_writev_total 0
1570725255.672551,DEBUG,2PC,2PC: Checkpoint state 3 at ConnectorSink 247008000284870596570318168641095230061, txn-id current_offset 50666 acked_point_of_ref 0 _pending_writev_total 0
1570725258.455899,DEBUG,2PC,2PC: Checkpoint state 3 at ConnectorSink 247008000284870596570318168641095230061, txn-id current_offset 50666 acked_point_of_ref 0 _pending_writev_total 0
1570725262.069520,DEBUG,2PC,2PC: Checkpoint state 3 at ConnectorSink 247008000284870596570318168641095230061, txn-id current_offset 50666 acked_point_of_ref 0 _pending_writev_total 0
Here's the rollback entry for CP 3, as logged by initializer
. Note the timestamp: it is 4.63 seconds after the 4th checkpoint entry was written.
1570725266.692689,Injecting blocking barrier CheckpointRollbackBarrierToken(Rollback 8, Checkpoint 3)
@jtfmumm: Possible fix @ commit 2739471f3, thoughts?
@slfritchie I think the problem could be that we never rethought checkpoint id assignment in light of the addition of checkpoint abort support. It seems to me that the checkpoint id should be incremented both in the success and failure case, but we are only doing it in the success case. I think we should look at both the checkpoint ids as signifying rounds (as we do with the rollback ids). So what we are calling _last_complete_checkpoint_id
in CheckpointInitiator
should mean the last completed round, not necessarily the last successful checkpoint. An abort completes a round, even if it doesn’t involve committing a checkpoint.
This means that we need to represent two separate concepts: (1) last successful checkpoint that we can roll back to and (2) current round id. So we'll need to have discussions around this before implementing a fix.
Seen again today on today's master branch at commit 4323b71 + connector-sink-fixes
branch/PR #3042 using ./master-crasher.sh 2 crash0 crash1 crash-sink
, see logs at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1573071315.tar.gz.
Noteworthy:
$ zgrep 'Checkpoint Barrier CheckpointBarrierToken.*Complete' /tmp/wallaroo.0.* /tmp/wallaroo.0
/tmp/wallaroo.0.1573071008.gz:1573071005.194525,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(63) Complete
/tmp/wallaroo.0.1573071008.gz:1573071006.715433,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(64) Complete
/tmp/wallaroo.0.1573071008.gz:1573071008.095921,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(66) Complete
/tmp/wallaroo.0.1573071013.gz:1573071011.160774,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(65) Complete
/tmp/wallaroo.0.1573071013.gz:1573071012.671681,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(66) Complete
/tmp/wallaroo.0 :1573071016.352944,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(67) Complete
/tmp/wallaroo.0 :1573071017.879442,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(68) Complete
Is this a bug, feature request, or feedback?
Bug
What is the current behavior?
Checkpoint #4 has been aborted at the ConnectorSink. Later, Wallaroo elects to roll back to checkpoint #4; the ConnectorSink panics when it receives an invalid message_id in a 2PC phase 1 message.
In the
CRITICAL
panic message below:What is the expected behavior?
Proper coordination of committed & aborted ConnectorSink transactions with Wallaroo global state.
What OS and version of Wallaroo are you using?
Ubuntu Xenial/16.04 LTS + aloc-tee-source+scripts+autoscale-checkpoint+tcp-source-sink branch @ commit abb7b23f9
Steps to reproduce?