WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0
1.48k stars 69 forks source link

The aloc_sink test needs reimplementation #3108

Closed slfritchie closed 4 years ago

slfritchie commented 4 years ago

Is this a bug, feature request, or feedback?

Bug

What is the current behavior?

The aloc_sink test can fail due to an intermittent false positive on a correctness check.

In the new implementation of ConnectorSink, there is a mandatory requirement to abort a checkpoint if the external sink is disconnected. This requirement helps to create a race condition between when the aloc_sink test triggers a sink abort, when Wallaroo schedules a checkpoint, and when ConnectorSink manages to re-connect to the external sink.

The result is that an unpredicable number of checkpoints may be aborted in between the checkpoints that the aloc_sink test is expecting must be aborted, e.g, transaction IDs 6 and 9. If the race described above several times in a row, then Wallaroo can be forced to abort several checkpoints that are not visible to the sink's 2PC protocol.

For example, after checkpoint 6, the races force aborting of checkpoint 7, 8, and 9. From the sink's perspective, those aborts aren't visible because 2PC commands weren't sent for those checkpoints due to disconnection.

[1580334074.5931427, '1-rollback', 'Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=3', [(1, 172578, 456114)]]
[1580334074.7471983, 'connection-closed', True]
[1580334074.8502636, 'list-uncommitted', ['Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=3']]
[1580334074.855151, '2-rollback', 'Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=3', 172578]
[1580334074.858049, '1-ok', 'Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=4', [(1, 172578, 172674)]]
[1580334075.0124912, '2-ok', 'Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=4', 172674]
[1580334076.3192158, '1-ok', 'Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=5', [(1, 172674, 424130)]]
[1580334076.4715385, '2-ok', 'Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=5', 424130]
[1580334077.8488967, '1-rollback', 'Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=6', [(1, 424130, 724287)]]
[1580334077.849295, 'close', 'before reply']
[1580334077.8501008, 'connection-closed', True]
[1580334077.9540782, 'list-uncommitted', ['Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=6']]
[1580334077.9643042, 'connection-closed', True]
[1580334078.0686924, 'list-uncommitted', ['Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=6']]
[1580334078.0752273, 'connection-closed', True]
[1580334078.1799817, 'list-uncommitted', ['Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=6']]
[1580334078.1840327, 'connection-closed', True]
[1580334078.2877395, 'list-uncommitted', ['Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=6']]
[1580334078.292062, '2-rollback', 'Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=6', 424130]
[1580334078.2936628, '1-ok', 'Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=10', [(1, 424130, 424226)]]
[1580334078.4538226, '2-ok', 'Celsius to Fahrenheit-w-initializer-id-302805485207444854316780936697119742567:c_id=10', 424226]

From Wallaroo's point of view, here is confirmation that checkpoints 7, 8, and 9 were indeed aborted.

1580334077.963814,INFO,ConnectorSink,Checkpoint CheckpointBarrierToken(7) arrived at _CpRbRollingBackResumed.cp_barrier_complete, aborting it
1580334078.074852,INFO,ConnectorSink,Checkpoint CheckpointBarrierToken(8) arrived at _CpRbRollingBackResumed.cp_barrier_complete, aborting it
1580334078.183680,INFO,ConnectorSink,Checkpoint CheckpointBarrierToken(9) arrived at _CpRbRollingBackResumed.cp_barrier_complete, aborting it

Meanwhile, the aloc_sink test is expecting that checkpoint 9 will be aborted via 2PC. This expectation is not met, so the test fails.

We need a new test implementation that is decoupled from exact checkpoint numbers. In the course of refactoring code for PR #3103, I'd realized that expected checkpoint numbers that commit would sometimes cause a false test failure. I now understand that those intermittent failures are caused by the same race, and my work-arounds for the test are not 100% deterministic.

What is the expected behavior?

Deterministic test execution.

What OS and version of Wallaroo are you using?

Ubuntu Linux 16.04/LTS + today's master at commit 510ec6fb7

Steps to reproduce?

make integration-tests-testing-correctness-tests-aloc_sink debug=true resilience=on