slfritchie closed this 4 years ago
The reconstruct_input_producer_barrier_events() method in BarrierSinkPhase is bogus: it can reorder app msgs wrt barrier tokens ... and I've now witnessed that exact reordering happening, derp. I definitely need to rip it out, first thing tomorrow morning.
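For readers who want the class of bug spelled out, here is a minimal sketch in Python. It is not Wallaroo's code and does not reproduce what reconstruct_input_producer_barrier_events() actually does. The names AppMsg, BarrierToken, OrderedPhaseBuffer, and drain_split_buffers are all hypothetical. It only illustrates that once app messages and barrier tokens are buffered separately, their original interleaving cannot be recovered, while a single FIFO preserves it.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AppMsg:
    """Hypothetical stand-in for an application message."""
    payload: str


@dataclass(frozen=True)
class BarrierToken:
    """Hypothetical stand-in for a checkpoint barrier token."""
    checkpoint_id: int


class OrderedPhaseBuffer:
    """Hypothetical buffer: one FIFO shared by both event kinds,
    so draining replays events in exactly their arrival order."""

    def __init__(self):
        self._queue = deque()

    def push(self, event):
        self._queue.append(event)

    def drain(self):
        events = list(self._queue)
        self._queue.clear()
        return events


def drain_split_buffers(events):
    """Buggy shape, for contrast only: events are split by kind and
    then stitched back together, which loses the interleaving."""
    msgs = [e for e in events if isinstance(e, AppMsg)]
    barriers = [e for e in events if isinstance(e, BarrierToken)]
    return msgs + barriers


arrivals = [AppMsg("a"), BarrierToken(1), AppMsg("b")]

buf = OrderedPhaseBuffer()
for event in arrivals:
    buf.push(event)

ordered = buf.drain()                      # same order as arrivals
reordered = drain_split_buffers(arrivals)  # "b" now precedes barrier 1
```

In the split version, message "b" is replayed before barrier 1 even though it arrived after it, which is the kind of reordering described above.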
@jtfmumm This branch is near ready: I'm looking at an intermittent failure right now, but it has had a big overhaul of how the phase buffering is done. This PR replaces a lot of bad code that (alas for me) worked most of the time, so even if it isn't 100% perfect, it's far better than today's master branch. Unless there's something terrible lurking in here, I think I can work on the intermittent failures in separate bugs & PRs.
The unit test failure for commit ccbe96e is apparently due to the timer that I'd added to Step, ouch. I'll put an ifdef around the timer's setup so that normal unit test compilation won't have to see it.
Re-implements the ConnectorSink + 2-Phase Commit protocol via two new FSMs, the "external connection operations" FSM and the "checkpoint/rollback operations" FSM.
_ExtConnOps: aka, "external connection operations". This trait describes the FSM used to manage mid-level details of this sink's TCP connection, e.g., disconnected, connected but not yet ready to send application data to the external sink, and fully operational. See the FSM state diagram on the right-hand side of connector-sink-2pc-management.png.
_CpRbOps: aka, "checkpoint/rollback operations". This trait describes the FSM used to manage the high-level operation of this sink and Wallaroo's overall status (e.g., starting, rolling back, running), together with _ExtConnOps's mid-level management of the TCP connection. See the FSM state diagram on the left-hand side of connector-sink-2pc-management.png.
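As a rough illustration of how two cooperating FSMs like these can fit together, here is a minimal Python sketch. Only the state names come from the descriptions above. The event names, the transition tables, and the rule that application data flows only when both FSMs are in their operational states are assumptions made for illustration; the real transitions are in connector-sink-2pc-management.png and the Pony source.

```python
from enum import Enum, auto


class ExtConn(Enum):
    """Mid-level TCP connection states named in the description."""
    DISCONNECTED = auto()
    CONNECTED = auto()   # connected, not yet ready for app data
    READY = auto()       # fully operational


class CpRb(Enum):
    """High-level checkpoint/rollback states named in the description."""
    STARTING = auto()
    ROLLING_BACK = auto()
    RUNNING = auto()


# Assumed event names and transitions, for illustration only.
EXT_CONN_TRANSITIONS = {
    (ExtConn.DISCONNECTED, "tcp_connected"): ExtConn.CONNECTED,
    (ExtConn.CONNECTED, "handshake_complete"): ExtConn.READY,
    (ExtConn.CONNECTED, "tcp_closed"): ExtConn.DISCONNECTED,
    (ExtConn.READY, "tcp_closed"): ExtConn.DISCONNECTED,
}

CP_RB_TRANSITIONS = {
    (CpRb.STARTING, "rollback_requested"): CpRb.ROLLING_BACK,
    (CpRb.RUNNING, "rollback_requested"): CpRb.ROLLING_BACK,
    (CpRb.ROLLING_BACK, "rollback_complete"): CpRb.RUNNING,
}


def step(table, state, event):
    """Look up the next state; reject events the table does not define."""
    try:
        return table[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}")


class SinkSketch:
    """Hypothetical sink holding one instance of each FSM."""

    def __init__(self):
        self.conn = ExtConn.DISCONNECTED
        self.cprb = CpRb.STARTING

    def on_conn_event(self, event):
        self.conn = step(EXT_CONN_TRANSITIONS, self.conn, event)

    def on_cprb_event(self, event):
        self.cprb = step(CP_RB_TRANSITIONS, self.cprb, event)

    def can_send_app_data(self):
        # Assumption: both FSMs must be in their operational state.
        return self.conn is ExtConn.READY and self.cprb is CpRb.RUNNING
```

Keeping the two tables separate means a TCP disconnect changes only the connection state, while the checkpoint/rollback state is left alone until Wallaroo itself drives a transition.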
Partial fix for bug #3097
Fixes #3086
Fixes #3031
Perhaps addresses #2878
Fixes #2814