estuary / flow

🌊 Continuously synchronize the systems where your data lives, to the systems where you _want_ it to live, with Estuary Flow. 🌊
https://estuary.dev
Other
625 stars 53 forks source link

Flowctl test hangs when lambda throws an exception #123

Open psFried opened 3 years ago

psFried commented 3 years ago

Steps to reproduce:

Expected output is that the test should show as failed. Ideally it would print something like "publish lambda returned error: not implemented".

What's actually happening is that it logs the error, but then hangs indefinitely. You need to ctrl+c in order to stop it. I've attached the full test output, in addition to the goroutine traces, which I captured after the tests had hung, and the output of a gazctl shards list.

test-out.txt test-goroutines.txt test-shards.txt

psFried commented 3 years ago

I had some additional thoughts on a possible approach for fixing this. What we really want to know, at least for the purpose of running tests, is whether a shard has ever failed in the course of the test. My assumption (please correct if wrong) is that for a "normal" call to Stat a shard you're expecting that a failed shard will go through the "normal" lifecycle, and that that is not true for tests, where failed shards do not get re-started. It's possible that the Resolver should still account for this somehow, but I'm really not sure. If the current Resolver behavior is correct, then perhaps the test driver should use some other mechanism to watch for shard failures and surface them as test failures.

jgraettinger commented 3 years ago

There's no facility in Gazette for re-starting failed shards. Long ago that used to be the behavior, but that turned out to be an anti-pattern -- now, the expectation is that anything that's automatically retry-able is retried before you allow a top-level error to bubble up and fail the shard. So, failed shards are left as failed because they require operator intervention.

Yea, the behavior is approximately for Resolver to check whether the current assignment is failed, though my current confidence is only moderate until I've had a closer look at those code paths.

jgraettinger commented 3 years ago

FWIW, I see zero reason not to have Resolver universally error stats of failed shards -- this just seems an oversight.