What is the bug?
After upgrading our Shuttle-powered app from Shuttle v5.10.0 to the latest v0.6.4, I noticed that our reconciliation/backfill jobs hang indefinitely after running for a while. We use BullMQ to manage reconcile/backfill jobs, much like the Shuttle example-app. The hung jobs are stuck in the active state but aren't doing anything, and there is nothing meaningful in our logs.
I saw there's a new streamFetch hub API being used for reconciliation as of Shuttle v0.6.1. Thinking that might be the problem, I forced it off by extending Shuttle's MessageReconciliation class within our app like so:
    export class SelectiveMessageReconciliation extends MessageReconciliation {
      constructor(client: HubRpcClient, db: DB, log: pino.Logger) {
        super(client, db, log);
        // Immediately close the stream connection because it appears to be buggy. Closing the stream causes Shuttle to
        // fall back on the old method of fetching messages, which is less efficient but may be more reliable.
        this.close();
      }
    }
The problem persisted after this, which suggests the new streamFetch API is not itself at fault. However, through trial and error I determined that the hanging behavior was still introduced in v0.6.1. I think there is something wrong with how promises/errors are handled in this commit, such that backfill/reconciliation no longer automatically recovers from hub API connection failures like it used to.
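To illustrate the kind of failure I suspect, here is a minimal sketch of a timeout guard around a promise that never settles. This is not Shuttle or BullMQ code: `withTimeout` and `TimeoutError` are hypothetical names I made up, and I'm only assuming the hang comes from an awaited promise that neither resolves nor rejects after a hub connection failure.

```typescript
// Hypothetical helper, not part of Shuttle or BullMQ.
class TimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} did not settle within ${ms}ms`);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `work` has not settled after `ms` milliseconds.
// Note: this does not cancel the underlying work; it only stops waiting on it.
async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  label = "operation",
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(label, ms)), ms);
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

If the reconciliation call inside a job processor were wrapped like this, a hung call would presumably surface as a thrown error, so the job would fail (and could be retried) instead of sitting in the active state forever. That would only be a workaround on our side, not a fix for the underlying issue.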
How can it be reproduced? (optional)
Unfortunately, I don't have rock-solid repro steps. I'm basing this on observations of our Shuttle-powered app's metrics while our hubs intermittently experienced unrelated failures and I swapped between different Shuttle versions. My guess is that you could reproduce it like this:
1. Start the Shuttle example-app in both streaming and backfill modes, pointing at some test hub.
2. Take down the test hub, purposefully causing the example-app's connections to it to start failing.
3. Bring the test hub back, and notice that the example-app's streaming process recovers automatically but the backfill jobs don't. They hang indefinitely instead.