farcasterxyz / hub-monorepo

Implementation of the Farcaster Hub specification and supporting libraries for building applications on Farcaster
https://www.thehubble.xyz
MIT License

Shuttle reconciliation hangs indefinitely in versions >= 0.6.1 #2346

Open tybook opened 1 day ago

tybook commented 1 day ago

What is the bug?

After upgrading our Shuttle-powered app from Shuttle v5.10.0 to the latest v0.6.4, I noticed that our reconciliation/backfill jobs hang indefinitely after running for a while. We use BullMQ to manage reconcile/backfill jobs, much like the Shuttle example-app. The hung jobs are stuck in the active state but aren't doing any work, and there is nothing meaningful in our logs.

I saw there's a new streamFetch hub API being used for reconciliation as of Shuttle v0.6.1. Thinking that might be the problem, I forced it off by extending Shuttle's MessageReconciliation class within our app like so:

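// Import paths below are assumed from the published packages.
import { type HubRpcClient } from "@farcaster/hub-nodejs";
import { type DB, MessageReconciliation } from "@farcaster/shuttle";
import { type pino } from "pino";

export class SelectiveMessageReconciliation extends MessageReconciliation {
  constructor(client: HubRpcClient, db: DB, log: pino.Logger) {
    super(client, db, log);
    // Immediately close the stream connection because it appears to be buggy. Closing the stream causes Shuttle to
    // fall back on the old method of fetching messages, which is less efficient but may be more reliable.
    this.close();
  }
}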

The problem persisted after this, which suggests the new streamFetch API itself isn't at fault. Still, through trial and error I determined that the hanging behavior was introduced in v0.6.1. I think something is wrong with how promises/errors are handled in this commit, such that backfill/reconciliation no longer automatically recovers from hub API connection failures like it used to.
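To make the suspicion concrete, here is a minimal sketch of the general failure class I have in mind. It is not Shuttle's actual code: `collectStreamLeaky` and `collectStreamSafe` are hypothetical names, and a plain Node `EventEmitter` stands in for a hub stream. A promise that only settles on `end` stays pending forever if the connection fails and the stream emits `error` instead, so anything awaiting it hangs without logging anything.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical leaky wrapper: the promise settles only on "end".
// If the stream errors out, the error is swallowed and the promise never settles.
function collectStreamLeaky(stream: EventEmitter): Promise<string[]> {
  return new Promise((resolve) => {
    const items: string[] = [];
    stream.on("data", (item: string) => items.push(item));
    stream.on("end", () => resolve(items));
    stream.on("error", () => {
      // Swallowed: no reject, so the awaiting caller waits forever.
    });
  });
}

// Hypothetical safe wrapper: every terminal event settles the promise.
function collectStreamSafe(stream: EventEmitter): Promise<string[]> {
  return new Promise((resolve, reject) => {
    const items: string[] = [];
    stream.on("data", (item: string) => items.push(item));
    stream.once("end", () => resolve(items));
    stream.once("error", (err: Error) => reject(err));
    // A "close" after "end" is harmless: rejecting an already-settled promise is a no-op.
    stream.once("close", () => reject(new Error("stream closed before end")));
  });
}
```

If something like the leaky variant sits anywhere in the reconciliation path, a job awaiting it would look exactly like what we see: active, idle, and silent.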

How can it be reproduced? (optional)

I don't have rock-solid repro steps, unfortunately. I'm basing this on observations of our Shuttle-powered app's metrics while our hubs intermittently experienced unrelated failures and I swapped between Shuttle versions. I'm guessing you could reproduce it like this, though:

  1. Start the Shuttle example-app in both streaming and backfill modes, pointing at some test hub.
  2. Take down the test hub, purposefully causing the example-app's connections to it to start failing.
  3. Bring the test hub back up, and notice that the example-app's streaming process automatically recovers but the backfill jobs don't. They hang indefinitely instead.
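As a possible stopgap until the root cause is found, the reconcile call inside each job handler could be raced against a timer, so a hung call fails the job (and can be retried) instead of sitting in the active state forever. This is only a sketch: `withTimeout` and `TimeoutError` are hypothetical helpers, not part of Shuttle or BullMQ, and the timeout does not cancel the underlying call; it only stops waiting on it.

```typescript
// Hypothetical error type so callers can tell a timeout apart from other failures.
class TimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} did not settle within ${ms}ms`);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms` milliseconds.
// The underlying work keeps running; we merely stop awaiting it.
async function withTimeout<T>(work: Promise<T>, ms: number, label = "operation"): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(label, ms)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it does not keep the process alive.
    clearTimeout(timer);
  }
}
```

Wrapping the per-FID reconcile promise in something like this would at least turn a silent hang into a visible, retryable job failure, which should also make the repro above easier to confirm.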