altangent / ccxws

WebSocket client for 38 cryptocurrency exchanges
MIT License
619 stars 187 forks source link

Discussion: Sequence ID validation race condition when requesting snapshot via REST #252

Closed evan-coygo closed 3 years ago

evan-coygo commented 3 years ago

For exchanges that request the l2 snapshot over REST there is a pretty common race condition issue that can lead to missing order book data.

The problem w/example

I'll use Bittrex as an example since the issue happens often for me with Bittrex. The culprit code looks like this:

 _sendSubLevel2Updates(remote_id, market) {
    this._requestLevel2Snapshot(market);
    this._wss.send(
      JSON.stringify({
        H: "c3",
        M: "Subscribe",
        A: [[`orderbook_${remote_id}_${this.orderBookDepth}`]],
        I: ++this._messageId,
      })
    );
  }

The issue is as follows:

  1. I call subscribeLevel2Updates()
  2. First it makes a REST request to get the snapshot via this._requestLevel2Snapshot(market)
  3. While that REST request is sent, the ws "Subscribe" event is sent to subscribe to l2 updates.
  4. The REST call gets its response first, and emits a l2snapshot event with sequenceId of "100".
  5. The ws stream for l2 updates starts arriving, with the first event having a sequenceId of "105".

The snapshot has sequenceId of "100" while the first update has a sequenceId of "105". This means you missed four updates and your order book will never be valid.

Solutions

bmancini55 commented 3 years ago

@evan-coygo, thanks for submitting the issue! Yeah, this is sort of a known issue that's shows its ugly head when you try to build certain books.

To rehash, the issue is that the socket stream starts after the sequenceId returned by the REST API. In many cases I think this is due to request caching of the REST API. So you end up with a slightly stale snapshot (lower sequenceId) than you have obtained via the socket stream. It could also just be that establishing socket subscriptions takes longer than the REST request for some exchanges. I digress on the cause.

From a code perspective, to avoid this issue, you need to perform the snapshot after the subscription has been confirmed and you've queued a sufficient number of messages (intentionally ambiguous as the sufficient depth is likely exchange and latency specific).

In the past I've recommend monkey patching a timestamp delay over _requestLevel2Snapshot(market) for the client:

const REST_DELAY_MS = 500
client._originalRequestLevel2Snapshot = client._requestLevel2Snapshot;
client._requestLevel2Snapshot = market => setTimeout(() => client._originalRequestLevel2Snapshot(market), REST_DELAY_MS);

However, this solution isn't foolproof and may introduce some funkiness on reconnections.

Ideally (as usually), changes after the refactor (#149) which will include subscription success/failure (#103) combined with firing the snapshot request after a delay after subscription success would be best.

As a fallback though, I think you need to be prepared to call _requestLevel2Snapshot if you receive a snapshot older than your update stream. Which totally should certainly be documented!

evan-coygo commented 3 years ago

Alright thanks for the monkey-patch recommendation. I've added that as a temp fix and it seems to work okay-ish, albeit without a guarantee of success

evan-coygo commented 3 years ago

This seems like it'll be around for awhile so I've added it to the README in a PR #253

bmancini55 commented 3 years ago

Awesome thanks! Will take a look shortly.

Aside: this would be a good issue to convert to a Discussion. Not exactly sure how though... the "convert to issue" link seems to have disappeared. Way to go GitHub haha.

evan-coygo commented 3 years ago

I'll be honest, I didn't know that was a thing. Great idea though to avoid clutter in Issues. Discussion created here https://github.com/altangent/ccxws/discussions/255, I'm closing this issue

bmancini55 commented 3 years ago

Cool thanks! In some weird confluence of events, the "Convert to Discussion" button is back. Weird.