azure-contrib / socket.io-servicebus

socket.io store which uses Service Bus pub/sub for scale out
Apache License 2.0

messagesequencer stalls forever when a single message is missing #28

Closed jcookems closed 11 years ago

jcookems commented 11 years ago

The messagesequencer assumes that there will be no gaps in the sequence of messages from the subscription for a given node. This allows it to ensure strong ordering.

However, this guarantee comes at a price: if a single message is missing from the stream, the sequencer will back up forever, consuming more and more memory.

I can think of two situations where this could happen:

  1. One node is already running, pumping messages into the topic. Then the user creates a new subscription for a new node. Given that Service Bus does not have strong ordering guarantees, the stream of messages from the topic might be 1,2,4,3,5,6. If the subscription is hooked up after message 4 leaves the topic, then the subscription would have these messages: 3,5,6,....
  2. Perhaps someone is debugging a live-site issue, and destructively reads one of the messages from the subscription. Then the subscription could have 3,5,6,...
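
To make the stall concrete, here is a minimal sketch of a strictly ordered sequencer fed the stream 3,5,6,... from the cases above. The class and its member names are hypothetical and are not taken from the actual messagesequencer code; the sketch also assumes the sequencer somehow knows that 3 is the first message to expect.

```typescript
// Hypothetical strict sequencer (illustrative names, not the library's API).
// It releases a message only when every earlier sequence number has arrived.
class StrictSequencer {
  private next: number;
  private pending: { [seq: number]: string } = {};
  delivered: string[] = [];

  constructor(firstExpected: number) {
    this.next = firstExpected;
  }

  receive(seq: number, body: string): void {
    this.pending[seq] = body;
    // Drain everything that is now contiguous with the last delivered message.
    while (this.next in this.pending) {
      this.delivered.push(this.pending[this.next]);
      delete this.pending[this.next];
      this.next++;
    }
  }

  get pendingCount(): number {
    return Object.keys(this.pending).length;
  }
}

// The subscription never sees message 4, so everything after 3 is buffered.
const strict = new StrictSequencer(3);
for (const seq of [3, 5, 6, 7, 8]) {
  strict.receive(seq, "msg-" + seq);
}
console.log(strict.delivered, strict.pendingCount);
```

Only message 3 is delivered here; the other four stay buffered, and the buffer keeps growing for as long as messages arrive.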

Both cases seem unlikely, but either one has devastating consequences:

  1. Stalling the output of the sequencer, with no fix other than restart
  2. Memory use strictly increases until the process dies?
  3. The messages keep getting read and deleted from the subscription, so restarting the node does not recover them: everything that was buffered in memory is lost.

Two fixes I can think of are:

  1. To fix the stalling, cap the size of the pendingMessages list to something like 10. After that point, we should assume that the message we are blocking on will not come, and just give up on it (but log the failure!), and move on to the next one.
  2. To fix the loss of data if stalling occurs, peek-lock the messages and delete them only after sending to SocketIO, but that doubles our network calls.
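
The first fix could look roughly like the sketch below. It is only an illustration under assumptions: the class, the `maxPending` parameter, and the `skipped` list are hypothetical names, and the real pendingMessages handling may be structured differently.

```typescript
// Hypothetical capped sequencer (illustrative names, not the library's API).
// Once more than maxPending messages are buffered behind a gap, it logs the
// missing sequence numbers, gives up on them, and resumes delivery.
class CappedSequencer {
  private next: number;
  private pending: { [seq: number]: string } = {};
  private pendingCount = 0;
  delivered: string[] = [];
  skipped: number[] = [];

  constructor(firstExpected: number, private maxPending: number = 10) {
    this.next = firstExpected;
  }

  receive(seq: number, body: string): void {
    // Ignore duplicates and late arrivals we have already moved past.
    if (seq < this.next || seq in this.pending) return;
    this.pending[seq] = body;
    this.pendingCount++;
    this.drain();
    if (this.pendingCount > this.maxPending) {
      // Assume the blocking message will never come: jump ahead to the
      // lowest sequence number we actually hold.
      const lowest = Math.min(...Object.keys(this.pending).map(Number));
      for (let s = this.next; s < lowest; s++) {
        this.skipped.push(s);
        console.warn("giving up on missing message " + s);
      }
      this.next = lowest;
      this.drain();
    }
  }

  private drain(): void {
    while (this.next in this.pending) {
      this.delivered.push(this.pending[this.next]);
      delete this.pending[this.next];
      this.pendingCount--;
      this.next++;
    }
  }
}

// With a cap of 3, the gap at message 4 is abandoned once 5, 6, 7 and 8 are held.
const capped = new CappedSequencer(3, 3);
for (const seq of [3, 5, 6, 7, 8]) {
  capped.receive(seq, "msg-" + seq);
}
console.log(capped.delivered, capped.skipped);
```

The trade-off is that a message arriving after its slot has been abandoned is dropped, so the cap has to be larger than any reordering the transport can realistically produce.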

christav commented 11 years ago

Based on discussion with Service Bus team today, Service Bus does have strong ordering guarantees. How does this new information affect the concerns here?

If we do still need to implement a fix here, I'd lean towards a timeout of some sort: if an expected message doesn't come in after, say, 10 seconds, it's probably not going to.
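
A rough sketch of that timeout idea follows. It is an illustration only: the class and method names are hypothetical, and the current time is passed in explicitly (an assumption made so the example is deterministic) rather than read from a real timer.

```typescript
// Hypothetical timeout-based sequencer (illustrative names, not the library's API).
// If delivery has been blocked on a gap for timeoutMs, the missing messages
// are skipped and delivery resumes from the lowest buffered sequence number.
class TimeoutSequencer {
  private next: number;
  private pending: { [seq: number]: string } = {};
  private blockedSince: number | null = null;
  delivered: string[] = [];
  skipped: number[] = [];

  constructor(firstExpected: number, private timeoutMs: number = 10000) {
    this.next = firstExpected;
  }

  receive(seq: number, body: string, now: number): void {
    if (seq < this.next) return; // late arrival for a slot already passed
    this.pending[seq] = body;
    this.drain(now);
  }

  // Intended to be called periodically, for example from setInterval.
  checkTimeout(now: number): void {
    if (this.blockedSince === null || now - this.blockedSince < this.timeoutMs) return;
    const lowest = Math.min(...Object.keys(this.pending).map(Number));
    for (let s = this.next; s < lowest; s++) this.skipped.push(s);
    this.next = lowest;
    this.drain(now);
  }

  private drain(now: number): void {
    let progressed = false;
    while (this.next in this.pending) {
      this.delivered.push(this.pending[this.next]);
      delete this.pending[this.next];
      this.next++;
      progressed = true;
    }
    if (Object.keys(this.pending).length === 0) {
      this.blockedSince = null; // nothing is waiting behind a gap
    } else if (progressed || this.blockedSince === null) {
      this.blockedSince = now; // a new gap starts its own clock
    }
  }
}

const timed = new TimeoutSequencer(3, 10000);
timed.receive(3, "msg-3", 0);
timed.receive(5, "msg-5", 1000); // message 4 is missing; the clock starts here
timed.receive(6, "msg-6", 2000);
timed.checkTimeout(5000);        // only 4 seconds blocked: keep waiting
timed.checkTimeout(11000);       // 10 seconds blocked: skip 4, deliver 5 and 6
console.log(timed.delivered, timed.skipped);
```

Compared with a cap on the buffer size, a timeout bounds how long output can stall but not how much memory is used in the meantime, so the two limits could also be combined.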

glennblock commented 11 years ago

Not an issue