dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License
9.94k stars 2.02k forks source link

Streams - no events are being sent to an explicit consumer after rewinding #6319

Open yonatan-d-sap opened 4 years ago

yonatan-d-sap commented 4 years ago

Using v2.4.3, we seem to experience an issue with an identical scenario to issue #901, only now it happens with explicit consumers.

I think it's caused by change of code in the pulling agent (PersistentStreamPullingAgent) which now has the following code in AddSubscriber_Impl:

    if (!pubSubCache.TryGetValue(streamId, out streamDataCollection))
    {    
        // If stream is not in pubsub cache, then we've received no events on this stream, and will  acquire the subscriptions from pubsub when we do.
        return;  
    }

This is a change from version 1.3, where in this case the consumer was added to the pub sub cache and then RunConsumerCursor() would be called, so even if no new events arrive from the queue, the events from the cache would be sent to the consumer.

jason-bragg commented 4 years ago

@yonatan-d-sap, Your analysis is correct, but it's unclear if this is a bug. I'll add some context to help us explore.

To prevent the pulling agents from accumulating dead subscriptions over time, the pulling agents will clear out inactive subscriptions periodically. The length of time an inactive subscription is tracked by the pulling agent is controlled by StreamPullingAgentOptions.StreamInactivityPeriod, and by default is 30 minutes.

The rewinding of streams were intended to handle transient errors, so the expectation is that the inactivity period will be at or below the message cached retention time, controlled by StreamCacheEvictionOptions.DataMaxAgeInCache, which also defaults to 30 minutes.

Since messages over 30 minutes old (or the configured StreamCacheEvictionOptions.DataMaxAgeInCache) won't be in the cache any more, it's considered safe to remove the subscription.

Assuming the StreamInactivityPeriod and DataMaxAgeInCache are loosely aligned, the request to rewind a stream to a point beyond those values will generate a DataNotAvailable exception when the stream becomes active again, which is the same expectation one should have if they rewound past the DataMaxAgeInCache regardless of StreamInactivityPeriod. The StreamInactivityPeriod mainly effects whether the exception is thrown immediately or after the stream becomes active again.

If the DataMaxAgeInCache is significantly larger than StreamInactivityPeriod, then valid rewind calls can fail to rewind the stream, which may be what you're experiencing. If that is the case, then aligning the two settings 'should' address this.

yonatan-d-sap commented 4 years ago

@jason-bragg The whole test scenario lasts less than a minute, so the messages are still in the cache. So these default periods of time should not affect this scenario. Correct me if I'm wrong, but according to what I understand from debugging, the reason why the cache is not checked is because the consumer unsubscribed (and then re-subscribed). When the consumer unsubscribes, he is taken out of the pub-sub cache. But when it re-subscribes it's not immediately added back to the pub-sub cache and existing cached messages are not dispatched to the consumer (in contrast to v1.3), although the messages are still in the cache.

jason-bragg commented 4 years ago

Ah, I think I see what you're getting at. If a stream has a single consumer, and the single consumer unsubscribes then resubscribes, the stream will not be rewound until there is more activity on the stream. If so, this is improper behavior. Will investigate.

ghost commented 1 year ago

We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.