Orleans stream pulling agent stops pulling messages from stream suddenly

dotnet / orleans


Orleans stream pulling agent stops pulling messages from stream suddenly #7195

Open MittalNitish opened 3 years ago

MittalNitish commented 3 years ago

Hi Team / @jason-bragg

We are using Orleans 2.2.4.

We are using Azure Event Hubs streams with Orleans streaming. Previously we used the ConsistentRingQueueBalancer, but with that queue balancer the streams were not distributed equally among silos, which caused high memory usage on some silos. We recently switched to ClusterConfigDeploymentLeaseBasedBalancer. Since switching to the lease-based balancer, we have noticed that the stream subscriber grain sometimes stops receiving messages from the stream, although no publish error is reported on the sender grain side. When we restart the silos, the old pending messages start getting processed. This issue did not occur with ConsistentRingQueueBalancer. We need your help finding the cause and debugging this issue. Please let me know if you need any specific logs.

Thanks

benjaminpetit commented 3 years ago

Orleans 2.2.4 is quite old now. Could you upgrade to the latest 3.x release? You may have some changes in the config, but everything else should be backward compatible.

MittalNitish commented 3 years ago

Thanks @benjaminpetit ,

Is this a known issue/bug in version 2.2.4? If yes, does it work well in the latest version? Our code base is huge and uses .NET Framework 4.6.2, which makes it very difficult to upgrade Orleans, so we need to make sure that upgrading would fix this.

benjaminpetit commented 3 years ago

I don't recall a related issue that we might have fixed. But that version is old, which makes it more difficult for us to investigate.

I see that you are using the LeaseBasedQueueBalancer based on the StaticClusterDeploymentOptions, which might be the issue: on reboot, do your silos have the same name? Do you have any logs for the queue balancer?
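For reference, the silo name is configured separately from the cluster ID. A minimal sketch, assuming the `SiloOptions.SiloName` property from `Orleans.Configuration` and the same `builder.Configure<TOptions>` pattern used for the other options; the `UseStableSiloName` helper and the choice of the machine name are hypothetical, and whether a stable name helps with lease ownership in 2.2.4 is unconfirmed:

```csharp
using System;
using Orleans.Configuration;
using Orleans.Hosting;

public static class SiloNamingExtensions
{
    // Hypothetical helper (not part of Orleans): gives the silo a name
    // that stays the same across restarts of the same host.
    public static ISiloHostBuilder UseStableSiloName(this ISiloHostBuilder builder)
    {
        return builder.Configure<SiloOptions>(options =>
        {
            // Assumption: the machine name is unique per silo and stable on reboot.
            options.SiloName = Environment.MachineName;
        });
    }
}
```

Note that `ClusterId` and `ServiceId` identify the cluster and the service, not an individual silo, so they do not answer the question about silo names.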

MittalNitish commented 3 years ago

Yes @benjaminpetit ,

The silos have the same ClusterId, set to a constant value in the builder:

builder.Configure<ClusterOptions>(clusterOptions =>
{
    clusterOptions.ClusterId = OrleansConfigurationConstants.ClusterId;
    clusterOptions.ServiceId = OrleansConfigurationConstants.ServiceId;
});

I will fetch some logs and update here in a while.

jason-bragg commented 3 years ago

The LeaseBasedQueueBalancer was introduced in 2.x and had some issues that were addressed in 3.x, so yes, there are known issues with that system.

MittalNitish commented 3 years ago

Hi @benjaminpetit, I have added the logs captured during silo restarts and a code snippet of the stream configuration: logs_code_snippet.zip

ghost commented 2 years ago

We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.