dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License
10.09k stars 2.03k forks source link

Orleans streams fail to deliver messages on silo1 after silo2 shutdown #9126

Open FlorianMeinschad opened 2 months ago

FlorianMeinschad commented 2 months ago

We are encountering a critical issue with Orleans 8.2.0 streams that significantly impacts our application's functionality. The problem was first observed in our production environment, and although there were no silo shutdowns or any relevant logs indicating failures, the stream functionality appears to stop working correctly after a period of time.

To investigate further, we created a small proof-of-concept (POC) application that can reproduce the issue most of the time. The issue can be reproduced by shutting down a silo, which occasionally causes the subscriber grains to remain active (sometimes all subscriber grains are stopped), but they stop receiving stream messages.

We suspect a similar problem to the one mentioned in this issue https://github.com/dotnet/orleans/issues/8540.

We are unsure if this is a result of incorrect usage of Orleans streaming features, but the behavior seems inconsistent and doesn't align with our expectations.

Could you kindly review our POC code and provide an analysis of the issue? Any insights or guidance would be greatly appreciated.

Thank you in advance for your assistance. Kind regards

POC Code

https://github.com/FlorianMeinschad/OrleansStreamPOC

POC steps (The issue may not occur every time; multiple attempts might be required):

  1. Start Silo1 and execute the following URLs: http://localhost:5000/pub/5 http://localhost:5000/sub/3
  2. Start Silo2 and execute the following URL: http://localhost:5000/sub/3
  3. Optional Use the dashboard to verify that grains are hosted on both silos http://localhost:5000/dashboard
  4. Monitor the logs for both Silo1 and Silo2. You should see that both silos are receiving messages from the publisher every 5 seconds.
  5. Stop Silo2 by pressing CTRL+C in its console
  6. Check the log for Silo1. You will observe that all subscribers have stopped, and Silo1 no longer receives messages from the publisher.

Log of Silo2:

[13:36:18 INF] Running in environment Development
[13:36:18 INF] Using Orleans Silo Port 20000
[13:36:18 INF] Using Orleans Silo GW 40000
[13:36:18 INF] Use orleans localhost clustering
[13:36:19 INF] Starting Host
[13:36:19 WRN] Note: Silo not running with ServerGC turned on - recommend checking app config : <configuration>-<runtime>-<gcServer enabled="true">
[13:36:19 WRN] Note: ServerGC only kicks in on multi-core systems (settings enabling ServerGC have no effect on single-core machines).
[13:36:20 INF] Starting OrleansDashboard.SiloGrainService grain service on: S127.0.0.1:20000:83417778 x4F1BA0D3, with range <MultiRange: Size=x800B15E6, %Ring=50,017%>
[13:36:20 INF] My range changed from <(0 0], Size=x100000000, %Ring=100%> to <MultiRange: Size=x800B15E6, %Ring=50,017%> increased = True
[13:36:21 INF] Subscriber started successfully on Sile S127.0.0.1:20000:83417778
[13:36:21 INF] Subscriber started successfully on Sile S127.0.0.1:20000:83417778
[13:36:21 INF] Subscriber started successfully on Sile S127.0.0.1:20000:83417778
[13:36:24 INF] Subscriber Grain f7dbf895ee1345f8b3893e77c74375ae on Silo S127.0.0.1:20000:83417778 received publication Message sent from publisher
[13:36:24 INF] Subscriber Grain 8d8879bdae8d450a9ff87f26e2ad69b0 on Silo S127.0.0.1:20000:83417778 received publication Message sent from publisher
[13:36:24 INF] Subscriber Grain 05d60f264237439fb321ea7f2139743c on Silo S127.0.0.1:20000:83417778 received publication Message sent from publisher
[13:36:29 INF] Subscriber Grain 8d8879bdae8d450a9ff87f26e2ad69b0 on Silo S127.0.0.1:20000:83417778 received publication Message sent from publisher
[13:36:29 INF] Subscriber Grain 05d60f264237439fb321ea7f2139743c on Silo S127.0.0.1:20000:83417778 received publication Message sent from publisher
[13:36:29 INF] Subscriber Grain f7dbf895ee1345f8b3893e77c74375ae on Silo S127.0.0.1:20000:83417778 received publication Message sent from publisher
[13:36:38 INF] Stopping OrleansDashboard.SiloGrainService grain service
[13:36:38 INF] Subscriber stopped successfully
[13:36:38 INF] Subscriber stopped successfully
[13:36:38 INF] Subscriber stopped successfully
[13:36:41 INF] Host terminated successfully

Log of Silo1 (no more publisher messages are received):

Sometimes there are also following exceptions thrown:
[13:36:02 INF] Using Orleans Silo Port 11111
[13:36:02 INF] Using Orleans Silo GW 30000
[13:36:02 INF] Use orleans localhost clustering
[13:36:03 INF] Starting Host
[13:36:03 WRN] Note: Silo not running with ServerGC turned on - recommend checking app config : <configuration>-<runtime>-<gcServer enabled="true">
[13:36:03 WRN] Note: ServerGC only kicks in on multi-core systems (settings enabling ServerGC have no effect on single-core machines).
[13:36:04 INF] Starting OrleansDashboard.SiloGrainService grain service on: S127.0.0.1:11111:83417762 x9341EC4D, with range <(0 0], Size=x100000000, %Ring=100%>
[13:36:04 INF] My range changed from <(0 0], Size=x100000000, %Ring=100%> to <(0 0], Size=x100000000, %Ring=100%> increased = True
[13:36:07 INF] Starting publisher grain
[13:36:07 INF] Publisher started successfully on Sile S127.0.0.1:11111:83417762
[13:36:07 INF] Publisher grain started with Id 00000000-0000-0000-0000-000000000000
[13:36:09 INF] Starting subscriber grains
[13:36:10 INF] Subscriber started successfully on Sile S127.0.0.1:11111:83417762
[13:36:10 INF] Subscriber started successfully on Sile S127.0.0.1:11111:83417762
[13:36:10 INF] Subscriber started successfully on Sile S127.0.0.1:11111:83417762
[13:36:10 INF] 3 subscriber grains started
[13:36:14 INF] Subscriber Grain e3ccfd0618bc4383a7a7c2eab4c0d5cd on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:14 INF] Subscriber Grain e7c9aca209eb481daf0689a3940a7778 on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:14 INF] Subscriber Grain c46a7d1b239345b5a77da236fbc35070 on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:19 INF] Subscriber Grain e7c9aca209eb481daf0689a3940a7778 on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:19 INF] Subscriber Grain e3ccfd0618bc4383a7a7c2eab4c0d5cd on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:19 INF] Subscriber Grain c46a7d1b239345b5a77da236fbc35070 on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:19 INF] My range changed from <(0 0], Size=x100000000, %Ring=100%> to <MultiRange: Size=x7FF4EA1A, %Ring=49,983%> increased = True
[13:36:21 INF] Starting subscriber grains
[13:36:21 INF] 3 subscriber grains started
[13:36:24 DBG] Error during subscribing to publisher Grain
[13:36:24 DBG] Error during subscribing to publisher Grain
[13:36:24 DBG] Error during subscribing to publisher Grain
[13:36:24 INF] Subscriber Grain c46a7d1b239345b5a77da236fbc35070 on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:24 INF] Subscriber Grain e7c9aca209eb481daf0689a3940a7778 on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:24 INF] Subscriber Grain e3ccfd0618bc4383a7a7c2eab4c0d5cd on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:29 INF] Subscriber Grain e3ccfd0618bc4383a7a7c2eab4c0d5cd on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:29 INF] Subscriber Grain e7c9aca209eb481daf0689a3940a7778 on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:29 INF] Subscriber Grain c46a7d1b239345b5a77da236fbc35070 on Silo S127.0.0.1:11111:83417762 received publication Message sent from publisher
[13:36:38 INF] My range changed from <MultiRange: Size=x7FF4EA1A, %Ring=49,983%> to <(0 0], Size=x100000000, %Ring=100%> increased = True
[13:36:38 INF] Subscriber stopped successfully

Sometimes the following exceptions are thrown:

Log of Silo2

[13:16:22 INF] Subscriber Grain 791e0ed160214841ad5d8792d290bc19 on Silo S127.0.0.1:20000:83416570 received publication Message sent from publisher
[13:16:28 INF] Stopping OrleansDashboard.SiloGrainService grain service
[13:16:28 INF] Subscriber stopped successfully
[13:16:28 ERR] Error calling grain's OnDeactivateAsync(...) method - Grain type = OrleansPOC.Grains.Subscriber.SubscriberGrain Activation = [Activation: S127.0.0.1:20000:83416570/subscriber/d75b1277d7c7457cab20e5475a994a87@988280c73309451588149aa4cf1c87f0#GrainType=OrleansPOC.Grains.Subscriber.SubscriberGrain,OrleansPOC Placement=RandomPlacement State=Deactivating]
Orleans.Runtime.OrleansMessageRejectionException: Forwarding failed: tried to forward message Request [S127.0.0.1:20000:83416570 subscriber/d75b1277d7c7457cab20e5475a994a87]->[S127.0.0.1:11111:83416554 pubsubrendezvous/STREAM_EXAMPLE/null/STREAM] Orleans.Streams.IPubSubRendezvousGrain.UnregisterConsumer(2ddc34f6-3cea-4bf9-98d0-9049bb42b957, STREAM_EXAMPLE/null/STREAM) #1036[ForwardCount=2] for 2 times after "DeactivateOnIdle was called." to invalid activation. Rejecting now.
   at Orleans.Serialization.Invocation.ResponseCompletionSource.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 98
   at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location ---
   at Orleans.Streams.StreamConsumer`1.UnsubscribeAsync(StreamSubscriptionHandle`1 handle) in /_/src/Orleans.Streaming/Internal/StreamConsumer.cs:line 183
   at OrleansPOC.Grains.Subscriber.SubscriberGrain.OnDeactivateAsync(DeactivationReason reason, CancellationToken cancellationToken) in D:\OrleansPOC\OrleansPOC\Grains\Subscriber\SubscriberGrain.cs:line 44
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable(Task task, CancellationToken cancellationToken) in /_/src/Orleans.Core/Async/TaskExtensions.cs:line 187
   at Orleans.Internal.OrleansTaskExtentions.WithCancellation(Task taskToComplete, String message, CancellationToken cancellationToken) in /_/src/Orleans.Core/Async/TaskExtensions.cs:line 144
   at Orleans.Runtime.ActivationData.<FinishDeactivating>g__CallGrainDeactivate|145_0(CancellationToken ct) in /_/src/Orleans.Runtime/Catalog/ActivationData.cs:line 1799
[13:16:30 INF] Host terminated successfully

Log of Silo1

[13:16:17 INF] Subscriber Grain 4a1dafd47e4940d28b1b695d3f7a4e8c on Silo S127.0.0.1:11111:83416554 received publication Message sent from publisher
[13:16:22 INF] Subscriber Grain 4a1dafd47e4940d28b1b695d3f7a4e8c on Silo S127.0.0.1:11111:83416554 received publication Message sent from publisher
[13:16:22 INF] Subscriber Grain 3e23f90d920d47a9b665edb9c44bbbd7 on Silo S127.0.0.1:11111:83416554 received publication Message sent from publisher
[13:16:28 INF] My range changed from <MultiRange: Size=x8E8388F1, %Ring=55,669%> to <(0 0], Size=x100000000, %Ring=100%> increased = True
[13:16:28 INF] Subscriber stopped successfully
[13:16:28 INF] Subscriber stopped successfully
[13:16:28 WRN] Failed to forward message Request [S127.0.0.1:20000:83416570 subscriber/d75b1277d7c7457cab20e5475a994a87]->[S127.0.0.1:11111:83416554 pubsubrendezvous/STREAM_EXAMPLE/null/STREAM] Orleans.Streams.IPubSubRendezvousGrain.UnregisterConsumer(2ddc34f6-3cea-4bf9-98d0
-9049bb42b957, STREAM_EXAMPLE/null/STREAM) #1036[ForwardCount=2] from [GrainAddress GrainId pubsubrendezvous/STREAM_EXAMPLE/null/STREAM, ActivationId: @436c6408319b4230b861f9e3fcb0b55a, SiloAddress: S127.0.0.1:11111:83416554] to null after DeactivateOnIdle was called.. Attempt 2
scalalang2 commented 2 months ago

If you're using MemoryStream in production. There was similar issue #8399 #8542.

carloscouce commented 2 months ago

I'm seeing the same issue with persistent (postgres) streams. As described by @FlorianMeinschad, it is rather sporadic and hard to pinpoint exactly under what circumstances consumers stop receiving messages.

feitzi commented 2 months ago

I was under the impression that Orleans was production-ready software. However, the bug (and the two comments above) indicate that it is not yet production-ready and that important functions (such as PubSub) do not work as indicated. This raises concerns about its suitability for enterprise solutions.

carloscouce commented 2 months ago

To be fair, I think persistent streams for AdoNet is relatively new so I can imagine there are still some bugs to be snuffed out. I’m going to wire up the code to Azure Queues to validate whether it’s an issue with the underlying AdoNet stream implementation.

On Tue, Aug 27, 2024 at 1:08 AM Mathias Feitzinger @.***> wrote:

I was under the impression that Orleans was production-ready software. However, the bug (and the two comments above) indicate that it is not yet production-ready and that important functions (such as PubSub) do not work as indicated. This raises concerns about its suitability for enterprise solutions.

— Reply to this email directly, view it on GitHub https://github.com/dotnet/orleans/issues/9126#issuecomment-2311582605, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADL4KRXMXYLLQOZC7DPT7D3ZTQCUVAVCNFSM6AAAAABNDJEG6OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJRGU4DENRQGU . You are receiving this because you commented.Message ID: @.***>

scalalang2 commented 2 months ago

I'm not sure if this is helpful, We've used Orleans Stream with AWS SQS provider. it works fine.

feitzi commented 2 months ago

So in summary, there is currently no solution to get orleans streaming up and running without using Azure or AWS services? In our case we host the solution on our own servers. We have SQL servers and use the ADO.Net Orleans provider. The streaming is only used for a PubSub case (informing user grains about new information).

carloscouce commented 2 months ago

@scalalang2 Thanks, that is actually helpful. Haven't seen much documentation about Orleans on AWS, so didn't think of trying it!

In the meantime, I tried on Azure and it's working as expected with little to no changes to the actual domain code, so it does confirm that the issue is with the ADO.NET streaming provider.

@feitzi I'm wondering whether it's possible to mix and match the tech stack, meaning everything else is ran on ADO.NET (actor state, reminders, etc) while PubSub is ran on Azure/AWS. Will give it a try later this week...