dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License
10.11k stars 2.04k forks source link

System.TimeoutException: Response did not arrive on time in 00:00:30 for message #8282

Closed JaktensTid closed 11 months ago

JaktensTid commented 1 year ago

Hi,

I am getting

System.TimeoutException: Response did not arrive on time in 00:00:30 for message: Request [S172.17.48.38:11111:33609578 symactor/666aeb257e2b47c0ae01dab451f3e10d]

I am using ado.net clustering as such builder.Host.UseOrleans(silo => { silo .UseAdoNetClustering(options => { options.Invariant = "Npgsql"; options.ConnectionString = builder.Configuration.GetConnectionString("postgres"); }) .ConfigureLogging(options => options.SetMinimumLevel(LogLevel.Information)) .AddIncomingGrainCallFilter<LoggingInterceptor>() .AddAdoNetGrainStorage(StateStorageConstants.StorageName, options => { options.Invariant = "Npgsql"; options.ConnectionString = builder.Configuration.GetConnectionString("postgres"); }); });

Can someone help with this?

ReubenBond commented 1 year ago

Are you able to provide any more details? The error message should have included more than that

JaktensTid commented 1 year ago

@ReubenBond here it is

warn: Orleans.Runtime.CallbackData[100157] 2023-01-25T00:00:56.871716+00:00 app[web.1]: Response did not arrive on time in 00:00:30 for message: Request [S172.17.48.38:11111:33609578 symactor/666aeb257e2b47c0ae01dab451f3e10d]->[S172.17.48.38:11111:33609578 virtualmachinespoolactor/virtualMachineActorGrain] Assembly.Orca.Actors.Actors.Interfaces.IVirtualMachinesPoolActor[(Assembly.Orca.Actors.Actors.Interfaces.IVirtualMachinesPoolActor)Assembly.Orca.Actors.Actors.VirtualMachinesPoolActor].GetFreeVM(666aeb25-7e2b-47c0-ae01-dab451f3e10d, 3a23cfa3-4216-43fd-a49f-a6cb68870da2) #53. . About to break its promise. 2023-01-25T00:00:56.872776+00:00 app[web.1]: fail: Orleans.Runtime.GrainTimer[101413] 2023-01-25T00:00:56.872780+00:00 app[web.1]: Caught and ignored exception thrown from timer callback for timer GrainTimer. TimerCallbackHandler:Assembly.Orca.Actors.Actors.SymActor->System.Threading.Tasks.Task MainLoop(System.Object) 2023-01-25T00:00:56.872794+00:00 app[web.1]: System.TimeoutException: Response did not arrive on time in 00:00:30 for message: Request [S172.17.48.38:11111:33609578 symactor/666aeb257e2b47c0ae01dab451f3e10d]->[S172.17.48.38:11111:33609578 virtualmachinespoolactor/virtualMachineActorGrain] Assembly.Orca.Actors.Actors.Interfaces.IVirtualMachinesPoolActor[(Assembly.Orca.Actors.Actors.Interfaces.IVirtualMachinesPoolActor)Assembly.Orca.Actors.Actors.VirtualMachinesPoolActor].GetFreeVM(666aeb25-7e2b-47c0-ae01-dab451f3e10d, 3a23cfa3-4216-43fd-a49f-a6cb68870da2) #53. 2023-01-25T00:00:56.872809+00:00 app[web.1]: at Orleans.Serialization.Invocation.ResponseCompletionSource1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230 2023-01-25T00:00:56.872812+00:00 app[web.1]: at System.Threading.Tasks.ValueTask1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state) 2023-01-25T00:00:56.872813+00:00 app[web.1]: --- End of stack trace from previous location --- 2023-01-25T00:00:56.872813+00:00 app[web.1]: at Assembly.Orca.Actors.States.SymLogicHandler.CreatingAttaching(Guid symId, ISymActor symActor) in /src/Assembly.Orca.Actors/Assembly.Orca.Actors/States/SymLogicHandler.cs:line 54 2023-01-25T00:00:56.872816+00:00 app[web.1]: at Assembly.Orca.Actors.Actors.SymActor.MainLoop(Object parameter) in /src/Assembly.Orca.Actors/Assembly.Orca.Actors/Actors/SymActor.cs:line 54 2023-01-25T00:00:56.872818+00:00 app[web.1]: at Orleans.Runtime.GrainTimer.ForwardToAsyncCallback(Object state) in /_/src/Orleans.Runtime/Timers/GrainTimer.cs:line 139

ReubenBond commented 1 year ago

It's hard to say what is causing this. It's probable that your code is blocking on something, but it's hard to say what. I expected to see a status response to appear in the timeout message, but I'm not seeing that here. Is the system under very high load?

JaktensTid commented 1 year ago

No, but I call Task.Factory.StartNew at some point. Can this cause the issue?

ReubenBond commented 1 year ago

Maybe, what does your call look like?

JaktensTid commented 1 year ago
public async Task<string> ExecuteCommand(IHostCommand command, string ipAddress)
    {
        var sshKey = await _sshKeyReader.GetSshKey();
        var commandText = await command.GetCommandContent();
        return await Task.Factory.StartNew(() =>
        {
            // blocking calls here
        });
    }

Something like this

ReubenBond commented 1 year ago

Oh, you perform blocking calls in there. Do you pass other arguments to Task.Factory.StartNew, like TaskScheduler.Default? You should use Task.Run if this is what you're going to do, or consider something like @Kritner's Orleans.SyncWork package: https://github.com/OrleansContrib/Orleans.SyncWork

JaktensTid commented 1 year ago

No, just plan Task.Factory.StartNew Ok, I'll try. Can you keep this thread opened? I will try this a little bit later

ReubenBond commented 1 year ago

Can you keep this thread opened?

Sure thing, let us know how it goes.

JaktensTid commented 1 year ago

Changed Task.Factory.StartNew to Task.Run, but still it is throwing me the same error. Should the EVERY method of grain take no more than 30 sec? If so why it is working locally?

JaktensTid commented 1 year ago

Is there a way to increase this timeout to a couple of minutes maybe?

ReubenBond commented 1 year ago

You must never block in grain code. You can increase the ClientMessagingOptions.ResponseTimeout and SiloMessagingOptions.ResponseTimeout to increase the duration. What are you doing in these tasks, how long should they take?

JaktensTid commented 1 year ago

Ok, so I just increased ResponseTimeout to 10 mins, since almost all of my methods in the grains will be taking a lot of time and I don't want to bloat the code with state machine pattern

Kritner commented 1 year ago

Ok, so I just increased ResponseTimeout to 10 mins, since almost all of my methods in the grains will be taking a lot of time and I don't want to bloat the code with state machine pattern

I think if you just increase the timeout across the board, you'll theoretically eventually run into a point where orleans is starved and can no longer do acks.

ReubenBond commented 1 year ago

The .NET ThreadPool itself will be becoming starved. You should move this work onto a dedicated pool.

JaktensTid commented 1 year ago

@ReubenBond how to do this? Task.Run?

ReubenBond commented 1 year ago

Task.Run will be better than Task.Factory.StartNew - it won't block your grain's scheduler, at least.

Both will contribute to thread pool starvation, though, so it would be ideal to isolate your work by putting it on an isolated thread pool.

Of course, you will find that at some level of load, the system stops responding in an acceptable period. I don't know what your target request rate & concurrent requests is, or how long each request takes to service, so it's hard to provide any concrete advice other than telling you what not to do.

Kritner commented 1 year ago

fwiw the code at https://github.com/usnistgov/acvp-server effectively uses the package from https://github.com/OrleansContrib/Orleans.SyncWork that @ReubenBond pointed out earlier.

The package Orleans.SyncWork came from learnings in implementing the abstraction for the acvp-server.

This project queues up thousands of blocking grain calls more or less "per request" and handles it fine w/o overloading the cluster (though careful configuration is necessary).

Don't know if this will be a possibility for you, but the package should work for (at least) the specific circumstances we needed.