dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Orleans stalled #5523

Closed MohamedMohsenRadwan closed 5 years ago

MohamedMohsenRadwan commented 5 years ago

Hi,

We have a project that uses Microsoft Orleans 2.2 on .NET Framework 4.7.1 with SQL Server as storage. When we load test it, it always fails on the fourth day: the silo stops responding to new requests, CPU usage drops to almost 0, and the number of threads increases heavily. The failure messages in the log:

[2019-04-10 20:06:29.125 GMT 13 Warning 101215 Orleans.Runtime.Scheduler.WorkItemGroup] Task [Id=77224074, Status=RanToCompletion] in WorkGroup [Activation: S10.1.0.5:11111:292432306grn/8C9EE89/000026cf@38040cb6 #GrainType=Softec.FMS.Orleans.Implementation.Grains.RawDataSaverGrain Placement=RandomPlacement State=Valid] took elapsed time 0:00:00.8558401 for execution, which is longer than 00:00:00.2000000. Running on thread System.Threading.Thread
[2019-04-10 20:07:38.517 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ReadTableTimer did not fire on time. Last fired at 2019-04-10 20:06:19.090 GMT, 00:01:19.4272142 since previous fire, should have fired after 00:01:00.
[2019-04-10 20:07:38.517 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ProbeTimer did not fire on time. Last fired at 2019-04-10 20:06:20.726 GMT, 00:01:17.7906305 since previous fire, should have fired after 00:00:10.
[2019-04-10 20:07:38.517 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.IAmAliveTimer did not fire on time. Last fired at 2019-04-10 20:01:57.492 GMT, 00:05:41.0251127 since previous fire, should have fired after 00:05:00.
[2019-04-10 20:07:38.517 GMT 37 Warning 102603 Orleans.Runtime.Watchdog] Watchdog had 1 Health Check Failure(s) out of 2 Health Check Participants.
[2019-04-10 20:09:07.479 GMT 17678 Warning 100157 Orleans.Runtime.CallbackData] Response did not arrive on time in 00:00:30 for message: Request S10.1.0.5:11111:292432306grn/7C16CD43/00000000+357454074159300@e982059b->S10.1.0.5:11111:292432306grn/7C16CD43/00000000+357454074159300@e982059b #607940229: . Target History is: <S10.1.0.5:11111:292432306:grn/7C16CD43/00000000+357454074159300:@e982059b>. About to break its promise.
[2019-04-10 20:12:38.527 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ReadTableTimer did not fire on time. Last fired at 2019-04-10 20:06:19.090 GMT, 00:06:19.4374713 since previous fire, should have fired after 00:01:00.
[2019-04-10 20:12:38.527 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ProbeTimer did not fire on time. Last fired at 2019-04-10 20:06:20.726 GMT, 00:06:17.8008876 since previous fire, should have fired after 00:00:10.
[2019-04-10 20:12:38.527 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.IAmAliveTimer did not fire on time. Last fired at 2019-04-10 20:01:57.492 GMT, 00:10:41.0353698 since previous fire, should have fired after 00:05:00.
[2019-04-10 20:12:38.527 GMT 37 Warning 102603 Orleans.Runtime.Watchdog] Watchdog had 1 Health Check Failure(s) out of 2 Health Check Participants.
[2019-04-10 20:17:38.574 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ReadTableTimer did not fire on time. Last fired at 2019-04-10 20:06:19.090 GMT, 00:11:19.4845859 since previous fire, should have fired after 00:01:00.
[2019-04-10 20:17:38.574 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ProbeTimer did not fire on time. Last fired at 2019-04-10 20:14:43.982 GMT, 00:02:54.5922485 since previous fire, should have fired after 00:00:10.
[2019-04-10 20:17:38.574 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.IAmAliveTimer did not fire on time. Last fired at 2019-04-10 20:01:57.492 GMT, 00:15:41.0824844 since previous fire, should have fired after 00:05:00.
[2019-04-10 20:17:38.574 GMT 37 Warning 102603 Orleans.Runtime.Watchdog] Watchdog had 1 Health Check Failure(s) out of 2 Health Check Participants.
[2019-04-10 20:22:38.616 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ReadTableTimer did not fire on time. Last fired at 2019-04-10 20:06:19.090 GMT, 00:16:19.5256852 since previous fire, should have fired after 00:01:00.
[2019-04-10 20:22:38.616 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ProbeTimer did not fire on time. Last fired at 2019-04-10 20:14:43.982 GMT, 00:07:54.6333478 since previous fire, should have fired after 00:00:10.
[2019-04-10 20:22:38.616 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.IAmAliveTimer did not fire on time. Last fired at 2019-04-10 20:01:57.492 GMT, 00:20:41.1235837 since previous fire, should have fired after 00:05:00.
[2019-04-10 20:22:38.616 GMT 37 Warning 102603 Orleans.Runtime.Watchdog] Watchdog had 1 Health Check Failure(s) out of 2 Health Check Participants.
[2019-04-10 20:27:39.500 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ReadTableTimer did not fire on time. Last fired at 2019-04-10 20:06:19.090 GMT, 00:21:20.4096932 since previous fire, should have fired after 00:01:00.
[2019-04-10 20:27:39.500 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ProbeTimer did not fire on time. Last fired at 2019-04-10 20:14:43.982 GMT, 00:12:55.5173558 since previous fire, should have fired after 00:00:10.
[2019-04-10 20:27:39.500 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.IAmAliveTimer did not fire on time. Last fired at 2019-04-10 20:01:57.492 GMT, 00:25:42.0075917 since previous fire, should have fired after 00:05:00.
[2019-04-10 20:27:39.500 GMT 37 Warning 102603 Orleans.Runtime.Watchdog] Watchdog had 1 Health Check Failure(s) out of 2 Health Check Participants.
[2019-04-10 20:32:39.634 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ReadTableTimer did not fire on time. Last fired at 2019-04-10 20:06:19.090 GMT, 00:26:20.5445034 since previous fire, should have fired after 00:01:00.
[2019-04-10 20:32:39.634 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.ProbeTimer did not fire on time. Last fired at 2019-04-10 20:14:43.982 GMT, 00:17:55.6521660 since previous fire, should have fired after 00:00:10.
[2019-04-10 20:32:39.634 GMT 37 Error 101412 Orleans.Runtime.GrainTimer] !!!!!!!!!! Watchdog Freeze Alert: Orleans.Runtime.asynTask.SafeTimerBase.Membership.IAmAliveTimer did not fire on time. Last fired at 2019-04-10 20:01:57.492 GMT, 00:30:42.1424019 since previous fire, should have fired after 00:05:00.
[2019-04-10 20:32:39.634 GMT 37 Warning 102603 Orleans.Runtime.Watchdog] Watchdog had 1 Health Check Failure(s) out of 2 Health Check Participants.

The full log is attached: Log.zip
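
The symptoms above (CPU near zero, thread count climbing, watchdog timers missing their deadlines) are a common signature of thread pool or scheduler starvation. The following is a minimal diagnostic sketch, assuming a plain .NET Framework host; the helper name and logging target are hypothetical and not part of this report:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical diagnostic helper: periodically logs thread pool availability so a
// starvation trend can be correlated with the Watchdog Freeze Alerts in the log.
static class ThreadPoolProbe
{
    private static Timer _timer; // held in a field so the timer is not collected

    public static void Start(TimeSpan interval)
    {
        _timer = new Timer(_ =>
        {
            ThreadPool.GetAvailableThreads(out int freeWorkers, out int freeIo);
            ThreadPool.GetMaxThreads(out int maxWorkers, out int maxIo);
            int osThreads = Process.GetCurrentProcess().Threads.Count;

            Console.WriteLine(
                $"{DateTime.UtcNow:O} busy workers {maxWorkers - freeWorkers}/{maxWorkers}, " +
                $"busy IO {maxIo - freeIo}/{maxIo}, OS threads {osThreads}");
        }, null, TimeSpan.Zero, interval);
    }
}
```

If busy worker counts climb toward the maximum while OS thread count keeps growing, that points at blocked threads rather than at Orleans itself.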

MohamedMohsenRadwan commented 5 years ago

Any help here?

ReubenBond commented 5 years ago

Could you give more details about your environment? OS, .NET Runtime, deployment, etc? Are you able to capture ETW traces and perhaps a memory dump? There are a lot of messages about null state from storage - what's happening when the silo starts up that there are so many activations?

Do you know why the silo is stopping and restarting part-way through those logs?

MohamedMohsenRadwan commented 5 years ago

OS: Windows Server 2012 R2
.NET Runtime: 4.7.1

Memory is stable and there is no memory leak. We are using the Rx Scan operator, which holds some memory until an event has finished. Our system is a car-tracking system that will serve a lot of vehicles, so the number of activations is normal. The memory is also normal: the Scan operator holds the packets per vehicle until an event is finished. For example, while a vehicle is moving, each packet it sends is added to a list in the Scan operator until it stops. We send a close packet at the end of the day by default, so the worst-case event duration is 24 hours. Memory is stable at about 6 GB until the end of the day, then drops to about 4 GB after the open events are closed by our closing packet.

As a step in investigating the problem, we stopped writing the state to the DB to eliminate the blocking I/O possibility. I don't know why the silo is stopping.
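
To make the accumulation pattern described above concrete, here is a minimal sketch, assuming a hypothetical VehiclePacket type with an end-of-event flag (none of these names come from the project, and per-vehicle grouping is omitted for brevity), of how an Rx Scan operator buffers packets until a close packet arrives:

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

// Hypothetical packet type standing in for the real tracking payload.
class VehiclePacket
{
    public string VehicleId { get; set; }
    public bool IsClose { get; set; }   // assumed end-of-event marker (e.g. end-of-day packet)
}

static class TripAccumulator
{
    // Buffers packets with Scan until a close packet is seen, then emits the completed
    // event and starts a new buffer. The buffer lives for the duration of the event,
    // which is why memory only drops after the daily close packets arrive.
    public static IObservable<IReadOnlyList<VehiclePacket>> Trips(
        IObservable<VehiclePacket> packets)
    {
        return packets
            .Scan(new List<VehiclePacket>(), (buffer, packet) =>
            {
                // Start a fresh list once the previous event has been closed.
                var current = buffer.Count > 0 && buffer[buffer.Count - 1].IsClose
                    ? new List<VehiclePacket>()
                    : buffer;
                current.Add(packet);
                return current;
            })
            .Where(buffer => buffer[buffer.Count - 1].IsClose)
            .Select(buffer => (IReadOnlyList<VehiclePacket>)buffer);
    }
}
```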

ReubenBond commented 5 years ago

It appears that something is telling the silo to stop. How is the silo being hosted? Could you include the Program.cs?

MohamedMohsenRadwan commented 5 years ago

It is deployed as a Windows Service. This is the hosting code:

```csharp
try
{
    var siloBuilder = new SiloHostBuilder()
        .UseAdoNetClustering(options =>
        {
            options.ConnectionString = ConfigurationManager.ConnectionStrings["OrleansMembership"].ConnectionString;
            options.Invariant = ConfigurationManager.ConnectionStrings["OrleansMembership"].ProviderName;
        })
        .AddAdoNetGrainStorage("PubSubStore", options =>
        {
            options.ConnectionString = ConfigurationManager.ConnectionStrings["OrleansStorage"].ConnectionString;
            options.Invariant = ConfigurationManager.ConnectionStrings["OrleansStorage"].ProviderName;
            options.UseJsonFormat = true;
        })
        .Configure<ClusterOptions>(options =>
        {
            options.ClusterId = "fms";
            options.ServiceId = "SvcId";
        })
        .UseAdoNetReminderService(options =>
        {
            options.ConnectionString = ConfigurationManager.ConnectionStrings["OrleansStorage"].ConnectionString;
            options.Invariant = ConfigurationManager.ConnectionStrings["OrleansStorage"].ProviderName;
        })
        .Configure<EndpointOptions>(options => options.AdvertisedIPAddress = System.Net.IPAddress.Loopback)
        .ConfigureApplicationParts(parts => parts.AddApplicationPart(typeof(DeviceAvatarGrain).Assembly).WithReferences())
        .ConfigureLogging(logging =>
        {
            logging.AddConsole().AddDebug().AddProvider(new FileLoggerProvider(@"C:\Log\Log.txt"));
            // logging.AddEventSourceLogger();
            logging.SetMinimumLevel(LogLevel.Trace);
        });
        // .UseDashboard()

    _siloHost = siloBuilder.Build();

    // Blocks the service's start thread until the silo has started.
    _siloHost.StartAsync().Wait();
}
catch (Exception ex)
{
    Logger.Log(ex);
}
```

Also attached: Program.cs and the code of the Windows Service that calls it.

Program.zip

ReubenBond commented 5 years ago

It's hard to determine the issue without seeing ETW traces or a memory dump. It is likely not an Orleans issue. It could be blocking, too many timers, too many allocations, or something else.

Could you capture ETW traces and memory dumps?
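
To illustrate the "blocking" possibility mentioned above, here is a minimal sketch (the grain and method names are hypothetical, not from this project) contrasting a synchronous wait inside a grain with the awaited equivalent:

```csharp
using System.Threading.Tasks;
using Orleans;

// Hypothetical grain used only to illustrate the difference; the project's real
// grains (e.g. RawDataSaverGrain) are not shown in this thread.
public interface ISaverGrain : IGrainWithStringKey
{
    Task SaveBlocking(byte[] payload);   // anti-pattern, shown for contrast
    Task SaveAwaited(byte[] payload);    // preferred
}

public class SaverGrain : Grain, ISaverGrain
{
    public Task SaveBlocking(byte[] payload)
    {
        // .Wait()/.Result block an Orleans worker thread while the I/O completes.
        // Under load this starves the scheduler, and watchdog timers stop firing on time.
        WriteToDatabaseAsync(payload).Wait();
        return Task.CompletedTask;
    }

    public async Task SaveAwaited(byte[] payload)
    {
        // await releases the thread while the I/O is in flight.
        await WriteToDatabaseAsync(payload);
    }

    private Task WriteToDatabaseAsync(byte[] payload)
    {
        // Placeholder for the real storage call.
        return Task.Delay(10);
    }
}
```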

IbrahimAhmed201 commented 5 years ago

Hi @ReubenBond, I am working with @MohamedMohsenRadwan on the team. We have started the test again from the beginning and will provide a memory dump when the service stops.

ReubenBond commented 5 years ago

Thank you, @IbrahimAhmed201

ReubenBond commented 5 years ago

Closing due to inactivity. Please feel free to message here and we can re-open.