dotnet / dnceng

.NET Engineering Services

Investigate issues with Ubuntu.2204.amd64 machines losing heartbeats in Helix #3596

Open riarenas opened 3 months ago

riarenas commented 3 months ago

We are seeing a large number of deadVMcleaner cleanups for the Ubuntu amd64 queues.

This Application Insights query shows that the VM cleaner is deleting many unresponsive machines from these queues.

[Image: Application Insights query results]

An initial look at some of these machines showed that they were running out of memory while executing work items, but this needs further investigation.


riarenas commented 3 months ago

The spikes that happen in the .rt queue every week are very noticeable. Is this a weekly run of some potentially unstable tests that might be killing the machines?

riarenas commented 1 month ago

I doubt I'll be able to find anything in 1 day as the issue costing suggests, but I'll start looking.

riarenas commented 1 month ago

I haven't found any obvious commonality in what causes the machines to stop working. It might be worth considering attaching this issue to https://github.com/dotnet/dnceng/issues/4045, as this is probably one of the causes.