12 hours sounds a bit tight to me -- you might genuinely want to test something that runs that long. But it does have the advantage that you won't wind up scheduling multiple runs that step on each other. Up to you.
Possibly relevant: @maximecb has talked about moving to a more expensive ARM instance than the ancient Gravitons we currently test on. Right now our run length is very much determined by those Graviton instances. x86 finishes much faster. So if that changes, 12 hours could become luxuriously long compared to the run length.
I think a timeout is a good idea.
There are also other tricks we can play. We could have a timeout for each benchmark run in the harness itself: assuming that's not already the case, allow up to a maximum of one minute for warmup and another minute for benchmarking (or twice that). It would increase the standard deviation a bit, but it would ensure that all the benchmark runs terminate relatively quickly, even on older machines.
It's more important to ensure that we get fresh results quickly than to have the most accurate measurement possible.
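To make the idea concrete, here's a minimal sketch of per-phase time budgets in a harness, assuming a Python-style runner that shells out to each benchmark. The command, flags, and one-minute budgets are illustrative, not the actual harness API:

```python
import subprocess

# Illustrative budgets in seconds; "or twice that much" would just double these.
WARMUP_BUDGET = 60
BENCH_BUDGET = 60

def run_phase(cmd, budget):
    """Run one phase of a benchmark, killing it if it overruns its budget."""
    try:
        subprocess.run(cmd, timeout=budget, check=True)
        return True
    except subprocess.TimeoutExpired:
        # Record the overrun and move on instead of hanging the whole run.
        print(f"timed out after {budget}s: {' '.join(cmd)}")
        return False

# Hypothetical benchmark invocation: warmup and measurement capped separately.
if run_phase(["ruby", "benchmarks/example.rb", "--warmup"], WARMUP_BUDGET):
    run_phase(["ruby", "benchmarks/example.rb", "--bench"], BENCH_BUDGET)
```

Capping the two phases separately matches the tradeoff above: a slow machine loses some measurement accuracy (higher variance), but every benchmark still finishes within a bounded time.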
I proposed a simple PR that puts a 12-hour timeout on the job, since the average run seems to be about 9 hours. If we change hardware, we can easily adjust.
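For reference, assuming the job runs on GitHub Actions on self-hosted runners (GitHub-hosted runners already cap jobs at 6 hours), this is a one-line workflow change via the built-in `timeout-minutes` key; the job name here is hypothetical:

```yaml
jobs:
  benchmark-run:           # hypothetical job name
    runs-on: self-hosted
    timeout-minutes: 720   # kill the job after 12 hours
```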
It's not clear where the last run got stuck, so it's hard to say whether this would have fixed it. It seemed to hang between steps, so while putting timeouts into the benchmark scripts themselves may be something we want to do, it probably wouldn't have helped with the problem we saw last time.
A build got stuck for multiple days; we should add a timeout to the configurable run job (the timeout itself doesn't necessarily need to be configurable). Maybe 24h? Or 12h?