rwstauner closed this 3 days ago
I added a check to the reporting job that aborts (which would then alert us). We should still add something to shut all of them down (reporting included) automatically.
Yes, I think we do need to kill them after some kind of timeout, to avoid situations where we could leave machines running for a long time and incur charges.
Would it make sense to have some kind of shutdown timer on the benchmark instances, something that's going to automatically shut down the machine no matter what after 8 hours for example?
I think we need to do that check externally (in case the server locks up). We could just schedule another GitHub Actions job to look for instances that have been running too long and stop them via the AWS API.
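A rough sketch of what that external check could look like, in Python with boto3. This is only an illustration, not what is deployed: the 6-hour limit, the `benchmark` tag key, the region, and both function names are assumptions made for the example. It relies on EC2 reporting `LaunchTime` as the time of the most recent start, which is how I understand the API to behave.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the "5 or 6 hours" limit suggested in this thread.
MAX_RUNTIME = timedelta(hours=6)


def overdue_instance_ids(reservations, now, max_runtime=MAX_RUNTIME):
    """Return IDs of running instances launched more than max_runtime ago.

    `reservations` has the shape of the "Reservations" list returned by
    EC2 DescribeInstances; `now` must be a timezone-aware datetime.
    """
    overdue = []
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            if instance["State"]["Name"] != "running":
                continue
            if now - instance["LaunchTime"] > max_runtime:
                overdue.append(instance["InstanceId"])
    return overdue


def stop_overdue_instances(tag_key="benchmark", region="us-east-1"):
    """Stop tagged instances that have been running too long.

    The tag key and region are hypothetical placeholders.
    """
    import boto3  # third-party; imported here so the helper above has no dependencies

    ec2 = boto3.client("ec2", region_name=region)
    filters = [
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag-key", "Values": [tag_key]},
    ]
    reservations = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=filters):
        reservations.extend(page["Reservations"])

    ids = overdue_instance_ids(reservations, datetime.now(timezone.utc))
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return ids
```

A scheduled workflow could call `stop_overdue_instances()` and fail the job whenever it returns a non-empty list, so we still get alerted that something got stuck. Keeping the age check in a separate function means it can be tested without AWS credentials.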
Today the new arm instance got stuck during the lobsters benchmark 😞
Oh, like a deadlock in the benchmark?
Should we pair to try to debug an arm64 issue? Though I guess it could have something to do with upstream Ruby as well.
Seems that way, but I reran the same commit and it finished the second time.
Since it happened to the intel instance a few days ago and the arm one this time, I suspect it's more likely just something about the AWS virtualization.
I didn't see any relevant messages in the journal, but it's possible there are some more system services I could disable.
Currently benchmarks are taking under 4 hours to run. Reporting takes about 20 minutes.
We should add a job (or maybe add a check to the reporting job) to ensure that the benchmark instances haven't been running longer than 5 or 6 hours.
The intel instance got stuck during a benchmark recently, so we need to just shut them down if a run is taking too long.