Shopify / yjit-metrics

"Tasks for benchmarking, building and collecting stats for YJIT"
MIT License

GitHub Action to check instance running time #351

Closed rwstauner closed 3 days ago

rwstauner commented 6 days ago

Currently benchmarks are taking under 4 hours to run. Reporting takes about 20 minutes.

We should add a job (or maybe add a check to the reporting job) to ensure that the benchmark instances haven't been running longer than 5 or 6 hours.

The Intel instance got stuck during a benchmark recently, so we need to just shut the instances down if a run is taking too long.

rwstauner commented 5 days ago

I added a check to the reporting job that aborts (which would then alert us). We should still add something to shut all of them down (reporting included) automatically.
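For illustration, a check like the one described above could look roughly like the following. This is a minimal sketch and not the check that was actually added to the reporting job: the function name `check_runtime`, the 6-hour limit (the thread suggests 5 or 6 hours), and the idea of passing in the instance launch time are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: the thread suggests 5 or 6 hours.
MAX_RUNTIME = timedelta(hours=6)


def check_runtime(launch_time, now=None, limit=MAX_RUNTIME):
    """Return the elapsed runtime, or raise if it exceeds `limit`.

    `launch_time` is a timezone-aware datetime for when the instance
    started (hypothetical input; how it is obtained is not shown here).
    """
    now = now or datetime.now(timezone.utc)
    elapsed = now - launch_time
    if elapsed > limit:
        # Raising makes the calling job fail, which is what triggers the alert.
        raise RuntimeError(
            f"instance has been running for {elapsed}, limit is {limit}"
        )
    return elapsed
```

Failing the job only alerts; it does not stop anything, which is why a separate shutdown step is still needed.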

maximecb commented 5 days ago

Yes, I think we do need to kill them after some kind of timeout, to avoid situations where we could accidentally leave machines running for a long time, incurring charges.

Would it make sense to have some kind of shutdown timer on the benchmark instances, something that's going to automatically shut down the machine no matter what after 8 hours for example?
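One way to picture the on-instance timer suggested above is to schedule a delayed power-off when the benchmark run starts. The sketch below assumes a Linux instance where the standard `shutdown` command is available and the caller has permission to run it; the helper names are hypothetical and the 8-hour value is just the example from the comment.

```python
import subprocess


def shutdown_timer_command(hours):
    """Build a `shutdown` command that halts the machine after `hours` hours."""
    minutes = int(hours * 60)
    # `shutdown -h +N` schedules a halt N minutes from now.
    return ["shutdown", "-h", f"+{minutes}"]


def arm_shutdown_timer(hours=8):
    """Schedule the shutdown (requires root; hypothetical helper)."""
    subprocess.run(shutdown_timer_command(hours), check=True)
```

A timer like this runs on the machine itself, so it would not help if the whole server locks up.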

rwstauner commented 4 days ago

I think we need to do that check externally (in case the server locks up). We could just schedule another GitHub Actions job to look for instances that have been running too long and stop them via the AWS API.
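A scheduled job along those lines might be sketched as follows. This is an assumption-laden illustration, not the project's implementation: it uses the third-party `boto3` library, the tag value `yjit-benchmark` is hypothetical, the 6-hour limit is a guess based on the numbers in this thread, and pagination of the results is not handled. It also assumes `LaunchTime` reflects the start of the current run.

```python
from datetime import datetime, timedelta, timezone


def overdue_instance_ids(reservations, now, limit=timedelta(hours=6)):
    """Return IDs of running instances whose LaunchTime is older than `limit`.

    `reservations` has the shape of the "Reservations" list returned by
    EC2 DescribeInstances.
    """
    overdue = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            if inst["State"]["Name"] != "running":
                continue
            if now - inst["LaunchTime"] > limit:
                overdue.append(inst["InstanceId"])
    return overdue


def stop_overdue_instances(tag_name="yjit-benchmark"):
    """Find and stop overdue instances (tag name is a hypothetical example)."""
    import boto3  # third-party; imported here so the helper above has no dependencies

    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:Name", "Values": [tag_name]},
        ]
    )
    ids = overdue_instance_ids(resp["Reservations"], datetime.now(timezone.utc))
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return ids
```

Keeping the age check in a separate pure function means it can be tested without AWS credentials, and the job still works when the instance itself is unresponsive.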

rwstauner commented 4 days ago

Today the new arm instance got stuck during the lobsters benchmark 😞

maximecb commented 4 days ago

Oh, like a deadlock in the benchmark?

Should we pair to try to debug an arm64 issue? Though I guess it could have something to do with upstream Ruby as well.

rwstauner commented 3 days ago

Seems that way, but I reran the same commit and it finished the second time.

Since it happened to the Intel instance a few days ago and the ARM one this time, I more suspect it's just something about the AWS virtualization.

I didn't see any relevant messages in the journal but it's possible there are some more system services I could disable.