Open ranchodeluxe opened 2 months ago
@ranchodeluxe can you provide metrics (or job tags) so that we can look at the typical run time of the archival runs and calculate the options?
Any runs since 2024-08-05 using the tag job-eis-feds-dask-coordinator-v3:1.2.1 would be good to look at.
This is related to
Problem
Recently, after some larger algorithm updates, we've been running archival jobs and other jobs that almost reach completion but are then killed and have to be restarted. Some of these jobs take hours, so rerunning them from scratch wastes a lot of compute and time.
Solution
We'd like the largest jobs, those that go into the maap-dps-eis-worker-128gb queue, to stop using SPOT instances.
Describe alternatives you've considered
None
Additional context
None