Fire Atlas Team wants to stop using SPOT instances for the 128gb Queue

MAAP-Project / Community

Issue for MAAP (Zenhub)


Fire Atlas Team wants to stop using SPOT instances for the 128gb Queue #1046

Open ranchodeluxe opened 2 months ago

ranchodeluxe commented 2 months ago

Problem

Recently, after some larger algorithm updates, we've been running archival jobs and other jobs that almost make it to completion but are then killed and have to be restarted. Some of these jobs take hours, so rerunning them from scratch wastes a lot of compute and time.

Solution

We'd like the largest jobs, the ones that go into the maap-dps-eis-worker-128gb queue, to stop using SPOT instances.

Describe alternatives you've considered

None

Additional context

None

wildintellect commented 2 months ago

@ranchodeluxe can you provide metrics (or job tags) so that we can look at the typical run time of the archival runs and calculate the options?

ranchodeluxe commented 2 months ago

> @ranchodeluxe can you provide metrics (or job tags) so that we can look at the typical run time of the archival runs and calculate the options.

Any runs since 2024-08-05 that use the tag job-eis-feds-dask-coordinator-v3:1.2.1 would be good to look at.

wildintellect commented 2 months ago

This is related to #1053.
