Restart benchmarks if spot instances get replaced

coiled / benchmarks

BSD 3-Clause "New" or "Revised" License

34 stars 17 forks source link

Restart benchmarks if spot instances get replaced #1241

Open hendrikmakait opened 11 months ago

hendrikmakait commented 11 months ago

Currently, we start benchmark clusters with the spot_with_fallback policy. While this makes sense from a cost perspective, spot replacement will mess up the results. When running benchmarks, we should track whether the workers have changed during the test and restart if they did. This might remove some noise (though I have no idea how much).

fjetter commented 11 months ago

I think we should be a bit mindful about which tests to restart. For instance, we have a couple of long running tests (of questionable value) that run for +5min. If there is a worker restart, I might just scrap the result instead of restarting it.

During implementation we should also implement a counter of some sort that allows us to monitor this somehow. I just would like to know if this happens every time (in case we then could just go to on-demand anyhow) or just sporadically

fjetter commented 11 months ago

During implementation we should also implement a counter of some sort that allows us to monitor this somehow.

I don't really care how as long as it's low tech. For instance, have a global fixture that counts this and returns the value somewhere

ntabris commented 11 months ago

It also wouldn’t be too hard to get data about spot instance replacement from database

On Fri, Dec 22, 2023 at 2:26 AM Florian Jetter @.***> wrote:

During implementation we should also implement a counter of some sort that allows us to monitor this somehow.

I don't really care how as long as it's low tech. For instance, have a global fixture that counts this and returns the value somewhere

— Reply to this email directly, view it on GitHub https://github.com/coiled/benchmarks/issues/1241#issuecomment-1867448794, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALBFDHQANGMWDR6CX2NKKO3YKVG4PAVCNFSM6AAAAABA6POZV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNRXGQ2DQNZZGQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

hendrikmakait commented 11 months ago

It also wouldn’t be too hard to get data about spot instance replacement from database

Is this fast and easy for you? If so, this would be very helpful for prioritizing this issue.

ntabris commented 11 months ago

When one instance is the replacement for another spot termination, we track this in a "replacement for" column on the process.

Here's a query using that:

select inst.cluster_id, awsinstopts.spot, instset.name, count(*) n_workers
from declarative_instancemodel inst
  join declarative_processmodel proc on proc.instance_id = inst.id
  join declarative_instancesetmodel instset on instance_set_id = instset.id
  join declarative_instancespecmodel instspec on instset.spec_id = instspec.id
  join declarative_awsinstanceoptionsmodel awsinstopts on awsinstopts.id = instset.object_id
where inst.account_id = 6697 and inst.creator_id = 5144 
and instspec.name = 'worker'
and inst.created > now() - interval '7 days'
and proc.replacement_for_id is not null
group by inst.cluster_id, awsinstopts.spot, instset.name

hendrikmakait commented 11 months ago

Thanks @ntabris! From what I see, this fluctuates quite a bit, I've checked the last 4 weeks and these are the counts: