Open hendrikmakait opened 11 months ago
I think we should be a bit mindful about which tests to restart. For instance, we have a couple of long running tests (of questionable value) that run for +5min. If there is a worker restart, I might just scrap the result instead of restarting it.
During implementation we should also implement a counter of some sort that allows us to monitor this somehow. I just would like to know if this happens every time (in case we then could just go to on-demand anyhow) or just sporadically
During implementation we should also implement a counter of some sort that allows us to monitor this somehow.
I don't really care how as long as it's low tech. For instance, have a global fixture that counts this and returns the value somewhere
It also wouldn’t be too hard to get data about spot instance replacement from database
On Fri, Dec 22, 2023 at 2:26 AM Florian Jetter @.***> wrote:
During implementation we should also implement a counter of some sort that allows us to monitor this somehow.
I don't really care how as long as it's low tech. For instance, have a global fixture that counts this and returns the value somewhere
— Reply to this email directly, view it on GitHub https://github.com/coiled/benchmarks/issues/1241#issuecomment-1867448794, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALBFDHQANGMWDR6CX2NKKO3YKVG4PAVCNFSM6AAAAABA6POZV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNRXGQ2DQNZZGQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>
It also wouldn’t be too hard to get data about spot instance replacement from database
Is this fast and easy for you? If so, this would be very helpful for prioritizing this issue.
When one instance is the replacement for another spot termination, we track this in a "replacement for" column on the process.
Here's a query using that:
select inst.cluster_id, awsinstopts.spot, instset.name, count(*) n_workers
from declarative_instancemodel inst
join declarative_processmodel proc on proc.instance_id = inst.id
join declarative_instancesetmodel instset on instance_set_id = instset.id
join declarative_instancespecmodel instspec on instset.spec_id = instspec.id
join declarative_awsinstanceoptionsmodel awsinstopts on awsinstopts.id = instset.object_id
where inst.account_id = 6697 and inst.creator_id = 5144
and instspec.name = 'worker'
and inst.created > now() - interval '7 days'
and proc.replacement_for_id is not null
group by inst.cluster_id, awsinstopts.spot, instset.name
Thanks @ntabris! From what I see, this fluctuates quite a bit, I've checked the last 4 weeks and these are the counts:
Currently, we start benchmark clusters with the
spot_with_fallback
policy. While this makes sense from a cost perspective, spot replacement will mess up the results. When running benchmarks, we should track whether the workers have changed during the test and restart if they did. This might remove some noise (though I have no idea how much).