JuliaHealth / KomaMRI.jl

Koma is a Pulseq-compatible framework to efficiently simulate Magnetic Resonance Imaging (MRI) acquisitions. The main focus of this package is to simulate general scenarios that could arise in pulse sequence development.
https://JuliaHealth.github.io/KomaMRI.jl
MIT License

Don't run CPU benchmarks on GPU workers #501

Open maleadt opened 3 hours ago

maleadt commented 3 hours ago

Please don't use the GPU workers for these long-running jobs:

https://github.com/JuliaHealth/KomaMRI.jl/blob/00a8d8a531b96a95447faaa883d88617736a7da9/.buildkite/runtests.yml#L4-L33

cncastillo commented 3 hours ago

I am slightly confused; the code block you show is for CPU tests, and the title is for CPU benchmarks.

The CPU test should take around 4 minutes.

The CPU benchmark should take around 6 minutes each (4 jobs running in parallel).

I can delete both if that is what you meant (to use Buildkite only for GPUs), just trying to understand which one was generating problems.

Sorry for the inconvenience.

maleadt commented 3 hours ago

> I am slightly confused; the code block you show is for CPU tests, and the title is for CPU benchmarks.

Right, so why does it run on GPU workers?

         agents: 
           queue: "juliagpu" 

It's probably better to use juliaecosystem for this.


> The CPU test should take around 4 minutes.

It doesn't, though: https://buildkite.com/julialang/komamri-dot-jl/builds/1220#0192be92-3897-497e-8654-6ffcdf6f7cc1. This ran for 40 minutes doing CPU benchmarks on a GPU worker before I canceled it due to some system maintenance.

Here's another instance: https://buildkite.com/julialang/komamri-dot-jl/builds/1220#0192be92-3417-48e3-8401-b2853349fa0f. It seems that they get stuck at some point?

pvillacorta commented 2 hours ago

I have just seen this, sorry for these last commits I made 😓

cncastillo commented 20 minutes ago

> It's probably better to use juliaecosystem for this.

Oh, my bad; we can move the CPU jobs to queue: "juliaecosystem", if that is fine, and set timeout_in_minutes: 10 for all jobs. Does that sound reasonable?
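
A minimal sketch of what one such CPU step might look like after the change. The label and the command are hypothetical placeholders, not copied from .buildkite/runtests.yml; only the queue name and the timeout value come from this thread.

```yaml
# Hypothetical CPU step; label and command are illustrative assumptions.
steps:
  - label: "CPU tests"
    agents:
      queue: "juliaecosystem"   # CPU queue suggested above, instead of "juliagpu"
    timeout_in_minutes: 10      # Buildkite cancels the job if it runs longer, e.g. when it hangs
    command: |
      julia --project -e 'using Pkg; Pkg.test()'
```

With a per-step timeout, a job that gets stuck would be canceled after 10 minutes instead of occupying a worker indefinitely.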

> Here's another instance: https://buildkite.com/julialang/komamri-dot-jl/builds/1220#0192be92-3417-48e3-8401-b2853349fa0f. It seems that they get stuck at some point?

This is surprising; it seems to get stuck during package precompilation.