fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Update manifest with new SLURM requirements #358

Closed: tcompa closed this issue 1 year ago

tcompa commented 1 year ago

Let's spell out the CPU/memory/GPU requirements for all tasks. Here is a starting point:

Ref:
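For concreteness, here is a rough sketch of what per-task resource metadata could look like. The field names (`cpus_per_task`, `mem`, `needs_gpu`), the task keys, and the numbers are illustrative placeholders only, not the final manifest format, and they would need to be refined once actual usage is measured:

```python
# Illustrative sketch only: field names, task keys, and values are assumptions,
# to be revised after profiling the real tasks. "mem" is meant in MB.
TASK_RESOURCES = {
    "illumination_correction": {"cpus_per_task": 1, "mem": 4000},
    "napari_workflows_wrapper": {"cpus_per_task": 1, "mem": 4000},
    "cellpose_segmentation": {"cpus_per_task": 1, "mem": 8000, "needs_gpu": True},
}
```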

jluethi commented 1 year ago

We'll need to test this to see what the actual usage is. I'm not sure I have a great intuition for how memory-efficient our current illumination correction task, for example, actually is.

On the GPU side, we always get the full node at UZH, right? But at FMI, we only get what we request, and other people can run things on the same node as well. We could start with somewhat lower defaults. And, if I understand correctly, we wouldn't hit a SLURM error on the UZH side if we request 16 GB and use 20 GB of RAM, because the rest of the node is free at that moment anyway, right?

tcompa commented 1 year ago

But at FMI, we only get what we request

(You do get the entire node's memory, but at some point the cgroup out-of-memory handler may kill your SLURM job, as in https://github.com/fractal-analytics-platform/fractal-server/issues/343.)

And, if I understand correctly, we wouldn't hit a slurm error on the UZH side if we request 16 GB and use 20 GB of RAM, because the rest of the node is anyway free at that moment, right?

Agreed, although that's not something I would rely on long-term. Once we test things a bit further, we should not request 16 GB if we know that 20 GB are needed.
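One way to move from guesses to measurements is to record the peak resident memory of a task run. Below is a minimal sketch using the standard-library `resource` module (Linux semantics assumed, with `run_task` as a hypothetical stand-in for any Fractal task, not an actual function in fractal-tasks-core):

```python
# Minimal sketch: measure peak RSS of the current process after a task run,
# so that SLURM memory requests can be based on observed usage.
import resource


def run_task():
    # Hypothetical placeholder for e.g. the illumination-correction task.
    data = bytearray(500 * 1024 * 1024)  # allocate ~500 MB as a stand-in workload
    return len(data)


run_task()
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux
print(f"Peak RSS: {peak_kb / 1024:.0f} MB")
```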

tcompa commented 1 year ago

napari-workflows: 1 cpu, 4G => decent start, may vary depending on workflow and ROIs

Any reason for not increasing the number of CPUs a bit?

When running the CI (which uses some very small test datasets), the napari-workflows task reaches about 800% CPU usage through multithreading after a few seconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  16899 tommaso   20   0 3510400 783912 190976 S 793.4   4.9   0:30.04 /home/tommaso/.cache/pypoetry/virtualenvs/fractal-tasks-core-UoMDyr20-py3.10/bin/python /home/tommaso/.cache/pypoetry/virtualenvs/fractal-tasks-core-UoMDyr20-py3.10/bin/pytest tests/test_workflows_napari_workflows.py
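If we advertise a fixed CPU count per task, it may also make sense to cap the thread pools on the task side so that actual usage stays close to the request. Here is a minimal sketch assuming the `threadpoolctl` package (which is not a stated dependency of fractal-tasks-core) and a hypothetical wrapper call:

```python
# Sketch: limit BLAS/OpenMP thread pools so that CPU usage stays close to the
# cpus_per_task value requested from SLURM. threadpoolctl and the wrapper
# function below are assumptions for illustration, not the actual task API.
from threadpoolctl import threadpool_limits

N_CPUS = 8  # should match the value declared in the manifest


def run_napari_workflow():
    # Hypothetical placeholder for the actual napari-workflows task.
    pass


with threadpool_limits(limits=N_CPUS):
    run_napari_workflow()
```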
jluethi commented 1 year ago

Ah, thanks for the profiling! Yeah, then let's go with 8 CPUs and 32G of RAM for napari workflows by default :)
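For reference, a sketch of what the corresponding manifest entry might then contain, using the same hypothetical field names as above:

```python
# Hypothetical sketch reflecting the values agreed above; field names are
# illustrative assumptions, "mem" in MB.
NAPARI_WORKFLOWS_META = {
    "cpus_per_task": 8,
    "mem": 32000,  # 32G of RAM
}
```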