dnanexus / dx-toolkit

DNAnexus platform client libraries
http://autodoc.dnanexus.com/
Apache License 2.0

APPS-2668 docker legacy flag #1389

Open mhrvol opened 1 month ago

mhrvol commented 1 month ago

I am a bit unsure how to test this.

Even when working locally with Nextflow, Docker always seems to expose all CPUs of the host machine to the container — so `nproc` inside a Nextflow task always reports the host CPU count, even when the cpus parameter is set to a lower number (the same happens with `docker run --cpus ...`).

It is possible to check the maximum number of CPUs the container may use by reading the cgroup v2 file: `cat /sys/fs/cgroup/cpu.max`

When running `docker run --cpus x`, that file contains `x00000 100000` — a quota and a period (both in microseconds), meaning the container may use at most `quota` microseconds of CPU time in each `period`-microsecond window. So on an 8-CPU host the maximum value is `800000 100000`, which allows use of all CPUs, while `200000 100000` would mean we are effectively allowing it to use only 2 CPUs at a time.
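To make the quota/period arithmetic above concrete, here is a minimal shell sketch; the `effective_cpus` helper is my own illustration, not anything from dx-toolkit or Docker:

```shell
# Hypothetical helper: compute the effective CPU limit from a cgroup v2
# cpu.max line of the form "<quota> <period>" (microseconds), where a
# quota of "max" means no limit is enforced.
effective_cpus() {
  quota=${1%% *}    # first field: quota in microseconds, or "max"
  period=${1##* }   # second field: scheduling period in microseconds
  if [ "$quota" = "max" ]; then
    echo "unlimited"
  else
    # integer division; a fractional limit like --cpus 1.5 would need
    # floating-point arithmetic instead
    echo $(( quota / period ))
  fi
}

effective_cpus "800000 100000"   # → 8 (all CPUs on an 8-core host)
effective_cpus "200000 100000"   # → 2 (e.g. docker run --cpus 2)
effective_cpus "max 100000"      # → unlimited
```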

However, this does not seem to apply to Nextflow: when I check the same file while running Nextflow with Docker locally, I always get `max 100000`.

My Docker version is 26.1.2, build 211e74b240; my Nextflow version is 23.10.1.5891.

I am a bit unsure whether Nextflow limits Docker containers differently, or how exactly the `--cpus` flag works.

EDIT:

It might be possible to test this using sleep and observing how many CPUs are used at a time. Interestingly, when I set cpus to 1, all processes finished at the same time; when I set it to the maximum (8 on my machine), the processes finished one by one.

mhrvol commented 1 month ago

Manual test here: --cpus 2: https://staging.dnanexus.com/panx/projects/GG76K7j01xy3fZ7zFZ94YBY8/monitor/job/Gpkb08j01xyBf874ZxKFFBzF

--cpus 1: https://staging.dnanexus.com/panx/projects/GG76K7j01xy3fZ7zFZ94YBY8/monitor/job/Gpkb4JQ01xy94V941k3427Vg

Nextflow behaves differently than I expected: when you set cpus to 1, all CPUs are used; when you set it to `nproc`, only 1 is used.

Also note that the "second batch" of the `--cpus 2` run does not start at the same time, but that is due to overhead (either on our side or on the Nextflow side).

mhrvol commented 1 month ago

The manual test did not work as expected. This is because Nextflow handles the cpus directive differently when running locally versus in a cloud environment: in a cloud environment, the cpus directive does not affect the number of instances/nodes used, but rather the CPU allocation for each process.

I find this difficult to test; maybe it could be tested by running something multi-threaded in every process and checking the CPU usage.
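As one low-effort variant of that idea, a task could simply report both the CPUs its tools can see and the cgroup v2 quota actually enforced on it, so the two numbers can be compared in the job logs. A sketch (the `report_cpu_limits` name is mine; nothing here is dx-toolkit API):

```shell
# Hypothetical probe to run inside a task's container: print the CPU count
# the tools see (nproc) next to the cgroup v2 limit actually enforced.
report_cpu_limits() {
  echo "visible CPUs: $(nproc)"
  if [ -r /sys/fs/cgroup/cpu.max ]; then
    # cpu.max holds "<quota> <period>" in microseconds, or "max <period>"
    read -r quota period < /sys/fs/cgroup/cpu.max
    if [ "$quota" = "max" ]; then
      echo "cgroup limit: unlimited"
    else
      echo "cgroup limit: $(( quota / period )) CPUs"
    fi
  else
    echo "cgroup limit: cpu.max not readable (cgroup v1 host?)"
  fi
}

report_cpu_limits
```

If `nproc` reports more CPUs than the cgroup limit allows, the container sees the host's CPUs but cannot actually use them all, which is exactly the discrepancy discussed above.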

Removing this Docker legacy flag does not seem to break anything (though I did not try pipelines like Sarek). It seems that with this flag, Nextflow by default uses "all that's available", which seems like the only use case our customers could really want (why wouldn't they utilize all the CPUs in an instance?).

Given this investigation, I am in favor of removing it if all our current tests pass. @r-i-v-a, what do you think?