Additionally: does the `--ntasks` argument have to match the number of workers? If I have asked for `--ntasks=4`, can I run with `--workers 6`?
For the first question, you have to specify that you want a GPU with >= 24GB of VRAM. The `--mem` option sets the minimum amount of CPU RAM, not GPU memory. You should follow these instructions to select a GPU with a minimum amount of VRAM: https://doc.eresearch.unige.ch/hpc/slurm#gpgpu_jobs

For example, `--gres=gpu:1,VramPerGpu:24G` indicates that you want one GPU with a minimum of 24 GB of VRAM.
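If you instead need a specific GPU model rather than a minimum amount of VRAM, Slurm also supports typed GPU requests when the cluster defines GPU types in its gres configuration. The type name below is only an illustrative placeholder; the actual names are cluster-specific, so check the documentation linked above:

```bash
# Request one GPU of a specific type. Type names depend on the cluster's
# gres configuration; "pascal" here is only a hypothetical example.
salloc --partition=shared-gpu --time=01:00:00 --gres=gpu:pascal:1
```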
Now for the second question: there are mainly two ways of doing parallelism on a CPU, threads and processes. My understanding is that the `--workers` parameter specifies the number of threads used when running on a CPU. With Slurm, `--ntasks` specifies the number of processes (or tasks) to use. If you want to use multiple threads, you should set the option `--cpus-per-task` to the same value as `--workers`. Hope this helps.
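As a minimal sketch of what that looks like in a batch script (the training command and its `--workers` flag below are placeholders standing in for whatever you actually run):

```bash
#!/bin/bash
#SBATCH --partition=shared-gpu
#SBATCH --time=01:00:00
#SBATCH --ntasks=1            # one process
#SBATCH --cpus-per-task=6     # must match the --workers value below
#SBATCH --mem=10GB            # CPU RAM, not GPU memory

# Hypothetical training invocation; replace with your real command.
srun my-training-command --workers 6
```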
To add something: I think at some point I told you to use `--ntasks=4` to avoid the DataLoader crash. In fact this was just increasing the available CPU memory, because the more tasks are allocated, the more memory is allocated. The correct way to do it is to set `--mem` to a large enough value and `--cpus-per-task` to the same value as `--workers`. And if you use a GPU, set `VramPerGpu` to a value at least equal to `--mem`.
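Putting all of that together, an allocation request might look like this (the values are illustrative, not recommendations):

```bash
# CPU RAM large enough for the DataLoader workers, CPU count matched to
# --workers, and at least as much VRAM as CPU RAM, per the advice above.
salloc --partition=shared-gpu --time=01:00:00 \
       --mem=24GB --cpus-per-task=6 \
       --gres=gpu:1,VramPerGpu:24G
```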
OK, thanks a lot. I am currently updating the documentation to explain all this to future users and avoid too many mails/issues in the future.
Any recommended value for the memory, if you have ever trained a model?
Honestly... no idea. I have little expertise in this field. My only advice would be to try with a "reasonable" value (whatever that means; say 10GB) and increase it in case of problems.
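If it helps, standard Slurm accounting can show how much memory a finished job actually used, which makes it easier to pick a sensible `--mem` for the next run (assuming accounting is enabled on the cluster):

```bash
# MaxRSS is the peak CPU RAM actually used by the job's steps;
# replace <jobid> with the real job ID.
sacct -j <jobid> --format=JobID,MaxRSS,Elapsed
```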
The command is not working:
```
(kraken-env) (yggdrasil)-[gabays@login1 ~]$ salloc --partition=shared-gpu --time=01:00:00 --gpus=1 --mem=10GB --cpus-per-task=8 --gres=gpu:1,VramPerGpu:10G
salloc: error: Invalid generic resource (gres) specification
```
It seems to work on Baobab but not on Yggdrasil. I just sent an email to the HPC team.
Problem solved in #17 with:

```
salloc --partition=shared-gpu --time=01:00:00 --gpus=1 --mem=24GB --cpus-per-task=8 --gres=gpu:1,VramPerGpu:24G
```
Hey @pkzli,
How can we select a specific GPU, or a type of GPU? I see that some are more powerful than others:
https://doc.eresearch.unige.ch/hpc/hpc_clusters
To get better results, it is better to process larger images with the yolov5x6 model, but that requires much more memory and I keep getting:

```
RuntimeError: DataLoader worker is killed by signal: Killed.
```

I could reduce the batch size, but that would lower my F1 score. I have been told I would need at least 24GB. I can add `--mem=24G` to `salloc`, but I am allocated gpu[007], which (supposedly) is a P100 with 12GB.
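For reference, one way to check what was actually allocated from inside the job (assuming `nvidia-smi` is available on the compute node):

```bash
# Print the model and total memory of the GPU(s) visible to the job.
nvidia-smi --query-gpu=name,memory.total --format=csv
```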