FoNDUE-HTR / Documentation

Documentation for the HTR project

Select GPU #15

Closed · gabays closed this issue 1 year ago

gabays commented 1 year ago

Hey @pkzli,

How can we select a specific GPU, or a type of GPU? I see that some are more powerful than others:

https://doc.eresearch.unige.ch/hpc/hpc_clusters

To get better results, it is better to process larger images with the yolov5x6 model, but that requires much more memory, and I keep getting RuntimeError: DataLoader worker is killed by signal: Killed. I can reduce the batch size, but that would lower my F1 score.

I have been told I would need at least 24GB. I can add --mem=24G to salloc, but I am then allocated gpu[007], which is (supposedly) a P100 with 12GB.

gabays commented 1 year ago

Additionally: does the --ntasks argument have to match the number of workers? If I have asked for --ntasks=4, can I run --workers 6?

pkzli commented 1 year ago

For the first question, you have to specify that you want a GPU with >= 24GB of memory. The --mem option sets the minimum amount of CPU RAM, not GPU RAM. You should follow these instructions to select a GPU with a minimum amount of VRAM: https://doc.eresearch.unige.ch/hpc/slurm#gpgpu_jobs

For example

--gres=gpu:1,VramPerGpu:24G

indicates that you want one GPU with a minimum of 24GB of VRAM.
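
In a full allocation request, this option might be combined with the usual ones roughly as follows (the partition, time, and memory values here are illustrative and should be adapted to your job):

salloc --partition=shared-gpu --time=01:00:00 --gpus=1 --gres=gpu:1,VramPerGpu:24G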

Now for the second question: there are mainly two ways of doing parallelism on a CPU, threads and processes. As I understand it, the --workers parameter specifies the number of threads used when running on a CPU. With Slurm, --ntasks specifies the number of processes (or tasks) to use. If you want to use multiple threads, you should instead set --cpus-per-task to the same value as --workers. Hope this helps.
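
As a sketch (the worker count and memory value are just examples), requesting six CPU cores for the task lets you use six data-loading workers:

salloc --partition=shared-gpu --time=01:00:00 --gpus=1 --mem=10GB --cpus-per-task=6

and then, inside that allocation, run the training command with --workers 6.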

pkzli commented 1 year ago

To add something: I think at some point I told you to use --ntasks=4 to avoid the DataLoader crash. In fact that was just increasing the available CPU memory, because the more tasks are allocated, the more memory is allocated. The correct way to do it is to set --mem to a large enough value and --cpus-per-task to the same value as --workers. And if you use a GPU, set VramPerGpu to a value at least equal to --mem.
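
Putting these pieces together, an allocation request would look roughly like this (all values are illustrative and should be sized to the job at hand):

salloc --partition=shared-gpu --time=01:00:00 --gpus=1 --mem=24GB --cpus-per-task=8 --gres=gpu:1,VramPerGpu:24G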

gabays commented 1 year ago

OK, thanks a lot. I am currently updating the documentation to explain all this to future users and avoid too many emails/issues in the future.

gabays commented 1 year ago

Any recommended value for the memory if you happen to train a model?

pkzli commented 1 year ago

Honestly... no idea. I have little expertise in this field. My only advice would be to try a "reasonable" value (whatever that means; say 10GB) and increase it in case of problems.

gabays commented 1 year ago

The command is not working:

(kraken-env) (yggdrasil)-[gabays@login1 ~]$ salloc --partition=shared-gpu --time=01:00:00 --gpus=1 --mem=10GB --cpus-per-task=8 --gres=gpu:1,VramPerGpu:10G
salloc: error: Invalid generic resource (gres) specification

pkzli commented 1 year ago

It seems to work on Baobab but not on Yggdrasil. I just sent an email to the HPC team.

gabays commented 1 year ago

Problem solved in #17 with:

salloc --partition=shared-gpu --time=01:00:00 --gpus=1 --mem=24GB --cpus-per-task=8 --gres=gpu:1,VramPerGpu:24G