MIC-DKFZ / nnUNet

Apache License 2.0

Trying to train with GPU instead of CPU #2429

Open tedi14 opened 3 months ago

tedi14 commented 3 months ago

Hello, I am a student working on a project trying to understand how nnUNet works with TotalSegmentator. I was wondering how I am supposed to type the command so that training happens on the GPUs. I also wanted to ask how I can see the contouring it has made after training. I am quite new to all this, so any help and guidance will be very much appreciated :) I asked on the TotalSegmentator GitHub but I am still not sure, so I thought it would be worth asking here. Training always seems to go to 100% CPU usage, even if I add cuda:0 to the training command. How can I make it train on multiple GPUs instead of the CPU? I am also not sure how to get RTStruct contours from my own pretrained_weights.

It keeps on giving this type of error: [screenshot]

Also not sure what's happening here:

[screenshot]

gaojh135 commented 3 months ago

You can try this command: CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 800 3d_fullres 4 --disable_checkpointing -tr nnUNetTrainer_1epoch. If you want to use multiple GPUs for training, you can do it like this: nnUNetv2_train DATASET_NAME_OR_ID 2d 0 [--npz] -num_gpus X

Using multiple GPUs for training -> how_to_use_nnunet
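(For readability, here are the same two commands as they would be typed in a bash shell. Dataset ID 800, fold 4, and the 1-epoch trainer are taken from the comment above; substitute your own dataset ID, configuration and fold:)

```bash
# Pin training to GPU 0 and train dataset 800, 3d_fullres configuration, fold 4,
# with checkpointing disabled and the 1-epoch trainer variant
CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 800 3d_fullres 4 --disable_checkpointing -tr nnUNetTrainer_1epoch

# Multi-GPU training: replace DATASET_NAME_OR_ID and X with your own values
nnUNetv2_train DATASET_NAME_OR_ID 2d 0 [--npz] -num_gpus X
```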

tedi14 commented 3 months ago

Hi, thanks for the reply. It tells me that CUDA_VISIBLE_DEVICES isn't a recognised command. Have I not downloaded something?

gaojh135 commented 3 months ago

Yeah, CMD cannot recognize that command. You can try running the command in Git Bash.
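(Side note: the VAR=value command syntax is specific to bash-like shells, which is why CMD rejects it. A hedged sketch of the Windows-native equivalents, reusing the dataset and fold from the earlier command:)

```bash
# Windows CMD (two separate commands instead of the bash one-liner):
#   set CUDA_VISIBLE_DEVICES=0
#   nnUNetv2_train 800 3d_fullres 4
# PowerShell:
#   $env:CUDA_VISIBLE_DEVICES = "0"
#   nnUNetv2_train 800 3d_fullres 4
# Git Bash (as suggested above) accepts the inline form:
CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 800 3d_fullres 4
```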

tedi14 commented 3 months ago

I still seem to get a similar issue: [screenshot]

gaojh135 commented 3 months ago

Try setting batch size = 1?
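(For reference, in nnU-Net v2 the batch size lives in the plans file under nnUNet_preprocessed. A minimal sketch of lowering it to 1, assuming a hypothetical dataset folder name Dataset800_TotalSegmentator; adjust the path to your own layout:)

```bash
# Edit nnUNetPlans.json in place and set batch_size = 1 for the 3d_fullres configuration
python -c "
import json
path = 'nnUNet_preprocessed/Dataset800_TotalSegmentator/nnUNetPlans.json'  # hypothetical path
with open(path) as f:
    plans = json.load(f)
plans['configurations']['3d_fullres']['batch_size'] = 1
with open(path, 'w') as f:
    json.dump(plans, f, indent=2)
"
```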

tedi14 commented 3 months ago

It doesn't change anything; it's still the same error.

gaojh135 commented 3 months ago

Can you take a complete screenshot, starting from where you enter the commands? Also, have you tried using the default trainer?

tedi14 commented 3 months ago

[screenshots]

and here it is with the regular trainer:

[screenshots]

gaojh135 commented 3 months ago

I noticed this error: "Not enough memory resources are available to process this command." How much memory do you have available when running this command? Were you able to observe any memory spikes or instances where the memory was fully utilized?

tedi14 commented 3 months ago

Hi, sorry for the late reply. There is 96GB of memory available; wouldn't that be enough?

gaojh135 commented 3 months ago

Have you tried running it on Ubuntu? Windows often has various issues like this; I frequently encounter them as well. It might be due to compatibility problems. Someone else has encountered a similar issue: https://github.com/MIC-DKFZ/nnUNet/issues/1652

LalithShiyam commented 3 months ago

Hi @tedi14,

You need to check the GPU utilisation. Please either use nvidia-smi or some option available in Windows to see how much of the GPU is being used and what is using it. If it's a memory lock, you need to release it and then it should work - probably after a restart ;)
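(For reference, a quick way to do both checks from a terminal; the Python one-liner assumes PyTorch is importable in the same environment used for training:)

```bash
# Refresh GPU utilization and memory usage every second (Ctrl+C to stop)
nvidia-smi -l 1

# Confirm that the installed PyTorch build can actually see a CUDA device
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```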

Cheers, Lalith

haosun-cb commented 1 month ago

For the second issue if it is not solved yet:

We encountered exactly the same problem today, on a system with 96GB of RAM. It turned out to be a data-uncompressing issue (or a plan-and-preprocess issue). We are not sure about the root cause; however, it was solved after removing all the .npy files (but not the *_seg.npy ones) under the nnUNet_preprocessed folder.

The training script then redid the dataset uncompressing and it worked. So I would recommend clearing up enough disk space, deleting your dataset under the preprocessed folder, and retrying plan and preprocess with a smaller -np number.
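(A minimal sketch of that cleanup from a bash shell, assuming a hypothetical dataset folder name; it deletes the unpacked .npy arrays while keeping the *_seg.npy files, so the next training run re-unpacks them from the .npz files:)

```bash
# Hypothetical dataset folder; adjust to your own nnUNet_preprocessed layout
find nnUNet_preprocessed/Dataset800_TotalSegmentator -name "*.npy" ! -name "*_seg.npy" -delete

# If preprocessing has to be redone, fewer worker processes reduce peak memory usage
nnUNetv2_plan_and_preprocess -d 800 -np 2
```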

TaWald commented 1 week ago

Hey @tedi14, is this issue still persisting? If so, it would be great if you could post what the issue was and close this issue.