Open gergo7633 opened 10 months ago
Hi Gergő,
it sounds like your problem is somehow related to I/O; problems sometimes arise when including more modalities. Sometimes it helps to decrease the number of processes used for data augmentation. I am not exactly sure what is going wrong, but it seems like different processes block each other. Can you try running the training with nnUNet_n_proc_DA=4 or a similar value?
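Something along these lines, assuming you use the standard nnUNetv2_train entry point (adjust the dataset ID, configuration and fold to your run):

```bash
# Limit the number of background data augmentation workers for this run only;
# nnUNet reads the nnUNet_n_proc_DA environment variable when training starts.
nnUNet_n_proc_DA=4 nnUNetv2_train 10 3d_fullres 0
```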
Best, Yannick
Dear Yannick,
Thanks, I'm running it at the moment. Will report back with the results.
I get a warning, although I think it is unrelated to this issue:
/home/kutato/miniconda3/envs/nnunet/lib/python3.11/site-packages/torch/onnx/symbolic_helper.py:1513: UserWarning: ONNX export mode is set to TrainingMode.EVAL, but operator 'instance_norm' is set to train=True. Exporting with train=True. warnings.warn(
It seems to be running now, although the epoch times are poor: 309, 525, 581, 609 sec.
At this rate the training would take 5 weeks, so I'll try nnUNet_n_proc_DA=8.
It exited again after 7 epochs (with DA=4).
The log is not too verbose:
2024-01-29 08:48:42.751125: unpacking dataset...
2024-01-29 08:48:44.536602: unpacking done...
2024-01-29 08:48:44.537411: do_dummy_2d_data_aug: False
2024-01-29 08:48:44.538070: Using splits from existing split file: /home/kutato/WORK/nnUNET/nnUNet_preprocessed/Dataset010_Gabi/splits_final.json
2024-01-29 08:48:44.538301: The split file contains 5 splits.
2024-01-29 08:48:44.538336: Desired fold for training: 0
2024-01-29 08:48:44.538365: This split has 101 training and 26 validation cases.
2024-01-29 08:48:46.858114:
2024-01-29 08:48:46.858172: Epoch 0
2024-01-29 08:48:46.858257: Current learning rate: 0.01
2024-01-29 08:53:56.767805: train_loss -0.0442
2024-01-29 08:53:56.819833: val_loss -0.1983
2024-01-29 08:53:56.825172: Pseudo dice [0.3593]
2024-01-29 08:53:56.827722: Epoch time: 309.86 s
2024-01-29 08:53:56.827986: Yayy! New best EMA pseudo Dice: 0.3593
2024-01-29 08:54:10.729097:
2024-01-29 08:54:10.732906: Epoch 1
2024-01-29 08:54:10.733010: Current learning rate: 0.00999
2024-01-29 09:02:56.416936: train_loss -0.3515
2024-01-29 09:02:56.642550: val_loss -0.458
2024-01-29 09:02:56.642800: Pseudo dice [0.6809]
2024-01-29 09:02:56.662519: Epoch time: 525.69 s
2024-01-29 09:02:56.669545: Yayy! New best EMA pseudo Dice: 0.3915
2024-01-29 09:03:02.479419:
2024-01-29 09:03:02.479528: Epoch 2
2024-01-29 09:03:02.479643: Current learning rate: 0.00998
2024-01-29 09:12:43.991543: train_loss -0.4517
2024-01-29 09:12:44.105572: val_loss -0.4755
2024-01-29 09:12:44.105880: Pseudo dice [0.5529]
2024-01-29 09:12:44.134376: Epoch time: 581.46 s
2024-01-29 09:12:44.140301: Yayy! New best EMA pseudo Dice: 0.4076
2024-01-29 09:12:54.901318:
2024-01-29 09:12:54.902283: Epoch 3
2024-01-29 09:12:54.902393: Current learning rate: 0.00997
2024-01-29 09:23:04.217646: train_loss -0.5207
2024-01-29 09:23:04.342724: val_loss -0.6291
2024-01-29 09:23:04.343099: Pseudo dice [0.7253]
2024-01-29 09:23:04.354678: Epoch time: 609.31 s
2024-01-29 09:23:04.355056: Yayy! New best EMA pseudo Dice: 0.4394
2024-01-29 09:24:00.775050:
2024-01-29 09:24:00.775135: Epoch 4
2024-01-29 09:24:00.775217: Current learning rate: 0.00996
2024-01-29 09:34:56.977838: train_loss -0.5497
2024-01-29 09:34:57.179372: val_loss -0.5356
2024-01-29 09:34:57.179598: Pseudo dice [0.6538]
2024-01-29 09:34:57.203972: Epoch time: 656.14 s
2024-01-29 09:34:57.204251: Yayy! New best EMA pseudo Dice: 0.4608
2024-01-29 09:35:08.391678:
2024-01-29 09:35:08.402165: Epoch 5
2024-01-29 09:35:08.402269: Current learning rate: 0.00995
2024-01-29 09:45:29.714143: train_loss -0.5386
2024-01-29 09:45:30.007460: val_loss -0.5723
2024-01-29 09:45:30.027127: Pseudo dice [0.7559]
2024-01-29 09:45:30.040258: Epoch time: 621.25 s
2024-01-29 09:45:30.040592: Yayy! New best EMA pseudo Dice: 0.4903
2024-01-29 09:45:44.402809:
2024-01-29 09:45:44.413173: Epoch 6
2024-01-29 09:45:44.424425: Current learning rate: 0.00995
2024-01-29 09:56:08.983552: train_loss -0.5396
2024-01-29 09:56:09.201721: val_loss -0.5167
2024-01-29 09:56:09.202040: Pseudo dice [0.6601]
2024-01-29 09:56:09.217962: Epoch time: 624.46 s
2024-01-29 09:56:09.223112: Yayy! New best EMA pseudo Dice: 0.5073
2024-01-29 09:56:23.203871:
2024-01-29 09:56:23.215884: Epoch 7
2024-01-29 09:56:23.215985: Current learning rate: 0.00994
Hi Gergő,
then we at least know that the issue is related to the data loading. But your epoch times are not acceptable at all, and the increasing epoch times you are seeing are unfortunately another rather common issue where training does not really continue after some time. One more thing you can do is change the dtype of the preprocessed data. By default nnUNet saves it as float32, but you can change that to float16 without a loss in performance. It should work out of the box if you just change/add a line in the preprocessing script. Let me know if you need help with that.
My usual solution is to put these trainings on our A100s, which have enough CPU and I/O capability to handle all the data, but the whole idea of nnUNet is that it can run on a consumer-grade GPU, so that is not ideal at all. For a new version of nnUNet this is definitely on our wishlist!
Best, Yannick
I think I should change "/nnUNet/nnunetv2/preprocessing/preprocessors/default_preprocessor.py", am I right? I'd be glad if you could just tell me where to put that extra line.
Would replacing the 5700X with a 5900X mitigate this issue? It would reduce the CPU bottleneck a bit (+4 physical cores).
Oh, I found it in simpleitk_reader_writer.py.
Btw, it works great with only 3 channels. I am rerunning the preprocessing in float16; maybe I can use all 5 channels that way.
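A quick sanity check (assuming the dtype change works as intended) is that the preprocessed dataset on disk should shrink to roughly half of its float32 size:

```bash
# If the arrays are now stored as float16 instead of float32, the preprocessed
# folder should be roughly half of its previous size.
du -sh /home/kutato/WORK/nnUNET/nnUNet_preprocessed/Dataset010_Gabi
```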
3 channels in float16 with nnUNet_n_proc_DA=12 gives 105-110 sec epoch times. With DA=8 it is ~125 sec. It runs stably.
5 channels in float16 with DA=8 gives 195 sec epoch times but exits at the second epoch, the same as with DA=4 (250 sec epoch times).
A 5900X instead of the 5700X might help, but I think it is not only the CPU; with multiple channels it is rather the I/O that becomes the problem, and there it probably won't make a huge difference. What were your epoch times for float32 with the 3-channel training? I will need to think about your issue and ask the others whether they have any ideas, and I will come back to you later.
Best, Yannick
In float32, 3-channel training:
DA=4: 177, 225, 212, 202, 211
DA=8: 153, 275, 172, 179, 204, 175, 181, 196, 178, 211, 233, 206
DA=12: 150, crashed on epoch 1
The screenshot below shows training with DA=4. Look at the GPU utilization; it is very poor. The swap usage is continuously (if slowly) rising, despite plenty of free RAM (only 25-31% used).
I also tried float16 data, 3 channels, 2d training: GPU utilization was a continuous 92-98%, with epoch times of 48-52 sec.
This is how GPU utilization looks with 2d training, 3 channels, DA=12, float16: a constant 90+%. (Note that the GPU utilization graph is inverted; the bars go from the top (0%) to the bottom (100%).)
Thanks for the details. Unfortunately I still don't really have a good solution for your problem. Since you mentioned that your swap usage is continuously rising, one more thing you can try is to change your swappiness (see here for how to do that). This basically controls how aggressively the kernel swaps memory pages to disk and might speed up the training. Maybe that will already help a bit with your issue.
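On most Linux systems this is just a sysctl setting, something along these lines (10 is only an example value; lower means the kernel swaps less eagerly):

```bash
# Show the current value (the default on most distributions is 60).
cat /proc/sys/vm/swappiness
# Lower it for the running system; this does not survive a reboot.
sudo sysctl vm.swappiness=10
# To make it permanent, add the line "vm.swappiness=10" to /etc/sysctl.conf.
```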
Best, Yannick
I've reduced the channels to 3, modified the preprocessing to save 16-bit float images, set the nnUNet_n_proc_DA variable to 8, and radically decreased the swappiness to 5. Now the training runs fine, with stable epoch times and no swap usage. GPU utilization is still poor, though, around 40-50% on average.
I'll replace the CPU with a 5900X to further mitigate the I/O and CPU bottleneck. Hopefully I'll then be able to include one or both of the remaining channels.
If I see that correctly, this was an improvement of ~10 seconds per epoch, right? You can also try disabling swap completely for your trainings if you have enough RAM: just run sudo swapoff -a to turn it off and sudo swapon -a to get it back. Some trainings were only possible without swap on my machine. However, this will probably still not solve your main issue of the I/O bottleneck and low GPU utilization, but it may allow you to train your model with all modalities without it getting stuck at some point.
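For reference, together with a quick check of how much RAM is actually in use:

```bash
# See how much RAM and swap are currently in use before turning swap off.
free -h
# Disable all swap devices for the current session.
sudo swapoff -a
# Re-enable them once the training is done.
sudo swapon -a
```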
What kind of SSD do you have? The bandwidth of it might also be a limiting factor for IO.
Best, Yannick
With 32-bit data, 3 channels, and DA=8 I had epoch times fluctuating between 150 and 270 sec, so the performance gain is even bigger than that. All of these tweaks (16-bit data, swappiness) helped a lot, especially the swappiness setting.
The RAM usage is pretty slim: 9 GB for the applications, 16 GB for the write cache. The swap usage started at 126 MB and is now at 516 MB (after 850 epochs). I will further decrease the swappiness to 1 or 0, or turn swap off completely.
The SSD should be fine in theory: it is an ADATA XPG GAMMIX S70 Blade 2TB, a Gen4x4 drive capable of 6000+ MB/s read and write, but the motherboard chipset is a B450, which only supports PCIe 3.0, not 4.0, so the bandwidth is probably cut in half.
Also, the SSD temperature is not monitored, and the S70 is prone to thermal throttling under heavy usage. I do not know if I can check the temperature under Linux. The SSD came with a PS5-compatible heatsink, which is garbage (a thin metal sheet).
I managed to find some free time and replaced the CPU with a 5900X (along with the PSU and the cooling), and got a drastic boost in performance. The 115 sec average epoch time dropped to 88-89 sec with DA=12. GPU utilization is much better too, probably around 60-70%, with long 90%+ periods.
The only downside is the enormous amount of heat compared to the 5700X.
The SSD is probably fine, even with the B450. Interesting that the CPU makes such a big difference. The 5900X has 12 cores and the 5700X "only" 8, right? It seems the dataloading makes better use of the extra cores/threads even when the number of processes is limited. Curious to see whether you can now also train with all 5 modalities.
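Regarding the SSD temperature you mentioned: under Linux you can usually read it with nvme-cli; the device name below (/dev/nvme0) is just an example, check the output of nvme list for yours:

```bash
# Install the NVMe admin tool (Debian/Ubuntu package name; adjust for your distro).
sudo apt install nvme-cli
# List the NVMe drives to find the right device, then read its SMART log,
# which includes the current composite temperature and thermal warning counters.
sudo nvme list
sudo nvme smart-log /dev/nvme0
```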
Dear Yannick,
Yes, the 5900X has 12 cores and the 5700X had only 8, and the L3 cache is twice the size (64 MB vs 32 MB). It runs at 4.2-4.3 GHz on all cores continuously.
I'll try it with all five modalities, but for now I'm finishing the trainings with 3.
I had a lot of trouble with the 5900X (it is not so easy to work with): frequent reboots, especially during idle or when some of the cores went into a deep C-state (C5-C6). It took several days to figure out that I had to disable C-states and PSS to make the CPU stable. I also had to replace the RAM, as there were a lot of RAM errors with the 5900X (there were none with the 5700X, which is curious). At first I thought the CPU's memory controller was faulty, but in the meantime I bought some new RAM modules to try, and was surprised that they showed no errors during a 48-hour RAM test (Memtest86+ for 24 hours and OCCT for another 24 hours). The new memory also made it possible to significantly increase the Infinity Fabric frequency, so I now get epoch times of ~70 sec with the 3 channels.
I think it is well worth spending some time on system-wide hardware (and software) optimization.
Hi Gergő,
these are some very interesting findings; I will definitely take a deeper look into that. We also had problems with RAM from time to time and usually "solved" them by reducing the memory frequency in the BIOS. 70 seconds sounds really good for your three channels.
Best, Yannick
Dear Yannick,
I finished the 3-channel trainings and the results look promising. The 5 modalities do carry a nice performance penalty: I have stable ~120-125 sec epochs, but at least it now runs without any issues. Thank you for all your help. GPU utilization with 3 channels was excellent, ~75-80% on average; now it is around 40% with the 5 channels, probably due to the I/O and CPU bottleneck.
I know it is not optimal, but I can live with the long epoch times; I'm happy that it works now (it was a 3-week-long journey).
Dear Fabian,
We created a model for automated intracranial metastasis segmentation with nnUNetv2, based on postcontrast 3D T1 images, with high accuracy (91%).
We wanted to improve the model, so we added T2, SWI, ADC, and FLAIR images, all coregistered to the 3D T1, and increased the number of subjects significantly.
However, the training always hangs after the first epoch. CPU and GPU utilization is high during the first epoch (epoch 0); when moving on to the next epoch (epoch 1) it goes back to idle, and after about a minute of processing that epoch, utilization drops to idle again, the terminal stays open for 5-10 minutes, and finally the terminal closes automatically. The epoch time is also extremely long, ~200-260 sec, for an RTX 4080 and a Ryzen 7 5700X. GPU RAM usage is surprisingly low (7-8 GB), and GPU utilization is not as good as with the original dataset, where it was constantly between 60-95%; here it fluctuates heavily between 10-90%.
I use nnUNet_n_proc_DA=12 and OMP_NUM_THREADS=1 for all trainings.
Any help would be appreciated.
Regards,
Gergő