Open gergo7633 opened 10 months ago
Hi Gergő,
it sounds like your problem is somehow related to I/O; problems sometimes arise when including more modalities. Sometimes it helps to decrease the number of processes used for data augmentation. I am not exactly sure what is going wrong, but it seems like different processes block each other. Can you try running the training with nnUNet_n_proc_DA=4 or a similar value?
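Something along these lines, assuming you use the standard nnUNetv2_train entry point (adjust the dataset ID, configuration and fold to your run):

```bash
# Limit the number of background data augmentation workers for this run only;
# nnUNet reads the nnUNet_n_proc_DA environment variable when training starts.
nnUNet_n_proc_DA=4 nnUNetv2_train 10 3d_fullres 0
```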
Best, Yannick
Dear Yannick,
Thanks, I'm running it at the moment. Will report back with the results.
I get a warning, although I think it is unrelated to this issue:
/home/kutato/miniconda3/envs/nnunet/lib/python3.11/site-packages/torch/onnx/symbolic_helper.py:1513: UserWarning: ONNX export mode is set to TrainingMode.EVAL, but operator 'instance_norm' is set to train=True. Exporting with train=True. warnings.warn(
It seems to be running now, although the epoch times are poor: 309, 525, 581, 609 sec.
At this rate the training would take 5 weeks, so I'll try nnUNet_n_proc_DA=8.
It exited again after 7 epochs (with DA=4).
The log is not too verbose:
2024-01-29 08:48:42.751125: unpacking dataset...
2024-01-29 08:48:44.536602: unpacking done...
2024-01-29 08:48:44.537411: do_dummy_2d_data_aug: False
2024-01-29 08:48:44.538070: Using splits from existing split file: /home/kutato/WORK/nnUNET/nnUNet_preprocessed/Dataset010_Gabi/splits_final.json
2024-01-29 08:48:44.538301: The split file contains 5 splits.
2024-01-29 08:48:44.538336: Desired fold for training: 0
2024-01-29 08:48:44.538365: This split has 101 training and 26 validation cases.
2024-01-29 08:48:46.858114:
2024-01-29 08:48:46.858172: Epoch 0
2024-01-29 08:48:46.858257: Current learning rate: 0.01
2024-01-29 08:53:56.767805: train_loss -0.0442
2024-01-29 08:53:56.819833: val_loss -0.1983
2024-01-29 08:53:56.825172: Pseudo dice [0.3593]
2024-01-29 08:53:56.827722: Epoch time: 309.86 s
2024-01-29 08:53:56.827986: Yayy! New best EMA pseudo Dice: 0.3593
2024-01-29 08:54:10.729097:
2024-01-29 08:54:10.732906: Epoch 1
2024-01-29 08:54:10.733010: Current learning rate: 0.00999
2024-01-29 09:02:56.416936: train_loss -0.3515
2024-01-29 09:02:56.642550: val_loss -0.458
2024-01-29 09:02:56.642800: Pseudo dice [0.6809]
2024-01-29 09:02:56.662519: Epoch time: 525.69 s
2024-01-29 09:02:56.669545: Yayy! New best EMA pseudo Dice: 0.3915
2024-01-29 09:03:02.479419:
2024-01-29 09:03:02.479528: Epoch 2
2024-01-29 09:03:02.479643: Current learning rate: 0.00998
2024-01-29 09:12:43.991543: train_loss -0.4517
2024-01-29 09:12:44.105572: val_loss -0.4755
2024-01-29 09:12:44.105880: Pseudo dice [0.5529]
2024-01-29 09:12:44.134376: Epoch time: 581.46 s
2024-01-29 09:12:44.140301: Yayy! New best EMA pseudo Dice: 0.4076
2024-01-29 09:12:54.901318:
2024-01-29 09:12:54.902283: Epoch 3
2024-01-29 09:12:54.902393: Current learning rate: 0.00997
2024-01-29 09:23:04.217646: train_loss -0.5207
2024-01-29 09:23:04.342724: val_loss -0.6291
2024-01-29 09:23:04.343099: Pseudo dice [0.7253]
2024-01-29 09:23:04.354678: Epoch time: 609.31 s
2024-01-29 09:23:04.355056: Yayy! New best EMA pseudo Dice: 0.4394
2024-01-29 09:24:00.775050:
2024-01-29 09:24:00.775135: Epoch 4
2024-01-29 09:24:00.775217: Current learning rate: 0.00996
2024-01-29 09:34:56.977838: train_loss -0.5497
2024-01-29 09:34:57.179372: val_loss -0.5356
2024-01-29 09:34:57.179598: Pseudo dice [0.6538]
2024-01-29 09:34:57.203972: Epoch time: 656.14 s
2024-01-29 09:34:57.204251: Yayy! New best EMA pseudo Dice: 0.4608
2024-01-29 09:35:08.391678:
2024-01-29 09:35:08.402165: Epoch 5
2024-01-29 09:35:08.402269: Current learning rate: 0.00995
2024-01-29 09:45:29.714143: train_loss -0.5386
2024-01-29 09:45:30.007460: val_loss -0.5723
2024-01-29 09:45:30.027127: Pseudo dice [0.7559]
2024-01-29 09:45:30.040258: Epoch time: 621.25 s
2024-01-29 09:45:30.040592: Yayy! New best EMA pseudo Dice: 0.4903
2024-01-29 09:45:44.402809:
2024-01-29 09:45:44.413173: Epoch 6
2024-01-29 09:45:44.424425: Current learning rate: 0.00995
2024-01-29 09:56:08.983552: train_loss -0.5396
2024-01-29 09:56:09.201721: val_loss -0.5167
2024-01-29 09:56:09.202040: Pseudo dice [0.6601]
2024-01-29 09:56:09.217962: Epoch time: 624.46 s
2024-01-29 09:56:09.223112: Yayy! New best EMA pseudo Dice: 0.5073
2024-01-29 09:56:23.203871:
2024-01-29 09:56:23.215884: Epoch 7
2024-01-29 09:56:23.215985: Current learning rate: 0.00994
Hi Gergő,
then we at least know that the issue is related to the data loading. But your epoch times are not acceptable at all, and the increasing epoch times you are seeing are unfortunately another rather common issue where training does not really continue after some time. One more thing you can do is change the dtype of the preprocessed data. By default nnUNet saves it as float32, but you can change that to float16 without a loss in performance. It should work out of the box if you just change/add a line in the preprocessing script. Let me know if you need help with that.
My usual solution is to put these trainings on our A100s, which have enough CPU and I/O capability to handle all the data, but the whole idea of nnUNet is that it can run on a consumer-grade GPU, so that is not ideal at all. For a new version of nnUNet this is definitely on our wishlist!
Best, Yannick
I think I should change "/nnUNet/nnunetv2/preprocessing/preprocessors/default_preprocessor.py", am I right? I'd be glad if you could just tell me where to put that extra line.
Would replacing the 5700X with a 5900X mitigate this issue? It would reduce the CPU bottleneck a bit (+4 physical cores).
Oh, I found it in simpleitk_reader_writer.py.
Btw, it works great with only 3 channels. I am rerunning the preprocessing in float16; maybe I can use all 5 channels that way.
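A quick sanity check (assuming the dtype change works as intended) is that the preprocessed dataset on disk should shrink to roughly half of its float32 size:

```bash
# If the arrays are now stored as float16 instead of float32, the preprocessed
# folder should be roughly half of its previous size.
du -sh /home/kutato/WORK/nnUNET/nnUNet_preprocessed/Dataset010_Gabi
```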
3 channels in float16 with nnUNet_n_proc_DA=12 gives 105-110 sec epoch times. With DA=8 it is ~125 sec. It runs stably.
5 channels in float16 with DA=8 gives 195 sec epoch times but exits at the second epoch, the same as with DA=4 (250 sec epoch times).
A 5900X instead of the 5700X might help, but I think it is not only the CPU; with multiple channels it is rather the I/O that becomes the problem, and there it probably won't make a huge difference. What were your epoch times for float32 with the 3-channel training? I will need to think about your issue and ask the others whether they have any ideas, and I will come back to you later.
Best, Yannick
In float32, 3-channel training:
DA=4: 177, 225, 212, 202, 211
DA=8: 153, 275, 172, 179, 204, 175, 181, 196, 178, 211, 233, 206
DA=12: 150, crashed on epoch 1
The screenshot below shows training with DA=4. Look at the GPU utilization; it is very poor. The swap usage is continuously (if slowly) rising, despite plenty of free RAM (only 25-31% used).
I also tried float16 data, 3 channels, 2d training: GPU utilization was a continuous 92-98%, with epoch times of 48-52 sec.
This is how GPU utilization looks with 2d training, 3 channels, DA=12, float16: a constant 90+%. (Note that the GPU utilization graph is inverted; the bars go from the top (0%) to the bottom (100%).)
Thanks for the details. Unfortunately I still don't really have a good solution for your problem. Since you mentioned that your swap usage is continuously rising, one more thing you can try is to change your swappiness (see here for how to do that). This basically controls how aggressively the kernel swaps memory pages to disk and might speed up the training. Maybe that will already help a bit with your issue.
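On most Linux systems this is just a sysctl setting, something along these lines (10 is only an example value; lower means the kernel swaps less eagerly):

```bash
# Show the current value (the default on most distributions is 60).
cat /proc/sys/vm/swappiness
# Lower it for the running system; this does not survive a reboot.
sudo sysctl vm.swappiness=10
# To make it permanent, add the line "vm.swappiness=10" to /etc/sysctl.conf.
```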
Best, Yannick
I've reduced the channels to 3, modified the preprocessing to save 16-bit float images, set the nnUNet_n_proc_DA variable to 8, and radically decreased the swappiness to 5. Now the training runs fine, with stable epoch times and no swap usage. GPU utilization is still poor, though, around 40-50% on average.
I'll replace the CPU with a 5900X to further mitigate the I/O and CPU bottleneck. Hopefully I'll then be able to include one or both of the remaining channels.
If I see that correctly, this was an improvement of ~10 seconds per epoch, right? You can also try disabling swap completely for your trainings if you have enough RAM: just run sudo swapoff -a to turn it off and sudo swapon -a to get it back. Some trainings were only possible without swap on my machine. However, this will probably still not solve your main issue of the I/O bottleneck and low GPU utilization, but it may allow you to train your model with all modalities without it getting stuck at some point.
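For reference, together with a quick check of how much RAM is actually in use:

```bash
# See how much RAM and swap are currently in use before turning swap off.
free -h
# Disable all swap devices for the current session.
sudo swapoff -a
# Re-enable them once the training is done.
sudo swapon -a
```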
What kind of SSD do you have? The bandwidth of it might also be a limiting factor for IO.
Best, Yannick
With 32-bit data, 3 channels, and DA=8 I had epoch times fluctuating between 150 and 270 sec, so the performance gain is even bigger than that. All of these tweaks (16-bit data, swappiness) helped a lot, especially the swappiness setting.
The RAM usage is pretty slim: 9 GB for the applications, 16 GB for the write cache. The swap usage started at 126 MB and is now at 516 MB (after 850 epochs). I will further decrease the swappiness to 1 or 0, or turn swap off completely.
The SSD should be fine in theory: it is an ADATA XPG GAMMIX S70 Blade 2TB, a Gen4x4 drive capable of 6000+ MB/s read and write, but the motherboard chipset is a B450, which only supports PCIe 3.0, not 4.0, so the bandwidth is probably cut in half.
Also, the SSD temperature is not monitored, and the S70 is prone to thermal throttling under heavy usage. I do not know if I can check the temperature under Linux. The SSD came with a PS5-compatible heatsink, which is garbage (a thin metal sheet).
I managed to find some free time and replaced the CPU with a 5900X (along with the PSU and the cooling), and got a drastic boost in performance. The 115 sec average epoch time dropped to 88-89 sec with DA=12. GPU utilization is much better too, probably around 60-70%, with long 90%+ periods.
The only downside is the enormous amount of heat compared to the 5700X.
The SSD is probably fine, even with the B450. Interesting that the CPU makes such a big difference. The 5900X has 12 cores and the 5700X "only" 8, right? It seems the dataloading makes better use of the extra cores/threads even when the number of processes is limited. Curious to see whether you can now also train with all 5 modalities.
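Regarding the SSD temperature you mentioned: under Linux you can usually read it with nvme-cli; the device name below (/dev/nvme0) is just an example, check the output of nvme list for yours:

```bash
# Install the NVMe admin tool (Debian/Ubuntu package name; adjust for your distro).
sudo apt install nvme-cli
# List the NVMe drives to find the right device, then read its SMART log,
# which includes the current composite temperature and thermal warning counters.
sudo nvme list
sudo nvme smart-log /dev/nvme0
```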
Dear Yannick,
Yes, the 5900X has 12 cores and the 5700X had only 8, and the L3 cache is twice the size (64 MB vs 32 MB). It runs at 4.2-4.3 GHz on all cores continuously.
I'll try it with all five modalities, but for now I'm finishing the trainings with 3.
I had a lot of trouble with the 5900X (it is not so easy to work with): frequent reboots, especially during idle or when some of the cores went into a deep C-state (C5-C6). It took several days to figure out that I had to disable C-states and PSS to make the CPU stable. I also had to replace the RAM, as there were a lot of RAM errors with the 5900X (there were none with the 5700X, which is curious). At first I thought the CPU's memory controller was faulty, but in the meantime I bought some new RAM modules to try, and was surprised that they showed no errors during a 48-hour RAM test (Memtest86+ for 24 hours and OCCT for another 24 hours). The new memory also made it possible to significantly increase the Infinity Fabric frequency, so I now get epoch times of ~70 sec with the 3 channels.
I think it is well worth spending some time on system-wide hardware (and software) optimization.
Hi Gergő,
these are some very interesting findings; I will definitely take a deeper look into that. We also had problems with RAM from time to time and usually "solved" them by reducing the memory frequency in the BIOS. 70 seconds sounds really good for your three channels.
Best, Yannick
Dear Yannick,
I finished the 3-channel trainings and the results look promising. The 5 modalities do carry a nice performance penalty: I have stable ~120-125 sec epochs, but at least it now runs without any issues. Thank you for all your help. GPU utilization with 3 channels was excellent, ~75-80% on average; now it is around 40% with the 5 channels, probably due to the I/O and CPU bottleneck.
I know it is not optimal, but I can live with the long epoch times; I'm happy that it works now (it was a 3-week-long journey).
Dear Fabian,
We created a model for automated intracranial metastasis segmentation with nnUNetv2, based on postcontrast 3D T1 images, with high accuracy (91%).
We wanted to improve the model, so we added T2, SWI, ADC, and FLAIR images, all coregistered to the 3D T1, and increased the number of subjects significantly.
However, the training always hangs after the first epoch. CPU and GPU utilization is high during the first epoch (epoch 0); when moving on to the next epoch (epoch 1) it goes back to idle, and after about a minute of processing that epoch, utilization drops to idle again, the terminal stays open for 5-10 minutes, and finally the terminal closes automatically. The epoch time is also extremely long, ~200-260 sec, for an RTX 4080 and a Ryzen 7 5700X. GPU RAM usage is surprisingly low (7-8 GB), and GPU utilization is not as good as with the original dataset, where it was constantly between 60-95%; here it fluctuates heavily between 10-90%.
I use nnUNet_n_proc_DA=12 and OMP_NUM_THREADS=1 for all trainings.
Any help would be appreciated.
Regards,
Gergő