MIC-DKFZ / nnUNet

Apache License 2.0

Generating summary.json hangs for a long time during training #1885

Open xl-lei opened 8 months ago

xl-lei commented 8 months ago

Hello, during training I often get stuck in the validation step on the validation set. At that point the predictions are finished, but summary.json is not generated and GPU utilization stays at 0 for a long time.

TaWald commented 8 months ago

Can you be more precise?

Without more details we can't help you and can only guess.

Maybe one of your multiprocessing workers is stuck and the Pool cannot be joined?

xl-lei commented 8 months ago

Whether I run full training or validation only, the validation of the validation set sometimes stalls before all samples have been predicted: no further predictions are made and GPU usage stays at 0 for a long time. The program does not stop on its own; it just keeps running.

xl-lei commented 8 months ago

I have now stopped the program with Ctrl+C, and the traceback shows it stopped at this point. Was the program stuck here the whole time? (screenshot attached)

TaWald commented 8 months ago

It seems like one of your workers dies due to excessive memory consumption at inference time. It also looks like you are training with STU-Net, which is a large architecture; you should probably open an issue with them, as this is likely specific to the larger architecture you train with.

Aside from this, a general tip: reduce the number of processes you run inference with, so you don't have too many input images open and in memory at the same time.
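
As a minimal sketch of what that could look like for standalone inference, assuming the `-npp` / `-nps` worker flags of `nnUNetv2_predict` (all paths and the dataset/configuration/fold values are placeholders, not from this issue):

```bash
# Hypothetical example: run prediction with fewer worker processes.
# -npp / -nps limit the preprocessing and segmentation-export workers,
# so fewer images are held in memory at the same time.
nnUNetv2_predict \
    -i /path/to/input_images \
    -o /path/to/output_segmentations \
    -d 999 -c 3d_fullres -f 0 \
    -npp 1 -nps 1
```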

xl-lei commented 8 months ago

Hello, the program in the screenshot above was stopped manually with Ctrl+C. I re-ran the program and watched the files generated during validation: the stall always happens when almost all of the data has been predicted. A few cases remain, and those remaining cases are never completed. For example, I ran --val on the validation split of AMOS2022, which has 120 cases. The first 118 cases finished quickly, but around 20:45 the program stopped predicting the remaining ones, and GPU utilization stayed at 0 for a long time. (screenshots attached) Why does this anomaly happen? Only a few images are left, so I don't think it is a memory issue. Could you please help me solve this problem? Thank you very much!

ancestor-mithril commented 8 months ago

Monitor the CPU and RAM usage. Maybe some cases are very large and do not fit into RAM. Once they start using swap memory, the operations become very slow due to context switches. Another possible issue is that the shared memory size is not big enough if you are using a Docker container.
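
A rough way to watch for this while the validation runs; these are plain Linux commands, not specific to nnU-Net, and the `--shm-size` value is only an example:

```bash
# Watch RAM and swap usage, refreshing every 2 seconds
watch -n 2 free -h

# Check the size of the shared memory segment (relevant inside Docker)
df -h /dev/shm

# If running in Docker, the container can be started with a larger
# shared memory segment, e.g.:
#   docker run --shm-size=8g ...
```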

ancestor-mithril commented 8 months ago

A possible solution is to use `nnUNet_def_n_proc=2` to reduce the number of processes used for the validation.

xl-lei commented 8 months ago

Could you tell me where I should set this parameter? Thank you. @ancestor-mithril

ancestor-mithril commented 8 months ago

> Could you tell me where I should set this parameter? Thank you. @ancestor-mithril

```
nnUNet_def_n_proc=2 nnUNetv2_train ...
```

Also monitor the RAM and CPU usage if the validation is stuck.
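
For reference, the same environment variable can also be set when re-running only the validation. A sketch, assuming the `--val` flag of `nnUNetv2_train`; the dataset ID, configuration, and fold are placeholders:

```bash
# Limit nnU-Net's default worker count to 2 processes and re-run only the validation
nnUNet_def_n_proc=2 nnUNetv2_train 999 3d_fullres 0 --val
```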

TaWald commented 8 months ago

Thanks for pitching in @ancestor-mithril!

As ancestor-mithril suggested, decrease the number of workers so you don't run into these memory issues, and monitor your RAM usage so you don't have to wait 30 minutes to notice that something is wrong.

Another low-effort option would be to copy the remaining cases into a new directory and start inference there. This way you can narrow the problem down to one of the remaining cases, which makes debugging easier (if it is not the OOM issue). A sketch of that approach follows below.
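
A sketch of that approach; all paths, the case names, and the dataset/configuration/fold values are hypothetical:

```bash
# Copy only the cases that never finished into a fresh input folder
mkdir -p /tmp/remaining_cases
cp /path/to/imagesVal/case_0118_0000.nii.gz /tmp/remaining_cases/
cp /path/to/imagesVal/case_0119_0000.nii.gz /tmp/remaining_cases/

# Run inference on just those cases with a single worker per step
# to see which one hangs or crashes
nnUNetv2_predict \
    -i /tmp/remaining_cases \
    -o /tmp/remaining_cases_pred \
    -d 999 -c 3d_fullres -f 0 \
    -npp 1 -nps 1
```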

TaWald commented 8 months ago

@xinglianglei Were you able to solve your problem?

TaWald commented 7 months ago

@xinglianglei If no update is given, I will close this issue due to inactivity in the coming days.

Zhack47 commented 6 months ago

We had the same issue on our side; we will report soon whether the fixes were effective.

Zhack47 commented 6 months ago

The problem was solved on our side by using `nnUNet_def_n_proc=2`. We did not observe any suspicious memory activity while the issue occurred.