MIC-DKFZ / nnUNet


Errors when training on google cloud - possibly full CPU RAM? #1128

Closed chrisrapson closed 1 year ago

chrisrapson commented 2 years ago

Each of my training runs so far has been interrupted by the following error:

fuse: writeMessage: no such file or directory [16 0 0 0 218 255 255 255 220 184 69 86 0 0 0 0]

The first half of the number array seems to be the same every time, but the numbers in the second half sometimes change. The message repeats roughly 20 times before the training exits. I've seen a couple of different messages related to the exit:

ReadFile: operation canceled, fh.reader.ReadAt: readFull: not retrying Read("nnunet/nnUNet_preprocessed/Task101_Hip/nnUNetData_plans_v2.1_stage1/2008_2041.npy", 1657233746627223): net/http: request canceled

or

Exception in thread Thread-5:
...
File "/opt/conda/lib/python3.7/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("Abort event was set. So someone died and we should end this madness. \nIMPORTANT:

So far it has occurred at different stages during training, not always at the same epoch.

Given that nobody else has reported this problem, but it's reproducible (if not deterministic) for me, I'm assuming it has something to do with the cloud environment. I don't think it's the file system (which is automatically mounted from Google Cloud Storage), since files are written without any issue for some number of epochs before the failure. I'm not using spot instances, so I wouldn't expect the mount to be disconnected.

Could it be an OOM problem for the CPU RAM? I'm using the "n1-highcpu-16" machine type, which I just realised has only 14.4GB of RAM, and the cloud resource monitoring shows very high CPU memory utilization. In a local environment, when I run out of CPU RAM, I'm used to the program either spilling into swap and slowing down, or crashing immediately. Here it looks as if one process has overwritten memory where another process expected to find a filename.
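One way to confirm the OOM suspicion would be to log resident memory from inside the training process alongside the cloud metrics; a minimal sketch using psutil (nothing nnU-Net specific, names are illustrative):

    import os
    import threading
    import time

    import psutil  # assumed to be installed in the training environment


    def log_memory(interval_s=30.0):
        """Periodically print this process' RSS and the system-wide available RAM."""
        proc = psutil.Process(os.getpid())
        while True:
            rss_gb = proc.memory_info().rss / 1e9
            avail_gb = psutil.virtual_memory().available / 1e9
            print(f"[mem] rss={rss_gb:.1f} GB, available={avail_gb:.1f} GB", flush=True)
            time.sleep(interval_s)


    # start once before training; daemon=True so it dies with the main process
    threading.Thread(target=log_memory, daemon=True).start()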

I'll try switching to a machine type with more RAM, and if this issue doesn't crop up again, I guess that will answer my question.

FabianIsensee commented 2 years ago

Hi, honestly I don't know what is going on. This could be related to the file system (net/http: request canceled sounds strange to me) or, as you said, to out-of-memory issues. We have never had similar problems with the NFS file systems in our compute cluster. Has increasing the amount of RAM solved your problem? Best, Fabian

chrisrapson commented 2 years ago

Yeah, I haven't had this problem crop up again since increasing the RAM from 14.4GB to 60GB.

I also had another strange problem running inference which was resolved by increasing RAM. Again, there were no relevant error messages; it just seemed to hang. That was a local desktop PC with 32GB RAM and 30GB of swap. I thought that would be plenty, but then I am working with some fairly large CT scans. As a test, I added an unused SSD and mounted it as additional swap. Now everything is running smoothly, and the extra swap is being used. It looks like I'll be investing in some additional RAM there too.

That may be indicative of a memory leak, but I can't figure it out. I tried running the inference in a debugger, and I could watch the RAM usage increase during this loop: https://github.com/MIC-DKFZ/nnUNet/blob/fd58e25a304e2ab0cd4c16d2c79505c2882e3593/nnunet/network_architecture/neural_network.py#L374-L394 I think that loop writes into pre-allocated memory, so I didn't understand why the memory usage kept growing.

FabianIsensee commented 2 years ago

Hi, I have tested all this code extensively and I can assure you there is no memory leak. The issue lies in the size of the data. The prediction itself is not very RAM intensive (unless you have loads of classes) and your system should be more than enough. The resampling to the original resolution, however, is! If you are unable to run the inference, you might want to consider switching from the default nnU-Net resampling scheme to something like nearest neighbor interpolation.
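For a sense of scale, here is a quick back-of-the-envelope calculation (the numbers are hypothetical, just to show the order of magnitude):

    # hypothetical case: 3 classes, CT of 512 x 512 x 800 voxels at original spacing, float32 softmax
    num_classes = 3
    original_shape = (512, 512, 800)
    bytes_per_voxel = 4  # float32 (2 if the softmax is kept as float16)

    voxels = original_shape[0] * original_shape[1] * original_shape[2]
    softmax_gb = num_classes * voxels * bytes_per_voxel / 1e9
    print(f"softmax at original resolution: ~{softmax_gb:.1f} GB")  # roughly 2.5 GB for this one image
    # the resampling additionally needs the low-resolution softmax plus working copies,
    # and several images may be exported in parallel by the background workers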

In predict_cases you will find the following lines:

        results.append(pool.starmap_async(save_segmentation_nifti_from_softmax,
                                          ((softmax, output_filename, dct, interpolation_order, region_class_order,
                                            None, None,
                                            npz_file, None, force_separate_z, interpolation_order_z),)
                                          ))

Set interpolation_order to 0 and force_separate_z to False and your memory footprint will be reduced, at the cost of coarser segmentations (much more so in 3d_lowres than in 3d_fullres). (Note: only touch the interpolation parameters if the other solutions below do not solve the problem!)
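For illustration, the edited call would then look roughly like this (same positional arguments as above, only the two values changed):

        results.append(pool.starmap_async(save_segmentation_nifti_from_softmax,
                                          ((softmax, output_filename, dct, 0, region_class_order,  # interpolation_order -> 0
                                            None, None,
                                            npz_file, None, False, interpolation_order_z),)  # force_separate_z -> False
                                          ))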

There is another thing that can happen which is the GPU predicting the images faster than the background workers can export them. In this case a lot of softmax predictions need to be stored in RAM. You can temporarily circumvent this by just saving them to disk. Again in predict_cases there are the following lines:

        bytes_per_voxel = 4
        if all_in_gpu:
            bytes_per_voxel = 2  # if all_in_gpu then the return value is half (float16)
        if np.prod(softmax.shape) > (2e9 / bytes_per_voxel * 0.85):  # * 0.85 just to be save
            print(
                "This output is too large for python process-process communication. Saving output temporarily to disk")
            np.save(output_filename[:-7] + ".npy", softmax)
            softmax = output_filename[:-7] + ".npy"

If you set the threshold 2e9 / bytes_per_voxel * 0.85 to 0, all predictions will be saved to disk instead of being kept in RAM. A fast disk (SSD) is recommended.
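Concretely, the change amounts to replacing the threshold in the condition above, e.g.:

        if np.prod(softmax.shape) > 0:  # threshold set to 0: every softmax is written to disk first
            print(
                "This output is too large for python process-process communication. Saving output temporarily to disk")
            np.save(output_filename[:-7] + ".npy", softmax)
            softmax = output_filename[:-7] + ".npy"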

Finally, you may want to reduce the number of workers used for resampling segmentations: --num_threads_nifti_save XX.

It could be that the export/resampling runs for much longer than the inference itself, so check top/htop for CPU activity to see whether things are actually hanging or just still working.

Best, Fabian

chrisrapson commented 2 years ago

Thanks for all the suggestions - and in general, all the time you put in to respond to people here! I'll try them out and let you know how it turns out.

FabianIsensee commented 2 years ago

Always happy to help!

dojoh commented 1 year ago

Hello,

Is this question still relevant? Otherwise I would close the issue.

Cheers Ole

chrisrapson commented 1 year ago

The original question is resolved. The follow-up question about RAM usage during inference was still not clear to me. I think the RAM usage was increasing before reaching the post-processing stage or the export to npy/nifti. I never did figure it out completely, and ended up just sweeping it under the rug by providing enough RAM.

Pragmatically, I'm happy for the issue to be marked solved, and I can open it again later if I have more information to add.

FabianIsensee commented 1 year ago

Hey, so the bulk of the RAM usage will be caused by the resampling of the softmax probabilities to the original image shape. This typically happens towards the end, before the postprocessing. It can be reduced by reducing the number of workers used for exporting segmentations. This will, however, increase the run time. Does that answer your question?

chrisrapson commented 1 year ago

Yes, I've seen the RAM usage spike during post-processing, in the function that exports to nifti.

If I remember right (it was a year ago now, so take it with a grain of salt) I tried running inference for just one image, and set a breakpoint in the for loop above. Each time through the loop, the RAM usage increased by a small amount. While paused, the RAM was unchanged, so I don't think there was another process or thread causing it.

But I never looked into it in enough detail to provide a useful breakdown. It was easier just to make the problem go away by adding more RAM. Especially since it doesn't seem to affect anybody else. I'm happy to close this ticket.

FabianIsensee commented 1 year ago

By loop I presume you refer to this? https://github.com/MIC-DKFZ/nnUNet/blob/fd58e25a304e2ab0cd4c16d2c79505c2882e3593/nnunet/network_architecture/neural_network.py#L374-L394

This loop is the one where all the predictions are generated. In principle the arrays we are writing to should be preallocated, but in practice torch does not reserve the required memory up front (at least that's my assumption based on my observations). This means it is expected that memory usage increases as more and more of the target image gets predicted.
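This lazy committing of memory is easy to reproduce in isolation; here is a small, self-contained experiment (using numpy as a stand-in, since the exact allocator behaviour depends on the platform and backend):

    import os

    import numpy as np
    import psutil  # only used to read the process RSS


    def rss_gb():
        return psutil.Process(os.getpid()).memory_info().rss / 1e9


    print(f"baseline        : {rss_gb():.2f} GB")
    buf = np.zeros((4, 512, 512, 512), dtype=np.float32)  # "preallocated" ~2.1 GB buffer
    print(f"after np.zeros  : {rss_gb():.2f} GB")  # often barely higher: the pages are not committed yet
    buf[...] = 1.0  # touch every page, like the sliding-window loop gradually does
    print(f"after writing   : {rss_gb():.2f} GB")  # now the full ~2.1 GB is resident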

Are you seeing the same problem with nnU-Net v2?

chrisrapson commented 1 year ago

Yes, that's what I was referring to. It would make sense if that's known torch behaviour.

I haven't made the switch to v2 yet, so I can't comment.