MIC-DKFZ / nnUNet


Prediction gets stuck when encountering large files #1828

Closed goodsave closed 12 months ago

goodsave commented 1 year ago


When I run the nnUNet_predict command on the host, a large file may take a bit longer, but the prediction ultimately completes successfully and writes the output NIfTI file.

But when I run nnUNet_predict inside the Docker container, it hangs after printing "separate z: False lowres axis None". I checked, and the Docker container was using almost no CPU at that point, as shown in the following figure:

(screenshot: nnunet-bug)

When I force the process to end, the following is printed (screenshot: nnunet-bug2):

It appears to be stuck on this line: `[i.get() for i in results]`

The corresponding code should be this block:

    bytes_per_voxel = 4
    if all_in_gpu:
        bytes_per_voxel = 2  # if all_in_gpu then the return value is half (float16)
    if np.prod(softmax.shape) > (2e9 / bytes_per_voxel * 0.85):  # * 0.85 just to be save
        print(
            "This output is too large for python process-process communication. Saving output temporarily to disk")
        np.save(output_filename[:-7] + ".npy", softmax)
        softmax = output_filename[:-7] + ".npy"

    results.append(pool.starmap_async(save_segmentation_nifti_from_softmax,
                                      ((softmax, output_filename, dct, interpolation_order, region_class_order,
                                        None, None,
                                        npz_file, None, force_separate_z, interpolation_order_z),)
                                      ))

    print("inference done. Now waiting for the segmentation export to finish...")
    _ = [i.get() for i in results]
    # now apply postprocessing
    # first load the postprocessing properties if they are present. Else raise a well visible warning
    if not disable_postprocessing:
        results = []
        pp_file = join(model, "postprocessing.json")
        if isfile(pp_file):
            print("postprocessing...")

The complete log is as follows:

Please cite the following paper when using nnUNet:

Isensee, F., Jaeger, P.F., Kohl, S.A.A. et al. "nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation." Nat Methods (2020). https://doi.org/10.1038/s41592-020-01008-z

If you have questions or suggestions, feel free to open an issue at https://github.com/MIC-DKFZ/nnUNet

using model stored in /home/allen/nnUNet/RESULTS_FOLDER/nnUNet/3d_fullres/Task301_HeadNeckOAR_20210722/nnUNetTrainerV2__nnUNetPlansv2.1
This model expects 1 input modalities for each image
Found 1 unique case ids, here are some examples: ['DATA']
If they don't look right, make sure to double check your filenames. They must end with _0000.nii.gz etc
number of cases: 1
number of cases that still need to be predicted: 1
emptying cuda cache
loading parameters for folds, ['all']
2023-12-04 15:53:28.348381: Using dummy2d data augmentation
using the following model files: ['/home/allen/nnUNet/RESULTS_FOLDER/nnUNet/3d_fullres/Task301_HeadNeckOAR_20210722/nnUNetTrainerV2__nnUNetPlansv2.1/all/model_final_checkpoint.model']
starting preprocessing generator
starting prediction...
preprocessing /home/allen/DATA.nii.gz
using preprocessor GenericPreprocessor
before crop: (1, 195, 512, 512) after crop: (1, 195, 512, 512) spacing: [2.50000095 1.08398402 1.08398402]

no separate z, order 3
no separate z, order 1
before: {'spacing': array([2.50000095, 1.08398402, 1.08398402]), 'spacing_transposed': array([2.50000095, 1.08398402, 1.08398402]), 'data.shape (data is transposed)': (1, 195, 512, 512)}
after: {'spacing': array([3. , 1.16308594, 1.16308594]), 'data.shape (data is resampled)': (1, 163, 477, 477)}

(1, 163, 477, 477)
This worker has ended successfully, no errors to report
predicting /home/allen/DATA.nii.gz
debug: mirroring True mirror_axes (0, 1, 2)
step_size: 0.5
do mirror: True
data shape: (1, 163, 477, 477)
patch size: [ 48 192 192]
steps (x, y, and z): [[0, 23, 46, 69, 92, 115], [0, 95, 190, 285], [0, 95, 190, 285]]
number of tiles: 96
computing Gaussian
prediction done
This output is too large for python process-process communication. Saving output temporarily to disk
inference done. Now waiting for the segmentation export to finish...
force_separate_z: None interpolation order: 1
separate z: False lowres axis None
no separate z, order 1
^CProcess ForkPoolWorker-4:
Process ForkPoolWorker-2:
Traceback (most recent call last):
  File "/usr/local/bin/nnUNet_predict", line 33, in <module>
    sys.exit(load_entry_point('nnunet', 'console_scripts', 'nnUNet_predict')())
  File "/code/docker/nnUNet-master/nnunet/inference/predict_simple.py", line 221, in main
    step_size=step_size, checkpoint_name=args.chk)
  File "/code/docker/nnUNet-master/nnunet/inference/predict.py", line 664, in predict_from_folder
    disable_postprocessing=disable_postprocessing)
  File "/code/docker/nnUNet-master/nnunet/inference/predict.py", line 269, in predict_cases
    _ = [i.get() for i in results]
  File "/code/docker/nnUNet-master/nnunet/inference/predict.py", line 269, in <listcomp>
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    _ = [i.get() for i in results]
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 638, in get
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
KeyboardInterrupt
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
    self.wait(timeout)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 635, in wait
    self._event.wait(timeout)
  File "/usr/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()
KeyboardInterrupt
2

Could you help me solve this problem?

ancestor-mithril commented 1 year ago

Check https://github.com/pytorch/pytorch#docker-image. Maybe your docker image has limited shared memory.
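If limited shared memory is the cause, one quick check from inside the container is the size of `/dev/shm` (Docker caps it at 64 MB by default); it can be raised when starting the container with `docker run --shm-size=...` or `--ipc=host`. A minimal sketch, assuming a standard Python 3 environment inside the container:

```python
import shutil

# /dev/shm is the shared-memory mount used by Python multiprocessing / PyTorch
# worker processes; Docker's default size is 64 MB unless --shm-size or
# --ipc=host is passed at container start.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")
```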

goodsave commented 12 months ago

Thank you! My issue was caused by the NFS file system, and I have resolved it. Thanks for your help.
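For anyone hitting the same symptom: per the code quoted above, when the softmax is too large for process-to-process communication, nnUNet writes it to a temporary `.npy` next to the output file and the export worker then has to read it back from disk, so a slow output mount (e.g. NFS) can make the export look like a hang. Below is a rough sketch for probing the write speed of the output directory; the paths and array shape are hypothetical, chosen only to roughly match the case size in this issue:

```python
import time

import numpy as np


def time_npy_write(path: str, shape=(8, 163, 477, 477)) -> float:
    """Time how long it takes to write a softmax-sized float32 array to path."""
    data = np.zeros(shape, dtype=np.float32)  # ~1.2 GB for this made-up shape
    start = time.time()
    np.save(path, data)
    return time.time() - start


# Compare a local scratch directory against the (NFS-mounted) output folder, e.g.:
# print(time_npy_write("/tmp/probe.npy"))
# print(time_npy_write("/path/to/nfs/output/probe.npy"))
```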