MIC-DKFZ / nnUNet

Apache License 2.0

Stuck on unpacking dataset when submitting to SLURM #2155

Closed icekang closed 5 months ago

icekang commented 5 months ago

I ran into a very strange problem where unpacking the dataset took forever (I checked, and all the data had already been extracted, so something was holding it back). When I ran an interactive job on SLURM, this did not happen and training proceeded normally.

When I submit the job

/home/gridsan/nchutisilp/.conda/envs/nnunet/bin/python
wandb: Tracking run with wandb version 0.16.6
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-05-02 11:00:54.921198: do_dummy_2d_data_aug: False
using pin_memory on device 0
using pin_memory on device 0
2024-05-02 11:01:06.563629: Using torch.compile...
/home/gridsan/nchutisilp/.local/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
  warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "

This is the configuration used by this training:
Configuration name: 3d_fullres
 {'data_identifier': 'nnUNetPreprocessPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [112, 160, 128], 'median_image_size_in_voxels': [375.0, 498.0, 498.0], 'spacing': [1.0, 1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [False], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.PlainConvUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 320, 320], 'conv_op': 'torch.nn.modules.conv.Conv3d', 'kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'strides': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 2]], 'n_conv_per_stage': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm3d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}, 'deep_supervision': True}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': True}

These are the global plan.json settings:
 {'dataset_name': 'Dataset300_Lumen_and_Wall_OCT', 'plans_name': 'nnUNetPreprocessPlans', 'original_median_spacing_after_transp': [1.0, 1.0, 1.0], 'original_median_shape_after_transp': [375, 498, 498], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 0.4977976381778717, 'mean': 0.13332499563694, 'median': 0.08871842920780182, 'min': 0.0, 'percentile_00_5': 0.014754901640117168, 'percentile_99_5': 0.4977976381778717, 'std': 0.12022686749696732}}}

2024-05-02 11:01:08.285077: unpacking dataset...
slurmstepd-d-8-1-2: error: *** JOB 25932259 ON d-8-1-2 CANCELLED AT 2024-05-02T11:05:11 ***

When I ran interactively

/home/gridsan/nchutisilp/.conda/envs/nnunet/bin/python
Using device: cuda:0
wandb: Tracking run with wandb version 0.16.6
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-05-02 11:06:01.946635: do_dummy_2d_data_aug: False
using pin_memory on device 0
using pin_memory on device 0
2024-05-02 11:06:06.463378: Using torch.compile...
/home/gridsan/nchutisilp/.local/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
  warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "

This is the configuration used by this training:
Configuration name: 3d_fullres
 {'data_identifier': 'nnUNetPreprocessPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [112, 160, 128], 'median_image_size_in_voxels': [375.0, 498.0, 498.0], 'spacing': [1.0, 1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [False], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.PlainConvUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 320, 320], 'conv_op': 'torch.nn.modules.conv.Conv3d', 'kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'strides': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 2]], 'n_conv_per_stage': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm3d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}, 'deep_supervision': True}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': True}

These are the global plan.json settings:
 {'dataset_name': 'Dataset300_Lumen_and_Wall_OCT', 'plans_name': 'nnUNetPreprocessPlans', 'original_median_spacing_after_transp': [1.0, 1.0, 1.0], 'original_median_shape_after_transp': [375, 498, 498], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 0.4977976381778717, 'mean': 0.13332499563694, 'median': 0.08871842920780182, 'min': 0.0, 'percentile_00_5': 0.014754901640117168, 'percentile_99_5': 0.4977976381778717, 'std': 0.12022686749696732}}}

2024-05-02 11:06:08.217497: unpacking dataset...
2024-05-02 11:06:13.943869: unpacking done...
2024-05-02 11:06:13.970782: Unable to plot network architecture: nnUNet_compile is enabled!
2024-05-02 11:06:14.154500:
2024-05-02 11:06:14.157349: Epoch 0
2024-05-02 11:06:14.158883: Current learning rate: 0.01
2024-05-02 11:08:49.526009: Validation loss improved from 1000.00000 to -0.18658! Patience: 0/50
2024-05-02 11:08:49.528446: train_loss 0.0436
2024-05-02 11:08:49.530672: val_loss -0.1866
2024-05-02 11:08:49.531942: Pseudo dice [0.8128, 0.7267]
2024-05-02 11:08:49.533060: Epoch time: 155.38 s
2024-05-02 11:08:49.534036: Yayy! New best EMA pseudo Dice: 0.7698

2024-05-02 11:10:03.177167: Epoch 2
2024-05-02 11:10:03.179004: Current learning rate: 0.00998
2024-05-02 11:11:12.598394: Validation loss improved from -0.39275 to -0.45069! Patience: 0/50
2024-05-02 11:11:12.599910: train_loss -0.3798
2024-05-02 11:11:12.601211: val_loss -0.4507
2024-05-02 11:11:12.602471: Pseudo dice [0.9053, 0.7822]
2024-05-02 11:11:12.603442: Epoch time: 69.43 s
2024-05-02 11:11:12.604787: Yayy! New best EMA pseudo Dice: 0.7825
2024-05-02 11:11:14.180511:
2024-05-02 11:11:14.182289: Epoch 3
2024-05-02 11:11:14.183425: Current learning rate: 0.00997
2024-05-02 11:12:23.739094: Validation loss did not improve from -0.45069. Patience: 1/50
2024-05-02 11:12:23.741414: train_loss -0.4171
2024-05-02 11:12:23.743047: val_loss -0.4401
2024-05-02 11:12:23.744175: Pseudo dice [0.909, 0.7657]
2024-05-02 11:12:23.745358: Epoch time: 69.56 s
2024-05-02 11:12:23.746772: Yayy! New best EMA pseudo Dice: 0.7879
...

This is the script I used for both the SLURM job submission and the interactive run:

#!/bin/bash

# Activating the conda environment
source activate nnunet
which python

# Setup env variables nn_UNet
export nnUNet_raw="/home/gridsan/nchutisilp/datasets/nnUNet_Datasets/nnUNet_raw"
export nnUNet_preprocessed="/home/gridsan/nchutisilp/datasets/nnUNet_Datasets/nnUNet_preprocessed"
export nnUNet_results="/home/gridsan/nchutisilp/datasets/nnUNet_Datasets/nnUNet_results"

# run the script
# nnUNetv2_preprocess -d 300 -plans_name nnUNetPreprocessPlans -c 2d 3d_fullres -np 8 4 --verbose
nnUNetv2_train 300 3d_fullres all -p nnUNetPreprocessPlans
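
For the batch case, the script above is submitted via sbatch. A minimal wrapper would look roughly like the following; the job name, resources, time limit, output path, and script file name below are placeholders, not the exact values I used:

#!/bin/bash
#SBATCH --job-name=nnunet_300_3d_fullres   # placeholder job name
#SBATCH --gres=gpu:1                       # one GPU, nnUNetv2_train uses a single device
#SBATCH --cpus-per-task=12                 # the data augmentation workers need several CPU cores
#SBATCH --time=48:00:00                    # placeholder time limit
#SBATCH --output=slurm_%j.out              # where SLURM writes stdout/stderr for the job

# run the training script shown above (file name is a placeholder)
bash run_nnunet_train.sh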
icekang commented 5 months ago

It was a misunderstanding on my part. I was monitoring the log that SLURM captures, but the output of the nnU-Net training process is redirected to its own text file, so I think the SLURM log simply did not capture it.
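
In case it helps anyone else, two standard SLURM commands were useful for checking where stdout actually goes and how a job ended (the job ID below is just the one from my log above):

# where is SLURM writing stdout/stderr for this job?
scontrol show job 25932259 | grep -E 'StdOut|StdErr'

# after the fact: how did the job end (CANCELLED, TIMEOUT, COMPLETED, ...)?
sacct -j 25932259 --format=JobID,State,Elapsed,Timelimit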

moeinheidari7829 commented 5 months ago

It was a misunderstanding on my part. I was monitoring the log that SLURM captures, but the output of the nnU-Net training process is redirected to its own text file, so I think the SLURM log simply did not capture it.

I am having exactly the same problem. Could you please clarify what solved the issue? Thanks.

FabianIsensee commented 5 months ago

In a cluster environment the reported output (stdout) is often incomplete. This is why we have the training log files. Please just look at those and the progress.png plots to assess whether a job is running correctly.
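
For example (the exact folder names depend on your dataset, trainer and plans; the values below just mirror this issue and assume the default nnUNetTrainer):

# nnU-Net writes its own training log and progress plot into the results folder
RESULTS="$nnUNet_results/Dataset300_Lumen_and_Wall_OCT/nnUNetTrainer__nnUNetPreprocessPlans__3d_fullres/fold_all"

tail -f "$RESULTS"/training_log_*.txt   # live training log
ls -l "$RESULTS"/progress.png           # regenerated as training progresses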

moeinheidari7829 commented 5 months ago

In a cluster environment the reported output (stdout) is often incomplete. This is why we have the training log files. Please just look at those and the progress.png plots to assess whether a job is running correctly.

Thank you for your prompt response. The problem is that the training gets stuck at the following lines, and nothing changes in the log or progress.png files:

2024-05-06 22:26:35.217165: unpacking done...
2024-05-06 22:26:35.218131: do_dummy_2d_data_aug: False
2024-05-06 22:26:35.286218: Unable to plot network architecture:
2024-05-06 22:26:35.286608: No module named 'hiddenlayer'
2024-05-06 22:26:35.416393:
2024-05-06 22:26:35.416793: Epoch 0
2024-05-06 22:26:35.417209: Current learning rate: 0.01
using pin_memory on device 0

However, when using interactive GPUs, the process runs fine and the model trains.