MIC-DKFZ / TractSeg

Automatic White Matter Bundle Segmentation
Apache License 2.0

Error in train endings_segmentation. #257

Closed czq0827 closed 1 year ago

czq0827 commented 1 year ago

Hi, I am currently training endings_segmentation, and an error occurs during validation in the first epoch. However, tract_segmentation trains without any problems. Have you encountered a similar situation, and how should I solve this problem? Looking forward to your reply. Thank you very much.

Here are the parameter settings and the full error output:

Hyperparameters: {'BATCH_NORM': False, 'BATCH_SIZE': 7, 'BEST_EPOCH': 0, 'BEST_EPOCH_SELECTION': 'f1', 'CALC_F1': True, 'CLASSES': 'All_endpoints', 'CSD_RESOLUTION': 'LOW', 'CV_FOLD': 0, 'DATASET': 'HCP', 'DATASET_FOLDER': 'HCP_preproc_endingseg', 'DATA_AUGMENTATION': True, 'DAUG_ALPHA': (90.0, 120.0), 'DAUG_BLUR_SIGMA': (0, 1), 'DAUG_ELASTIC_DEFORM': True, 'DAUG_FLIP_PEAKS': False, 'DAUG_GAUSSIAN_BLUR': True, 'DAUG_INFO': '-', 'DAUG_MIRROR': False, 'DAUG_NOISE': True, 'DAUG_NOISE_VARIANCE': (0, 0.05), 'DAUG_RESAMPLE': False, 'DAUG_RESAMPLE_LEGACY': False, 'DAUG_ROTATE': False, 'DAUG_ROTATE_ANGLE': (-0.2, 0.2), 'DAUG_SCALE': True, 'DAUG_SIGMA': (9.0, 11.0), 'DIM': '2D', 'DROPOUT_SAMPLING': False, 'EPOCH_MULTIPLIER': 1, 'EXPERIMENT_TYPE': 'endings_segmentation', 'EXP_MULTI_NAME': '', 'EXP_NAME': 'my_custom_experiment', 'EXP_PATH': '/home/bzyl/Downloads/hcp_exp_nodes/my_custom_experiment_x5', 'FEATURES_FILENAME': 'peaks', 'FLIP_OUTPUT_PEAKS': False, 'FP16': True, 'GET_PROBS': False, 'INFO': '-', 'INPUT_DIM': (144, 144), 'INPUT_RESCALING': False, 'KEEP_INTERMEDIATE_FILES': False, 'LABELS_FILENAME': 'endpoints_72_ordered', 'LABELS_FOLDER': 'bundle_masks', 'LABELS_TYPE': 'int', 'LEARNING_RATE': 0.001, 'LOAD_WEIGHTS': False, 'LOSS_FUNCTION': 'default', 'LOSS_WEIGHT': 5, 'LOSS_WEIGHT_LEN': -1, 'LR_SCHEDULE': True, 'LR_SCHEDULE_MODE': 'min', 'LR_SCHEDULE_PATIENCE': 20, 'METRIC_TYPES': ['loss', 'f1_macro'], 'MODEL': 'UNet_Pytorch_DeepSup', 'MULTI_PARENT_PATH': '/home/bzyl/Downloads/hcp_exp_nodes/', 'NORMALIZE_DATA': True, 'NORMALIZE_PER_CHANNEL': False, 'NR_CPUS': -1, 'NR_OF_CLASSES': 144, 'NR_OF_GRADIENTS': 9, 'NR_SLICES': 1, 'NUM_EPOCHS': 150, 'ONLY_VAL': False, 'OPTIMIZER': 'Adamax', 'OUTPUT_MULTIPLE_FILES': False, 'PAD_TO_SQUARE': True, 'PEAK_DICE_LEN_THR': 0.05, 'PEAK_DICE_THR': [0.95], 'PREDICT_IMG': False, 'PREDICT_IMG_OUTPUT': None, 'PRINT_FREQ': 20, 'P_SAMP': 1.0, 'RESET_LAST_LAYER': False, 'RESOLUTION': '1.25mm', 'SAVE_WEIGHTS': True, 'SEGMENT': False, 'SEG_INPUT': 'Peaks', 'SLICE_DIRECTION': 'y', 'SPATIAL_TRANSFORM': 'SpatialTransform', 'TEST': False, 'TEST_TIME_DAUG': False, 'THRESHOLD': 0.5, 'TRACTSEG_DIR': 'tractseg_output', 'TRAIN': True, 'TRAINING_SLICE_DIRECTION': 'xyz', 'TYPE': 'single_direction', 'UNET_NR_FILT': 64, 'UPSAMPLE_TYPE': 'bilinear', 'USE_DROPOUT': False, 'USE_VISLOGGER': False, 'VERBOSE': True, 'WEIGHTS_PATH': '', 'WEIGHT_DECAY': 0}

INFO: Did not find APEX, defaulting to fp32 training
Training...
bzyl
Start looping batches...
using pin_memory on device 0
train Ep 0, Sp 140, loss 0.262206, t print 11.725s, t batch 0.586s
train Ep 0, Sp 280, loss 0.041395, t print 2.57s, t batch 0.129s
train Ep 0, Sp 420, loss 0.031893, t print 7.075s, t batch 0.354s
train Ep 0, Sp 560, loss 0.026479, t print 5.663s, t batch 0.283s
train Ep 0, Sp 700, loss 0.022037, t print 5.011s, t batch 0.251s
train Ep 0, Sp 840, loss 0.019789, t print 5.303s, t batch 0.265s
train Ep 0, Sp 980, loss 0.021634, t print 5.766s, t batch 0.288s
train Ep 0, Sp 1120, loss 0.02191, t print 4.303s, t batch 0.215s
train Ep 0, Sp 1260, loss 0.024494, t print 5.506s, t batch 0.275s
train Ep 0, Sp 1400, loss 0.02185, t print 6.283s, t batch 0.314s
train Ep 0, Sp 1540, loss 0.022225, t print 5.722s, t batch 0.286s
train Ep 0, Sp 1680, loss 0.021149, t print 3.564s, t batch 0.178s
train Ep 0, Sp 1820, loss 0.020806, t print 6.017s, t batch 0.301s
train Ep 0, Sp 1960, loss 0.018725, t print 5.636s, t batch 0.282s
train Ep 0, Sp 2100, loss 0.017874, t print 3.965s, t batch 0.198s
train Ep 0, Sp 2240, loss 0.020219, t print 5.32s, t batch 0.266s
train Ep 0, Sp 2380, loss 0.017503, t print 5.65s, t batch 0.283s
train Ep 0, Sp 2520, loss 0.019123, t print 5.741s, t batch 0.287s
train Ep 0, Sp 2660, loss 0.019444, t print 4.002s, t batch 0.2s
train Ep 0, Sp 2800, loss 0.018482, t print 5.549s, t batch 0.277s
train Ep 0, Sp 2940, loss 0.018021, t print 7.013s, t batch 0.351s
train Ep 0, Sp 3080, loss 0.017422, t print 4.997s, t batch 0.25s
train Ep 0, Sp 3220, loss 0.01777, t print 3.493s, t batch 0.175s
train Ep 0, Sp 3360, loss 0.018739, t print 5.823s, t batch 0.291s
train Ep 0, Sp 3500, loss 0.016395, t print 5.592s, t batch 0.28s
train Ep 0, Sp 3640, loss 0.016898, t print 4.969s, t batch 0.248s
train Ep 0, Sp 3780, loss 0.015616, t print 6.844s, t batch 0.342s
train Ep 0, Sp 3920, loss 0.019232, t print 2.192s, t batch 0.11s
train Ep 0, Sp 4060, loss 0.018167, t print 5.171s, t batch 0.259s
train Ep 0, Sp 4200, loss 0.017297, t print 5.092s, t batch 0.255s
train Ep 0, Sp 4340, loss 0.018057, t print 6.465s, t batch 0.323s
train Ep 0, Sp 4480, loss 0.017742, t print 6.717s, t batch 0.336s
train Ep 0, Sp 4620, loss 0.016358, t print 3.163s, t batch 0.158s
train Ep 0, Sp 4760, loss 0.01598, t print 4.128s, t batch 0.206s
train Ep 0, Sp 4900, loss 0.017195, t print 5.455s, t batch 0.273s
train Ep 0, Sp 5040, loss 0.015829, t print 8.734s, t batch 0.437s
train Ep 0, Sp 5180, loss 0.016181, t print 4.237s, t batch 0.212s
train Ep 0, Sp 5320, loss 0.014548, t print 3.529s, t batch 0.176s
train Ep 0, Sp 5460, loss 0.017627, t print 5.405s, t batch 0.27s
train Ep 0, Sp 5600, loss 0.01698, t print 7.771s, t batch 0.389s
train Ep 0, Sp 5740, loss 0.015092, t print 3.364s, t batch 0.168s
train Ep 0, Sp 5880, loss 0.01646, t print 4.018s, t batch 0.201s
train Ep 0, Sp 6020, loss 0.015678, t print 5.611s, t batch 0.281s
train Ep 0, Sp 6160, loss 0.015784, t print 4.504s, t batch 0.225s
train Ep 0, Sp 6300, loss 0.015868, t print 7.698s, t batch 0.385s
train Ep 0, Sp 6440, loss 0.013551, t print 3.726s, t batch 0.186s
train Ep 0, Sp 6580, loss 0.015582, t print 4.479s, t batch 0.224s
train Ep 0, Sp 6720, loss 0.014889, t print 6.648s, t batch 0.332s
train Ep 0, Sp 6860, loss 0.017025, t print 5.984s, t batch 0.299s
train Ep 0, Sp 7000, loss 0.015481, t print 3.797s, t batch 0.19s
train Ep 0, Sp 7140, loss 0.014413, t print 4.802s, t batch 0.24s
train Ep 0, Sp 7280, loss 0.016206, t print 6.212s, t batch 0.311s
train Ep 0, Sp 7420, loss 0.014085, t print 5.179s, t batch 0.259s
train Ep 0, Sp 7560, loss 0.013343, t print 4.464s, t batch 0.223s
train Ep 0, Sp 7700, loss 0.017172, t print 5.497s, t batch 0.275s
train Ep 0, Sp 7840, loss 0.014455, t print 3.35s, t batch 0.168s
train Ep 0, Sp 7980, loss 0.013615, t print 7.44s, t batch 0.372s
train Ep 0, Sp 8120, loss 0.013964, t print 3.377s, t batch 0.169s
train Ep 0, Sp 8260, loss 0.014325, t print 3.199s, t batch 0.16s
train Ep 0, Sp 8400, loss 0.014094, t print 5.853s, t batch 0.293s
train Ep 0, Sp 8540, loss 0.014687, t print 4.994s, t batch 0.25s
train Ep 0, Sp 8680, loss 0.012286, t print 7.597s, t batch 0.38s
train Ep 0, Sp 8820, loss 0.014523, t print 4.411s, t batch 0.221s
train Ep 0, Sp 8960, loss 0.014025, t print 3.619s, t batch 0.181s
Start looping batches...
OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
using pin_memory on device 0

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/bzyl/software/anaconda3/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/bzyl/software/anaconda3/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/bzyl/batchgenerators/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Traceback (most recent call last):
  File "/home/bzyl/software/anaconda3/bin/ExpRunner", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/bzyl/TractSeg/bin/ExpRunner", line 199, in <module>
    main()
  File "/home/bzyl/TractSeg/bin/ExpRunner", line 135, in main
    trainer.train_model(Config, model, data_loader)
  File "/home/bzyl/TractSeg/tractseg/libs/trainer.py", line 105, in train_model
    batch = next(batch_gen_train) if type == "train" else next(batch_gen_val)
  File "/home/bzyl/batchgenerators/batchgenerators/dataloading/multi_threaded_augmenter.py", line 204, in __next__
    item = self.get_next_item()
  File "/home/bzyl/batchgenerators/batchgenerators/dataloading/multi_threaded_augmenter.py", line 189, in get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/bzyl/software/anaconda3/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/bzyl/software/anaconda3/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/bzyl/batchgenerators/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message