YixingHuang / DeepMedicPlus

Deep learning for brain metastasis detection and segmentation in longitudinal MRI data
BSD 3-Clause "New" or "Revised" License

Some trainings run out of memory even on a large GPU #9

Closed damiankucharski closed 1 year ago

damiankucharski commented 1 year ago

Hi @YixingHuang, I am getting out-of-memory errors when training on slightly larger datasets. Some training runs complete, but many fail, and the failures seem more or less random. I have 225 training subjects, so the dataset is not that large, and I am using a 40 GB NVIDIA A100 GPU, so memory should not be a problem. Do you think something in the code could be causing poor memory management? I am attaching the log of one of the failed training runs (truncated due to length).


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~           Starting new Epoch! Epoch #0/50            ~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

***********************************************************************************
*            Starting new Subepoch: #0/20           *
***********************************************************************************
[MAIN|PID:1932125] MULTIPROC: Before Training in subepoch #0, submitting sampling job for next [TRAINING].
[TRA|SAMPLER|PID:1932125] :=:=:=:=:=:=: Starting to sample for next [Training]... :=:=:=:=:=:=:
[TRA|SAMPLER|PID:1932125] Out of [210] subjects given for [Training], we will sample from maximum [50] per subepoch.
[TRA|SAMPLER|PID:1932125] Shuffled indices of subjects that were randomly chosen: [62, 162, 207, 44, 68, 19, 107, 95, 184, 199, 106, 14, 57, 111, 99, 100, 83, 32, 36, 186, 12, 134, 124, 115, 73, 121, 181, 122, 9, 204, 26, 10, 37, 168, 174, 79, 126, 82, 51, 56, 91, 164, 143, 112, 76, 58, 41, 141, 142, 43]
[TRA|SAMPLER|PID:1932125] Will sample from [50] subjects for next Training...
[TRA|SAMPLER|PID:1932125] ******* Spawning children processes to sample from [50] subjects*******
[TRA|SAMPLER|PID:1932125] MULTIPR: Number of CPUs detected: 64. Requested to use max: [20]
[TRA|SAMPLER|PID:1932125] MULTIPR: Spawning [20] processes to load and sample.
[TRA|JOB:14|PID:1932175] Started. (#14/50) sampling job. Load & sample from subject of index (in user's list): 99
[TRA|JOB:14|PID:1932175] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 2305_VISIT 9.nii.gz
[TRA|JOB:14|PID:1932175] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:14|PID:1932175] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:14|PID:1932175] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:14|PID:1932175] TIMING: [Load: 4.0] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.1] [Augm-Samples: 0.1] secs
[TRA|JOB:36|PID:1932175] Started. (#36/50) sampling job. Load & sample from subject of index (in user's list): 126
[TRA|JOB:36|PID:1932175] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 5701_VISIT 2.nii.gz
[TRA|JOB:36|PID:1932175] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:36|PID:1932175] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:36|PID:1932175] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:36|PID:1932175] TIMING: [Load: 2.5] [Preproc: 0.7] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:13|PID:1932174] Started. (#13/50) sampling job. Load & sample from subject of index (in user's list): 111
[TRA|JOB:13|PID:1932174] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4345_VISIT 2.nii.gz
[TRA|JOB:13|PID:1932174] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:13|PID:1932174] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:13|PID:1932174] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:13|PID:1932174] TIMING: [Load: 4.4] [Preproc: 0.5] [Augm-Img: 0.0] [Sample Coords: 0.3] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:35|PID:1932174] Started. (#35/50) sampling job. Load & sample from subject of index (in user's list): 79
[TRA|JOB:35|PID:1932174] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1541_POST-TREATMENT VISIT.nii.gz
[TRA|JOB:35|PID:1932174] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:35|PID:1932174] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:35|PID:1932174] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:35|PID:1932174] TIMING: [Load: 3.0] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:15|PID:1932176] Started. (#15/50) sampling job. Load & sample from subject of index (in user's list): 100
[TRA|JOB:15|PID:1932176] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1121_UNSCHEDULED 1.nii.gz
[TRA|JOB:15|PID:1932176] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:15|PID:1932176] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:15|PID:1932176] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:15|PID:1932176] TIMING: [Load: 3.7] [Preproc: 0.3] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:32|PID:1932176] Started. (#32/50) sampling job. Load & sample from subject of index (in user's list): 37
[TRA|JOB:32|PID:1932176] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 7843_VISIT 2.nii.gz
[TRA|JOB:32|PID:1932176] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:32|PID:1932176] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:32|PID:1932176] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:32|PID:1932176] TIMING: [Load: 3.6] [Preproc: 0.2] [Augm-Img: 0.0] [Sample Coords: 0.3] [Extract Sampl: 0.0] [Augm-Samples: 0.1] secs
[TRA|JOB:16|PID:1932177] Started. (#16/50) sampling job. Load & sample from subject of index (in user's list): 83
[TRA|JOB:16|PID:1932177] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 8421_VISIT 4.nii.gz
[TRA|JOB:16|PID:1932177] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:16|PID:1932177] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:16|PID:1932177] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:16|PID:1932177] TIMING: [Load: 2.9] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.6] [Extract Sampl: 0.0] [Augm-Samples: 0.1] secs
[TRA|JOB:31|PID:1932177] Started. (#31/50) sampling job. Load & sample from subject of index (in user's list): 10
[TRA|JOB:31|PID:1932177] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 5621_SCREENING.nii.gz
[TRA|JOB:31|PID:1932177] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:31|PID:1932177] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:31|PID:1932177] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:31|PID:1932177] TIMING: [Load: 3.2] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:7|PID:1932168] Started. (#7/50) sampling job. Load & sample from subject of index (in user's list): 95
[TRA|JOB:7|PID:1932168] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4344_VISIT 10.nii.gz
[TRA|JOB:7|PID:1932168] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:7|PID:1932168] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:7|PID:1932168] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:7|PID:1932168] TIMING: [Load: 2.8] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:27|PID:1932168] Started. (#27/50) sampling job. Load & sample from subject of index (in user's list): 122
[TRA|JOB:27|PID:1932168] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1502_VISIT 16.nii.gz
[TRA|JOB:27|PID:1932168] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:27|PID:1932168] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:27|PID:1932168] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:27|PID:1932168] TIMING: [Load: 3.8] [Preproc: 0.5] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:11|PID:1932172] Started. (#11/50) sampling job. Load & sample from subject of index (in user's list): 14
[TRA|JOB:11|PID:1932172] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 3963_VISIT 3.nii.gz
[TRA|JOB:11|PID:1932172] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:11|PID:1932172] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:11|PID:1932172] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:11|PID:1932172] TIMING: [Load: 4.0] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.5] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:37|PID:1932172] Started. (#37/50) sampling job. Load & sample from subject of index (in user's list): 82
[TRA|JOB:37|PID:1932172] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1763_VISIT 3.nii.gz
[TRA|JOB:37|PID:1932172] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:37|PID:1932172] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:37|PID:1932172] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:37|PID:1932172] TIMING: [Load: 2.7] [Preproc: 0.1] [Augm-Img: 0.0] [Sample Coords: 0.5] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:0|PID:1932161] Started. (#0/50) sampling job. Load & sample from subject of index (in user's list): 62
[TRA|JOB:0|PID:1932161] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4501_SCREENING.nii.gz
[TRA|JOB:0|PID:1932161] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:0|PID:1932161] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:0|PID:1932161] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:0|PID:1932161] TIMING: [Load: 2.1] [Preproc: 0.3] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:21|PID:1932161] Started. (#21/50) sampling job. Load & sample from subject of index (in user's list): 134
[TRA|JOB:21|PID:1932161] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 8421_VISIT 8.nii.gz
[TRA|JOB:21|PID:1932161] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:21|PID:1932161] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:21|PID:1932161] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:21|PID:1932161] TIMING: [Load: 2.9] [Preproc: 0.5] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:40|PID:1932161] Started. (#40/50) sampling job. Load & sample from subject of index (in user's list): 91
[TRA|JOB:40|PID:1932161] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 7605_VISIT 4.nii.gz
[TRA|JOB:40|PID:1932161] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:40|PID:1932161] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:40|PID:1932161] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:40|PID:1932161] TIMING: [Load: 2.9] [Preproc: 0.3] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:4|PID:1932165] Started. (#4/50) sampling job. Load & sample from subject of index (in user's list): 68
[TRA|JOB:4|PID:1932165] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 2305_SCREENING.nii.gz
[TRA|JOB:4|PID:1932165] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:4|PID:1932165] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:4|PID:1932165] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:4|PID:1932165] TIMING: [Load: 4.6] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:38|PID:1932165] Started. (#38/50) sampling job. Load & sample from subject of index (in user's list): 51
[TRA|JOB:38|PID:1932165] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4102_VISIT 3.nii.gz
[TRA|JOB:38|PID:1932165] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:38|PID:1932165] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:38|PID:1932165] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:38|PID:1932165] TIMING: [Load: 3.6] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:3|PID:1932164] Started. (#3/50) sampling job. Load & sample from subject of index (in user's list): 44
[TRA|JOB:3|PID:1932164] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4344_VISIT 6.nii.gz
[TRA|JOB:3|PID:1932164] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:3|PID:1932164] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:3|PID:1932164] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:3|PID:1932164] TIMING: [Load: 3.6] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.6] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:34|PID:1932164] Started. (#34/50) sampling job. Load & sample from subject of index (in user's list): 174
[TRA|JOB:34|PID:1932164] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 3963_POST-PROGRESSION VISIT 1.nii.gz
[TRA|JOB:34|PID:1932164] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:34|PID:1932164] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:34|PID:1932164] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:34|PID:1932164] TIMING: [Load: 4.0] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.3] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:9|PID:1932170] Started. (#9/50) sampling job. Load & sample from subject of index (in user's list): 199
[TRA|JOB:9|PID:1932170] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1561_VISIT 7.nii.gz
[TRA|JOB:9|PID:1932170] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:9|PID:1932170] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:9|PID:1932170] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:9|PID:1932170] TIMING: [Load: 4.2] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.3] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:33|PID:1932170] Started. (#33/50) sampling job. Load & sample from subject of index (in user's list): 168
[TRA|JOB:33|PID:1932170] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4344_VISIT 5.nii.gz
[TRA|JOB:33|PID:1932170] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:33|PID:1932170] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:33|PID:1932170] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:33|PID:1932170] TIMING: [Load: 4.3] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:5|PID:1932166] Started. (#5/50) sampling job. Load & sample from subject of index (in user's list): 19
[TRA|JOB:5|PID:1932166] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 7623_VISIT 2.nii.gz
[TRA|JOB:5|PID:1932166] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:5|PID:1932166] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:5|PID:1932166] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:5|PID:1932166] TIMING: [Load: 1.8] [Preproc: 0.2] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:20|PID:1932166] Started. (#20/50) sampling job. Load & sample from subject of index (in user's list): 12
[TRA|JOB:20|PID:1932166] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 5223_VISIT 4.nii.gz
[TRA|JOB:20|PID:1932166] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:20|PID:1932166] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:20|PID:1932166] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:20|PID:1932166] TIMING: [Load: 3.8] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:42|PID:1932166] Started. (#42/50) sampling job. Load & sample from subject of index (in user's list): 143
[TRA|JOB:42|PID:1932166] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1523_VISIT 2.nii.gz
[TRA|JOB:42|PID:1932166] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:42|PID:1932166] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:42|PID:1932166] WARN: Invalid sampling category! Sampling map just zeros! No [Class-1] samples from this subject!
[TRA|JOB:42|PID:1932166] Done. Samples per category: [Class-0: 20/20] 
[TRA|JOB:42|PID:1932166] TIMING: [Load: 3.0] [Preproc: 0.2] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:12|PID:1932173] Started. (#12/50) sampling job. Load & sample from subject of index (in user's list): 57
[TRA|JOB:12|PID:1932173] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 3607_VISIT 11.nii.gz
[TRA|JOB:12|PID:1932173] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:12|PID:1932173] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:12|PID:1932173] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:12|PID:1932173] TIMING: [Load: 3.0] [Preproc: 0.3] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.0] [Augm-Samples: 0.1] secs
[TRA|JOB:26|PID:1932173] Started. (#26/50) sampling job. Load & sample from subject of index (in user's list): 181
[TRA|JOB:26|PID:1932173] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 2862_SCREENING.nii.gz
[TRA|JOB:26|PID:1932173] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:26|PID:1932173] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:26|PID:1932173] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:26|PID:1932173] TIMING: [Load: 2.4] [Preproc: 0.2] [Augm-Img: 0.0] [Sample Coords: 0.5] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:45|PID:1932173] Started. (#45/50) sampling job. Load & sample from subject of index (in user's list): 58
[TRA|JOB:45|PID:1932173] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 3961_VISIT 2.nii.gz
[TRA|JOB:45|PID:1932173] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:45|PID:1932173] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:45|PID:1932173] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:45|PID:1932173] TIMING: [Load: 2.5] [Preproc: 0.2] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:2|PID:1932163] Started. (#2/50) sampling job. Load & sample from subject of index (in user's list): 207
[TRA|JOB:2|PID:1932163] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 3961_SCREENING.nii.gz
[TRA|JOB:2|PID:1932163] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:2|PID:1932163] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:2|PID:1932163] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:2|PID:1932163] TIMING: [Load: 5.0] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.5] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:39|PID:1932163] Started. (#39/50) sampling job. Load & sample from subject of index (in user's list): 56
[TRA|JOB:39|PID:1932163] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 5701_VISIT 6.nii.gz
[TRA|JOB:39|PID:1932163] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:39|PID:1932163] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:39|PID:1932163] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:39|PID:1932163] TIMING: [Load: 3.2] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:10|PID:1932171] Started. (#10/50) sampling job. Load & sample from subject of index (in user's list): 106
[TRA|JOB:10|PID:1932171] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 8421_VISIT 7.nii.gz
[TRA|JOB:10|PID:1932171] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:10|PID:1932171] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:10|PID:1932171] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:10|PID:1932171] TIMING: [Load: 2.7] [Preproc: 0.1] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:25|PID:1932171] Started. (#25/50) sampling job. Load & sample from subject of index (in user's list): 121
[TRA|JOB:25|PID:1932171] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 2305_VISIT 7.nii.gz
[TRA|JOB:25|PID:1932171] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:25|PID:1932171] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[...]
[TRA|JOB:47|PID:1932211] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:47|PID:1932211] TIMING: [Load: 3.0] [Preproc: 0.1] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
2023-03-07 13:03:03.204001: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:735] failed to allocate 6.59M (6914048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-03-07 13:03:03.206020: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:735] failed to allocate 6.59M (6914048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-03-07 13:03:13.208024: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:735] failed to allocate 6.59M (6914048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-03-07 13:03:13.209956: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:735] failed to allocate 6.59M (6914048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-03-07 13:03:13.209996: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 949.2KiB (rounded to 972032)requested by op gradients/trainer/Sum_12_grad/Tile/_2__cf__2
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-03-07 13:03:13.210015: I tensorflow/tsl/framework/bfc_allocator.cc:1034] BFCAllocator dump for GPU_0_bfc
2023-03-07 13:03:13.210030: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (256):    Total Chunks: 210, Chunks in use: 210. 52.5KiB allocated for chunks. 52.5KiB in use in bin. 22.9KiB client-requested in use in bin.
2023-03-07 13:03:13.210043: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (512):    Total Chunks: 128, Chunks in use: 128. 64.0KiB allocated for chunks. 64.0KiB in use in bin. 42.5KiB client-requested in use in bin.
2023-03-07 13:03:13.210055: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (1024):   Total Chunks: 25, Chunks in use: 25. 35.2KiB allocated for chunks. 35.2KiB in use in bin. 33.5KiB client-requested in use in bin.
2023-03-07 13:03:13.210068: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (2048):   Total Chunks: 19, Chunks in use: 19. 58.0KiB allocated for chunks. 58.0KiB in use in bin. 56.0KiB client-requested in use in bin.
2023-03-07 13:03:13.210080: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (4096):   Total Chunks: 8, Chunks in use: 8. 58.0KiB allocated for chunks. 58.0KiB in use in bin. 56.2KiB client-requested in use in bin.
2023-03-07 13:03:13.210092: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (8192):   Total Chunks: 24, Chunks in use: 24. 284.0KiB allocated for chunks. 284.0KiB in use in bin. 281.2KiB client-requested in use in bin.
2023-03-07 13:03:13.210104: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (16384):  Total Chunks: 32, Chunks in use: 32. 640.0KiB allocated for chunks. 640.0KiB in use in bin. 637.5KiB client-requested in use in bin.
2023-03-07 13:03:13.210120: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (32768):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210134: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (65536):  Total Chunks: 18, Chunks in use: 18. 1.96MiB allocated for chunks. 1.96MiB in use in bin. 1.96MiB client-requested in use in bin.
2023-03-07 13:03:13.210147: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (131072):     Total Chunks: 12, Chunks in use: 12. 2.47MiB allocated for chunks. 2.47MiB in use in bin. 2.47MiB client-requested in use in bin.
2023-03-07 13:03:13.210159: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (262144):     Total Chunks: 27, Chunks in use: 27. 9.93MiB allocated for chunks. 9.93MiB in use in bin. 9.93MiB client-requested in use in bin.
2023-03-07 13:03:13.210171: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (524288):     Total Chunks: 40, Chunks in use: 40. 29.04MiB allocated for chunks. 29.04MiB in use in bin. 29.03MiB client-requested in use in bin.
2023-03-07 13:03:13.210184: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (1048576):    Total Chunks: 13, Chunks in use: 13. 14.78MiB allocated for chunks. 14.78MiB in use in bin. 14.34MiB client-requested in use in bin.
2023-03-07 13:03:13.210195: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (2097152):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210207: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (4194304):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210218: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (8388608):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210229: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (16777216):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210240: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (33554432):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210252: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (67108864):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210263: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (134217728):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210274: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (268435456):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210287: I tensorflow/tsl/framework/bfc_allocator.cc:1057] Bin for 949.2KiB was 512.0KiB, Chunk State: 
2023-03-07 13:03:13.210298: I tensorflow/tsl/framework/bfc_allocator.cc:1070] Next region of size 62226432
2023-03-07 13:03:13.210312: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c000000 of size 3328 next 1
2023-03-07 13:03:13.210325: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c000d00 of size 129792 next 2
2023-03-07 13:03:13.210337: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c020800 of size 129792 next 3
2023-03-07 13:03:13.210348: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c040300 of size 216064 next 4
2023-03-07 13:03:13.210360: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c074f00 of size 324096 next 5
2023-03-07 13:03:13.210371: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c0c4100 of size 453632 next 6
2023-03-07 13:03:13.210383: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c132d00 of size 604928 next 7
2023-03-07 13:03:13.210394: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c1c6800 of size 777728 next 8
2023-03-07 13:03:13.210406: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c284600 of size 972032 next 9
2023-03-07 13:03:13.210418: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c371b00 of size 1188096 next 10
2023-03-07 13:03:13.210429: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c493c00 of size 3328 next 11
2023-03-07 13:03:13.210440: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c494900 of size 129792 next 12
2023-03-07 13:03:13.210452: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c4b4400 of size 216064 next 13
2023-03-07 13:03:13.210463: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c4e9000 of size 216064 next 14
2023-03-07 13:03:13.210475: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c51dc00 of size 324096 next 15
2023-03-07 13:03:13.210486: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c56ce00 of size 453632 next 16
2023-03-07 13:03:13.210497: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c5dba00 of size 604928 next 17
2023-03-07 13:03:13.210512: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c66f500 of size 777728 next 18
2023-03-07 13:03:13.210525: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c72d300 of size 972032 next 19
2023-03-07 13:03:13.210536: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c81a800 of size 1188096 next 20
2023-03-07 13:03:13.210547: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c93c900 of size 3328 next 21
[...]
2023-03-07 13:03:13.216636: I tensorflow/tsl/framework/bfc_allocator.cc:1104] total_region_allocated_bytes_: 62226432 memory_limit_: 69140480 available bytes: 6914048 curr_region_allocation_bytes_: 138280960
2023-03-07 13:03:13.216650: I tensorflow/tsl/framework/bfc_allocator.cc:1110] Stats: 
Limit:                        69140480
InUse:                        62226432
MaxInUse:                     62226432
NumAllocs:                         556
MaxAllocSize:                  1239552
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2023-03-07 13:03:13.216679: W tensorflow/tsl/framework/bfc_allocator.cc:492] ****************************************************************************************************
2023-03-07 13:03:13.216720: W tensorflow/core/framework/op_kernel.cc:1807] OP_REQUIRES failed at constant_op.cc:81 : RESOURCE_EXHAUSTED: OOM when allocating tensor of shape [100,90,3,3,3] and type float
[TRA|SAMPLER|PID:1932125] TIMING: Sampling for next [Training] lasted: 13.3 secs.
[TRA|SAMPLER|PID:1932125] :=:=:=:=:=:= Finished sampling for next [Training] =:=:=:=:=:=:

 ERROR: Caught exception in do_training(): Graph execution error:

OOM when allocating tensor of shape [100,90,3,3,3] and type float
     [[{{node gradients/trainer/Sum_12_grad/Tile/_2__cf__2}}]]

Traceback (most recent call last):
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1378, in _do_call
    return fn(*args)
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [100,90,3,3,3] and type float
     [[{{node gradients/trainer/Sum_12_grad/Tile/_2__cf__2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/projects/site/pred/gbm_pilot/BRAIN_METS/ROCMETS-74/DeepMedicPlus/DeepMedicPlus/deepmedic/routines/training.py", line 353, in do_training
    process_in_batches(log,
  File "/projects/site/pred/gbm_pilot/BRAIN_METS/ROCMETS-74/DeepMedicPlus/DeepMedicPlus/deepmedic/routines/training.py", line 62, in process_in_batches
    results_of_run = sessionTf.run(fetches=list_of_ops, feed_dict=feeds_dict)
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 968, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1371, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1397, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:

OOM when allocating tensor of shape [100,90,3,3,3] and type float
     [[{{node gradients/trainer/Sum_12_grad/Tile/_2__cf__2}}]]

Terminating worker pool.

=======================================================
=========== Training session finished =================
=======================================================
Finished.
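
A quick size check on the figures in the log above, as a rough sketch. The failing tensor has shape `[100,90,3,3,3]` and type float, and the allocator stats report `Limit: 69140480` and `InUse: 62226432`. The 4-byte float size and the 256-byte rounding are assumptions (the rounding is suggested by the `size 972032` chunks listed in the log), and the helper names are made up for illustration.

```python
from math import prod

BYTES_PER_FLOAT32 = 4    # assumption: "type float" means a 32-bit float
ALLOC_GRANULARITY = 256  # assumption: the allocator rounds up to 256-byte multiples


def tensor_bytes(shape, itemsize=BYTES_PER_FLOAT32):
    """Raw size in bytes of a dense tensor with the given shape."""
    return prod(shape) * itemsize


def round_up(n, multiple=ALLOC_GRANULARITY):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


def to_mib(n):
    """Convert a byte count to MiB."""
    return n / (1024 * 1024)


# Values copied from the log.
failed = tensor_bytes([100, 90, 3, 3, 3])
limit = 69140480
in_use = 62226432

print(f"failed tensor: {failed} bytes ({round_up(failed)} after rounding)")
print(f"allocator limit: {to_mib(limit):.1f} MiB, in use: {to_mib(in_use):.1f} MiB")
```

The failing request is under 1 MiB, while the allocator that rejected it was capped at roughly 66 MiB, far below the 40 GB on the card.
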
YixingHuang commented 1 year ago

Hi @damiankucharski, our GPU has 48 GB. Are you using the same batch size? You could try reducing the batch size from 10 to 8, for example, or even lower.

damiankucharski commented 1 year ago

Hi @YixingHuang, I managed to fix the issue without changing the batch size. It seems that TensorFlow allocates all available GPU memory by default, which can cause problems in some cases, such as mine. I have submitted a small PR with the code that solved the issue for me. Feel free to merge it if you find it useful: https://github.com/YixingHuang/DeepMedicPlus/pull/10
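
A minimal sketch of one way to stop TensorFlow from reserving GPU memory up front. It is not necessarily the change made in the PR: it relies on the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable, which has to be set before TensorFlow initialises the GPU. The function name `allow_gpu_memory_growth` is made up for illustration.

```python
import os


def allow_gpu_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it up front.

    Must be called before TensorFlow initialises the GPU, so in practice
    before the first `import tensorflow` in the process.
    Returns the previous value of the variable (None if it was unset).
    """
    previous = os.environ.get("TF_FORCE_GPU_ALLOW_GROWTH")
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return previous


allow_gpu_memory_growth()
# Import TensorFlow only after the variable is set, e.g.:
# import tensorflow as tf
```

Child processes started afterwards inherit the variable. The same behaviour can also be requested in code through `tf.config.experimental.set_memory_growth` or, for TF1-style sessions, through `gpu_options.allow_growth` on the session's `ConfigProto`.
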

YixingHuang commented 1 year ago

Good to know. PR has been approved. Thank you.

damiankucharski commented 1 year ago

Thank you, @YixingHuang. I think you have to merge it yourself; I do not have permission to do that.