Some training runs go out of memory even on a large GPU

Hi @YixingHuang, I am getting out-of-memory errors when training on slightly larger datasets. Some training runs complete, but many fail, seemingly at random. I have 225 training subjects, so the dataset is not that large, and I am using a 40 GB Nvidia A100 GPU, so memory should not be a problem. Do you think something in the code could be causing poor memory management? I am attaching the log of one of the failed training runs (truncated due to length).
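
One detail in the allocator statistics near the end of the log below may be relevant: the `Limit` TensorFlow reports is tiny compared with the card. Here is a minimal sketch of the arithmetic, with the byte counts copied from the `Stats` block of the log. The helper name `to_mib` and the reading of "40 GB" as 40 GiB are my own assumptions for illustration.

```python
# Byte counts copied verbatim from the BFC allocator "Stats" block in the log.
LIMIT_BYTES = 69_140_480    # "Limit" / memory_limit_
IN_USE_BYTES = 62_226_432   # "InUse"

# Assumption: treat the nominal 40 GB A100 as 40 GiB for a rough comparison.
GPU_BYTES = 40 * 1024**3


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / 1024**2


free_bytes = LIMIT_BYTES - IN_USE_BYTES
limit_share = LIMIT_BYTES / GPU_BYTES

print(f"Limit:  {to_mib(LIMIT_BYTES):.2f} MiB")
print(f"In use: {to_mib(IN_USE_BYTES):.2f} MiB")
print(f"Free:   {to_mib(free_bytes):.2f} MiB")
print(f"Limit as a share of the card: {limit_share:.4%}")
```

The difference comes out to 6914048 bytes, which matches the `available bytes` figure and the repeated `failed to allocate 6.59M` lines in the log. So in this run the allocator was working with roughly 66 MiB rather than anything close to 40 GB. That might mean most of the GPU memory was already taken when the session started, but I have not confirmed where it went.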

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~           Starting new Epoch! Epoch #0/50            ~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

***********************************************************************************
*            Starting new Subepoch: #0/20           *
***********************************************************************************
[MAIN|PID:1932125] MULTIPROC: Before Training in subepoch #0, submitting sampling job for next [TRAINING].
[TRA|SAMPLER|PID:1932125] :=:=:=:=:=:=: Starting to sample for next [Training]... :=:=:=:=:=:=:
[TRA|SAMPLER|PID:1932125] Out of [210] subjects given for [Training], we will sample from maximum [50] per subepoch.
[TRA|SAMPLER|PID:1932125] Shuffled indices of subjects that were randomly chosen: [62, 162, 207, 44, 68, 19, 107, 95, 184, 199, 106, 14, 57, 111, 99, 100, 83, 32, 36, 186, 12, 134, 124, 115, 73, 121, 181, 122, 9, 204, 26, 10, 37, 168, 174, 79, 126, 82, 51, 56, 91, 164, 143, 112, 76, 58, 41, 141, 142, 43]
[TRA|SAMPLER|PID:1932125] Will sample from [50] subjects for next Training...
[TRA|SAMPLER|PID:1932125] ******* Spawning children processes to sample from [50] subjects*******
[TRA|SAMPLER|PID:1932125] MULTIPR: Number of CPUs detected: 64. Requested to use max: [20]
[TRA|SAMPLER|PID:1932125] MULTIPR: Spawning [20] processes to load and sample.
[TRA|JOB:14|PID:1932175] Started. (#14/50) sampling job. Load & sample from subject of index (in user's list): 99
[TRA|JOB:14|PID:1932175] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 2305_VISIT 9.nii.gz
[TRA|JOB:14|PID:1932175] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:14|PID:1932175] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:14|PID:1932175] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:14|PID:1932175] TIMING: [Load: 4.0] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.1] [Augm-Samples: 0.1] secs
[TRA|JOB:36|PID:1932175] Started. (#36/50) sampling job. Load & sample from subject of index (in user's list): 126
[TRA|JOB:36|PID:1932175] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 5701_VISIT 2.nii.gz
[TRA|JOB:36|PID:1932175] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:36|PID:1932175] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:36|PID:1932175] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:36|PID:1932175] TIMING: [Load: 2.5] [Preproc: 0.7] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:13|PID:1932174] Started. (#13/50) sampling job. Load & sample from subject of index (in user's list): 111
[TRA|JOB:13|PID:1932174] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4345_VISIT 2.nii.gz
[TRA|JOB:13|PID:1932174] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:13|PID:1932174] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:13|PID:1932174] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:13|PID:1932174] TIMING: [Load: 4.4] [Preproc: 0.5] [Augm-Img: 0.0] [Sample Coords: 0.3] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:35|PID:1932174] Started. (#35/50) sampling job. Load & sample from subject of index (in user's list): 79
[TRA|JOB:35|PID:1932174] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1541_POST-TREATMENT VISIT.nii.gz
[TRA|JOB:35|PID:1932174] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:35|PID:1932174] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:35|PID:1932174] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:35|PID:1932174] TIMING: [Load: 3.0] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:15|PID:1932176] Started. (#15/50) sampling job. Load & sample from subject of index (in user's list): 100
[TRA|JOB:15|PID:1932176] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1121_UNSCHEDULED 1.nii.gz
[TRA|JOB:15|PID:1932176] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:15|PID:1932176] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:15|PID:1932176] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:15|PID:1932176] TIMING: [Load: 3.7] [Preproc: 0.3] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:32|PID:1932176] Started. (#32/50) sampling job. Load & sample from subject of index (in user's list): 37
[TRA|JOB:32|PID:1932176] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 7843_VISIT 2.nii.gz
[TRA|JOB:32|PID:1932176] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:32|PID:1932176] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:32|PID:1932176] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:32|PID:1932176] TIMING: [Load: 3.6] [Preproc: 0.2] [Augm-Img: 0.0] [Sample Coords: 0.3] [Extract Sampl: 0.0] [Augm-Samples: 0.1] secs
[TRA|JOB:16|PID:1932177] Started. (#16/50) sampling job. Load & sample from subject of index (in user's list): 83
[TRA|JOB:16|PID:1932177] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 8421_VISIT 4.nii.gz
[TRA|JOB:16|PID:1932177] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:16|PID:1932177] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:16|PID:1932177] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:16|PID:1932177] TIMING: [Load: 2.9] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.6] [Extract Sampl: 0.0] [Augm-Samples: 0.1] secs
[TRA|JOB:31|PID:1932177] Started. (#31/50) sampling job. Load & sample from subject of index (in user's list): 10
[TRA|JOB:31|PID:1932177] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 5621_SCREENING.nii.gz
[TRA|JOB:31|PID:1932177] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:31|PID:1932177] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:31|PID:1932177] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:31|PID:1932177] TIMING: [Load: 3.2] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:7|PID:1932168] Started. (#7/50) sampling job. Load & sample from subject of index (in user's list): 95
[TRA|JOB:7|PID:1932168] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4344_VISIT 10.nii.gz
[TRA|JOB:7|PID:1932168] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:7|PID:1932168] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:7|PID:1932168] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:7|PID:1932168] TIMING: [Load: 2.8] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:27|PID:1932168] Started. (#27/50) sampling job. Load & sample from subject of index (in user's list): 122
[TRA|JOB:27|PID:1932168] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1502_VISIT 16.nii.gz
[TRA|JOB:27|PID:1932168] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:27|PID:1932168] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:27|PID:1932168] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:27|PID:1932168] TIMING: [Load: 3.8] [Preproc: 0.5] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:11|PID:1932172] Started. (#11/50) sampling job. Load & sample from subject of index (in user's list): 14
[TRA|JOB:11|PID:1932172] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 3963_VISIT 3.nii.gz
[TRA|JOB:11|PID:1932172] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:11|PID:1932172] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:11|PID:1932172] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:11|PID:1932172] TIMING: [Load: 4.0] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.5] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:37|PID:1932172] Started. (#37/50) sampling job. Load & sample from subject of index (in user's list): 82
[TRA|JOB:37|PID:1932172] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1763_VISIT 3.nii.gz
[TRA|JOB:37|PID:1932172] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:37|PID:1932172] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:37|PID:1932172] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:37|PID:1932172] TIMING: [Load: 2.7] [Preproc: 0.1] [Augm-Img: 0.0] [Sample Coords: 0.5] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:0|PID:1932161] Started. (#0/50) sampling job. Load & sample from subject of index (in user's list): 62
[TRA|JOB:0|PID:1932161] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4501_SCREENING.nii.gz
[TRA|JOB:0|PID:1932161] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:0|PID:1932161] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:0|PID:1932161] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:0|PID:1932161] TIMING: [Load: 2.1] [Preproc: 0.3] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:21|PID:1932161] Started. (#21/50) sampling job. Load & sample from subject of index (in user's list): 134
[TRA|JOB:21|PID:1932161] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 8421_VISIT 8.nii.gz
[TRA|JOB:21|PID:1932161] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:21|PID:1932161] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:21|PID:1932161] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:21|PID:1932161] TIMING: [Load: 2.9] [Preproc: 0.5] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:40|PID:1932161] Started. (#40/50) sampling job. Load & sample from subject of index (in user's list): 91
[TRA|JOB:40|PID:1932161] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 7605_VISIT 4.nii.gz
[TRA|JOB:40|PID:1932161] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:40|PID:1932161] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:40|PID:1932161] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:40|PID:1932161] TIMING: [Load: 2.9] [Preproc: 0.3] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:4|PID:1932165] Started. (#4/50) sampling job. Load & sample from subject of index (in user's list): 68
[TRA|JOB:4|PID:1932165] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 2305_SCREENING.nii.gz
[TRA|JOB:4|PID:1932165] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:4|PID:1932165] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:4|PID:1932165] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:4|PID:1932165] TIMING: [Load: 4.6] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:38|PID:1932165] Started. (#38/50) sampling job. Load & sample from subject of index (in user's list): 51
[TRA|JOB:38|PID:1932165] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4102_VISIT 3.nii.gz
[TRA|JOB:38|PID:1932165] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:38|PID:1932165] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:38|PID:1932165] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:38|PID:1932165] TIMING: [Load: 3.6] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:3|PID:1932164] Started. (#3/50) sampling job. Load & sample from subject of index (in user's list): 44
[TRA|JOB:3|PID:1932164] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4344_VISIT 6.nii.gz
[TRA|JOB:3|PID:1932164] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:3|PID:1932164] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:3|PID:1932164] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:3|PID:1932164] TIMING: [Load: 3.6] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.6] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:34|PID:1932164] Started. (#34/50) sampling job. Load & sample from subject of index (in user's list): 174
[TRA|JOB:34|PID:1932164] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 3963_POST-PROGRESSION VISIT 1.nii.gz
[TRA|JOB:34|PID:1932164] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:34|PID:1932164] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:34|PID:1932164] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:34|PID:1932164] TIMING: [Load: 4.0] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.3] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:9|PID:1932170] Started. (#9/50) sampling job. Load & sample from subject of index (in user's list): 199
[TRA|JOB:9|PID:1932170] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1561_VISIT 7.nii.gz
[TRA|JOB:9|PID:1932170] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:9|PID:1932170] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:9|PID:1932170] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:9|PID:1932170] TIMING: [Load: 4.2] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.3] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:33|PID:1932170] Started. (#33/50) sampling job. Load & sample from subject of index (in user's list): 168
[TRA|JOB:33|PID:1932170] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 4344_VISIT 5.nii.gz
[TRA|JOB:33|PID:1932170] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:33|PID:1932170] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:33|PID:1932170] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:33|PID:1932170] TIMING: [Load: 4.3] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:5|PID:1932166] Started. (#5/50) sampling job. Load & sample from subject of index (in user's list): 19
[TRA|JOB:5|PID:1932166] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 7623_VISIT 2.nii.gz
[TRA|JOB:5|PID:1932166] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:5|PID:1932166] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:5|PID:1932166] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:5|PID:1932166] TIMING: [Load: 1.8] [Preproc: 0.2] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:20|PID:1932166] Started. (#20/50) sampling job. Load & sample from subject of index (in user's list): 12
[TRA|JOB:20|PID:1932166] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 5223_VISIT 4.nii.gz
[TRA|JOB:20|PID:1932166] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:20|PID:1932166] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:20|PID:1932166] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:20|PID:1932166] TIMING: [Load: 3.8] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:42|PID:1932166] Started. (#42/50) sampling job. Load & sample from subject of index (in user's list): 143
[TRA|JOB:42|PID:1932166] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 1523_VISIT 2.nii.gz
[TRA|JOB:42|PID:1932166] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:42|PID:1932166] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:42|PID:1932166] WARN: Invalid sampling category! Sampling map just zeros! No [Class-1] samples from this subject!
[TRA|JOB:42|PID:1932166] Done. Samples per category: [Class-0: 20/20] 
[TRA|JOB:42|PID:1932166] TIMING: [Load: 3.0] [Preproc: 0.2] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:12|PID:1932173] Started. (#12/50) sampling job. Load & sample from subject of index (in user's list): 57
[TRA|JOB:12|PID:1932173] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 3607_VISIT 11.nii.gz
[TRA|JOB:12|PID:1932173] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:12|PID:1932173] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:12|PID:1932173] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:12|PID:1932173] TIMING: [Load: 3.0] [Preproc: 0.3] [Augm-Img: 0.0] [Sample Coords: 0.4] [Extract Sampl: 0.0] [Augm-Samples: 0.1] secs
[TRA|JOB:26|PID:1932173] Started. (#26/50) sampling job. Load & sample from subject of index (in user's list): 181
[TRA|JOB:26|PID:1932173] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 2862_SCREENING.nii.gz
[TRA|JOB:26|PID:1932173] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:26|PID:1932173] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:26|PID:1932173] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:26|PID:1932173] TIMING: [Load: 2.4] [Preproc: 0.2] [Augm-Img: 0.0] [Sample Coords: 0.5] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:45|PID:1932173] Started. (#45/50) sampling job. Load & sample from subject of index (in user's list): 58
[TRA|JOB:45|PID:1932173] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 3961_VISIT 2.nii.gz
[TRA|JOB:45|PID:1932173] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:45|PID:1932173] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:45|PID:1932173] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:45|PID:1932173] TIMING: [Load: 2.5] [Preproc: 0.2] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:2|PID:1932163] Started. (#2/50) sampling job. Load & sample from subject of index (in user's list): 207
[TRA|JOB:2|PID:1932163] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 3961_SCREENING.nii.gz
[TRA|JOB:2|PID:1932163] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:2|PID:1932163] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:2|PID:1932163] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:2|PID:1932163] TIMING: [Load: 5.0] [Preproc: 0.6] [Augm-Img: 0.0] [Sample Coords: 0.5] [Extract Sampl: 0.1] [Augm-Samples: 0.0] secs
[TRA|JOB:39|PID:1932163] Started. (#39/50) sampling job. Load & sample from subject of index (in user's list): 56
[TRA|JOB:39|PID:1932163] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 5701_VISIT 6.nii.gz
[TRA|JOB:39|PID:1932163] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:39|PID:1932163] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:39|PID:1932163] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:39|PID:1932163] TIMING: [Load: 3.2] [Preproc: 0.4] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:10|PID:1932171] Started. (#10/50) sampling job. Load & sample from subject of index (in user's list): 106
[TRA|JOB:10|PID:1932171] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 8421_VISIT 7.nii.gz
[TRA|JOB:10|PID:1932171] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:10|PID:1932171] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[TRA|JOB:10|PID:1932171] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:10|PID:1932171] TIMING: [Load: 2.7] [Preproc: 0.1] [Augm-Img: 0.0] [Sample Coords: 0.2] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
[TRA|JOB:25|PID:1932171] Started. (#25/50) sampling job. Load & sample from subject of index (in user's list): 121
[TRA|JOB:25|PID:1932171] Loading subject with 1st channel at: /home/ROCMETS-90/DATA/SCANS_PREPROCESSED_TRAIN/PATIENT 2305_VISIT 7.nii.gz
[TRA|JOB:25|PID:1932171] WARN: Loaded labels are dtype [float64]. Rounding and casting to [int16]!
[TRA|JOB:25|PID:1932171] WARN: Loaded ROI-mask is dtype [float64]. Rounding and casting to [int16]!
[0.5 0.5]
[...]
[TRA|JOB:47|PID:1932211] Done. Samples per category: [Class-0: 10/10] [Class-1: 10/10] 
[TRA|JOB:47|PID:1932211] TIMING: [Load: 3.0] [Preproc: 0.1] [Augm-Img: 0.0] [Sample Coords: 0.1] [Extract Sampl: 0.0] [Augm-Samples: 0.0] secs
2023-03-07 13:03:03.204001: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:735] failed to allocate 6.59M (6914048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-03-07 13:03:03.206020: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:735] failed to allocate 6.59M (6914048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-03-07 13:03:13.208024: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:735] failed to allocate 6.59M (6914048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-03-07 13:03:13.209956: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:735] failed to allocate 6.59M (6914048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-03-07 13:03:13.209996: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 949.2KiB (rounded to 972032)requested by op gradients/trainer/Sum_12_grad/Tile/_2__cf__2
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-03-07 13:03:13.210015: I tensorflow/tsl/framework/bfc_allocator.cc:1034] BFCAllocator dump for GPU_0_bfc
2023-03-07 13:03:13.210030: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (256):    Total Chunks: 210, Chunks in use: 210. 52.5KiB allocated for chunks. 52.5KiB in use in bin. 22.9KiB client-requested in use in bin.
2023-03-07 13:03:13.210043: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (512):    Total Chunks: 128, Chunks in use: 128. 64.0KiB allocated for chunks. 64.0KiB in use in bin. 42.5KiB client-requested in use in bin.
2023-03-07 13:03:13.210055: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (1024):   Total Chunks: 25, Chunks in use: 25. 35.2KiB allocated for chunks. 35.2KiB in use in bin. 33.5KiB client-requested in use in bin.
2023-03-07 13:03:13.210068: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (2048):   Total Chunks: 19, Chunks in use: 19. 58.0KiB allocated for chunks. 58.0KiB in use in bin. 56.0KiB client-requested in use in bin.
2023-03-07 13:03:13.210080: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (4096):   Total Chunks: 8, Chunks in use: 8. 58.0KiB allocated for chunks. 58.0KiB in use in bin. 56.2KiB client-requested in use in bin.
2023-03-07 13:03:13.210092: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (8192):   Total Chunks: 24, Chunks in use: 24. 284.0KiB allocated for chunks. 284.0KiB in use in bin. 281.2KiB client-requested in use in bin.
2023-03-07 13:03:13.210104: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (16384):  Total Chunks: 32, Chunks in use: 32. 640.0KiB allocated for chunks. 640.0KiB in use in bin. 637.5KiB client-requested in use in bin.
2023-03-07 13:03:13.210120: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (32768):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210134: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (65536):  Total Chunks: 18, Chunks in use: 18. 1.96MiB allocated for chunks. 1.96MiB in use in bin. 1.96MiB client-requested in use in bin.
2023-03-07 13:03:13.210147: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (131072):     Total Chunks: 12, Chunks in use: 12. 2.47MiB allocated for chunks. 2.47MiB in use in bin. 2.47MiB client-requested in use in bin.
2023-03-07 13:03:13.210159: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (262144):     Total Chunks: 27, Chunks in use: 27. 9.93MiB allocated for chunks. 9.93MiB in use in bin. 9.93MiB client-requested in use in bin.
2023-03-07 13:03:13.210171: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (524288):     Total Chunks: 40, Chunks in use: 40. 29.04MiB allocated for chunks. 29.04MiB in use in bin. 29.03MiB client-requested in use in bin.
2023-03-07 13:03:13.210184: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (1048576):    Total Chunks: 13, Chunks in use: 13. 14.78MiB allocated for chunks. 14.78MiB in use in bin. 14.34MiB client-requested in use in bin.
2023-03-07 13:03:13.210195: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (2097152):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210207: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (4194304):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210218: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (8388608):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210229: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (16777216):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210240: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (33554432):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210252: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (67108864):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210263: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (134217728):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210274: I tensorflow/tsl/framework/bfc_allocator.cc:1041] Bin (268435456):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-03-07 13:03:13.210287: I tensorflow/tsl/framework/bfc_allocator.cc:1057] Bin for 949.2KiB was 512.0KiB, Chunk State: 
2023-03-07 13:03:13.210298: I tensorflow/tsl/framework/bfc_allocator.cc:1070] Next region of size 62226432
2023-03-07 13:03:13.210312: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c000000 of size 3328 next 1
2023-03-07 13:03:13.210325: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c000d00 of size 129792 next 2
2023-03-07 13:03:13.210337: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c020800 of size 129792 next 3
2023-03-07 13:03:13.210348: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c040300 of size 216064 next 4
2023-03-07 13:03:13.210360: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c074f00 of size 324096 next 5
2023-03-07 13:03:13.210371: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c0c4100 of size 453632 next 6
2023-03-07 13:03:13.210383: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c132d00 of size 604928 next 7
2023-03-07 13:03:13.210394: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c1c6800 of size 777728 next 8
2023-03-07 13:03:13.210406: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c284600 of size 972032 next 9
2023-03-07 13:03:13.210418: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c371b00 of size 1188096 next 10
2023-03-07 13:03:13.210429: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c493c00 of size 3328 next 11
2023-03-07 13:03:13.210440: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c494900 of size 129792 next 12
2023-03-07 13:03:13.210452: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c4b4400 of size 216064 next 13
2023-03-07 13:03:13.210463: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c4e9000 of size 216064 next 14
2023-03-07 13:03:13.210475: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c51dc00 of size 324096 next 15
2023-03-07 13:03:13.210486: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c56ce00 of size 453632 next 16
2023-03-07 13:03:13.210497: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c5dba00 of size 604928 next 17
2023-03-07 13:03:13.210512: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c66f500 of size 777728 next 18
2023-03-07 13:03:13.210525: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c72d300 of size 972032 next 19
2023-03-07 13:03:13.210536: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c81a800 of size 1188096 next 20
2023-03-07 13:03:13.210547: I tensorflow/tsl/framework/bfc_allocator.cc:1090] InUse at 149e0c93c900 of size 3328 next 21
[...]
2023-03-07 13:03:13.216636: I tensorflow/tsl/framework/bfc_allocator.cc:1104] total_region_allocated_bytes_: 62226432 memory_limit_: 69140480 available bytes: 6914048 curr_region_allocation_bytes_: 138280960
2023-03-07 13:03:13.216650: I tensorflow/tsl/framework/bfc_allocator.cc:1110] Stats: 
Limit:                        69140480
InUse:                        62226432
MaxInUse:                     62226432
NumAllocs:                         556
MaxAllocSize:                  1239552
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2023-03-07 13:03:13.216679: W tensorflow/tsl/framework/bfc_allocator.cc:492] ****************************************************************************************************
2023-03-07 13:03:13.216720: W tensorflow/core/framework/op_kernel.cc:1807] OP_REQUIRES failed at constant_op.cc:81 : RESOURCE_EXHAUSTED: OOM when allocating tensor of shape [100,90,3,3,3] and type float
[TRA|SAMPLER|PID:1932125] TIMING: Sampling for next [Training] lasted: 13.3 secs.
[TRA|SAMPLER|PID:1932125] :=:=:=:=:=:= Finished sampling for next [Training] =:=:=:=:=:=:

 ERROR: Caught exception in do_training(): Graph execution error:

OOM when allocating tensor of shape [100,90,3,3,3] and type float
     [[{{node gradients/trainer/Sum_12_grad/Tile/_2__cf__2}}]]

Traceback (most recent call last):
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1378, in _do_call
    return fn(*args)
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [100,90,3,3,3] and type float
     [[{{node gradients/trainer/Sum_12_grad/Tile/_2__cf__2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/projects/site/pred/gbm_pilot/BRAIN_METS/ROCMETS-74/DeepMedicPlus/DeepMedicPlus/deepmedic/routines/training.py", line 353, in do_training
    process_in_batches(log,
  File "/projects/site/pred/gbm_pilot/BRAIN_METS/ROCMETS-74/DeepMedicPlus/DeepMedicPlus/deepmedic/routines/training.py", line 62, in process_in_batches
    results_of_run = sessionTf.run(fetches=list_of_ops, feed_dict=feeds_dict)
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 968, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1371, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/kucharsd/.local/share/jupyter/3.4.2/lib/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1397, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:

OOM when allocating tensor of shape [100,90,3,3,3] and type float
     [[{{node gradients/trainer/Sum_12_grad/Tile/_2__cf__2}}]]

Terminating worker pool.

=======================================================
=========== Training session finished =================
=======================================================
Finished.
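
A side note on the numbers in the allocator dump above: the short sketch below just redoes the arithmetic. It assumes that `float` means 4-byte float32 and that the allocator rounds requests up to a multiple of 256 bytes. The helper names `tensor_bytes` and `round_up` are made up for illustration, and the byte counts are copied from the `Stats:` block. Under those assumptions the failing tensor is under 1 MB, while the reported `Limit` is roughly 66 MiB, far below the 40 GB of the card. That might suggest the process only had a small slice of GPU memory available (for example because another process was holding the rest, or because of a memory-fraction setting), but this is a guess from the log, not a confirmed diagnosis.

```python
# Values copied from the "Stats:" block of the log above.
limit_bytes = 69_140_480
in_use_bytes = 62_226_432

# Tensor that failed to allocate: shape [100,90,3,3,3], type float.
failed_shape = (100, 90, 3, 3, 3)
BYTES_PER_FLOAT32 = 4  # assumption: "float" in the log is float32


def tensor_bytes(shape, itemsize):
    """Raw size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


def round_up(nbytes, multiple=256):
    """Round a request up to the next multiple (assumed allocator granularity)."""
    return ((nbytes + multiple - 1) // multiple) * multiple


requested = tensor_bytes(failed_shape, BYTES_PER_FLOAT32)
rounded = round_up(requested)
remaining = limit_bytes - in_use_bytes
gpu_bytes = 40 * 1024**3  # treating the 40GB A100 as 40 GiB for a rough comparison

print(f"requested tensor: {requested} bytes, rounded: {rounded} bytes")
print(f"allocator limit:  {limit_bytes / 1024**2:.2f} MiB")
print(f"limit minus in-use: {remaining} bytes")
print(f"limit as share of GPU: {100 * limit_bytes / gpu_bytes:.2f} %")
```

The rounded size agrees with the `size 972032` chunks listed in the dump, and the difference between `Limit` and `InUse` agrees with the `available bytes` value on the line before the stats.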