MIC-DKFZ / medicaldetectiontoolkit

The Medical Detection Toolkit contains 2D + 3D implementations of prevalent object detectors such as Mask R-CNN, Retina Net, Retina U-Net, as well as a training and inference framework focused on dealing with medical images.
Apache License 2.0

HELP! Error during training LIDC dataset #35

Closed ivanwilliammd closed 5 years ago

ivanwilliammd commented 5 years ago

Hello Sir Paul, I have already converted the LIDC database. However, after I run `python exec.py --mode train --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-model`, the training gets stuck (it shows "validate") on fold 1. Note: I changed num_epoch to 50 and num_trainbatches to 10, since I only use a 10-sample dataset.

CLI message:

```
starting training epoch 50
tr. batch 1/10 (ep. 50) fw 2.251s / bw 0.743s / total 2.993s || loss: 1.03, class: 0.89, bbox: 0.14
tr. batch 2/10 (ep. 50) fw 2.532s / bw 0.744s / total 3.276s || loss: 0.89, class: 0.66, bbox: 0.23
tr. batch 3/10 (ep. 50) fw 2.392s / bw 0.742s / total 3.134s || loss: 0.74, class: 0.73, bbox: 0.01
tr. batch 4/10 (ep. 50) fw 2.535s / bw 0.517s / total 3.053s || loss: 0.47, class: 0.47, bbox: 0.00
tr. batch 5/10 (ep. 50) fw 3.106s / bw 0.744s / total 3.850s || loss: 0.78, class: 0.71, bbox: 0.08
tr. batch 6/10 (ep. 50) fw 2.920s / bw 0.742s / total 3.662s || loss: 0.52, class: 0.49, bbox: 0.03
tr. batch 7/10 (ep. 50) fw 2.220s / bw 0.747s / total 2.967s || loss: 0.67, class: 0.56, bbox: 0.11
tr. batch 8/10 (ep. 50) fw 2.164s / bw 0.758s / total 2.921s || loss: 0.57, class: 0.51, bbox: 0.06
tr. batch 9/10 (ep. 50) fw 2.333s / bw 0.750s / total 3.082s || loss: 0.80, class: 0.70, bbox: 0.10
tr. batch 10/10 (ep. 50) fw 2.390s / bw 0.760s / total 3.150s || loss: 0.70, class: 0.66, bbox: 0.03
evaluating in mode train
evaluating with match_iou: 0.1
starting validation in mode val_sampling.
evaluating in mode val_sampling
evaluating with match_iou: 0.1
non none scores: [0.00000000e+00 0.00000000e+00 0.00000000e+00 1.33691776e-04 1.12577370e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 3.19541394e-06 0.00000000e+00 0.00000000e+00 0.00000000e+00 6.34394073e-05 3.46760788e-04 0.00000000e+00 6.57964466e-05 6.30265885e-06 1.83419772e-04 0.00000000e+00 0.00000000e+00 3.13401814e-05 0.00000000e+00 0.00000000e+00 8.20894272e-05 4.21034540e-06 1.00719716e-03 7.65382661e-07 1.39219383e-05 7.98896203e-04 0.00000000e+00 2.30329873e-04 2.08085640e-04 1.10898187e-06 0.00000000e+00 0.00000000e+00 1.11219310e-05 1.91517091e-04 1.70706726e-04 1.07269665e-06 0.00000000e+00 0.00000000e+00 4.47997328e-05 0.00000000e+00 1.04838946e-06 1.86664529e-03 5.89871320e-06 1.97787268e-04]
trained epoch 50: took 212.29711294174194 sec. (41.897600412368774 train / 170.39951252937317 val)
plotting predictions from validation sampling.
starting testing model of fold 0 in exp LIDC-Retina-TrainTest
feature map shapes: [[32 32 64] [16 16 32] [ 8 8 16] [ 4 4 8]]
anchor scales: {'z': [[2, 2.5198420997897464, 3.1748021039363987], [4, 5.039684199579493, 6.3496042078727974], [8, 10.079368399158986, 12.699208415745595], [16, 20.15873679831797, 25.39841683149119]], 'xy': [[8, 10.079368399158986, 12.699208415745595], [16, 20.15873679831797, 25.39841683149119], [32, 40.31747359663594, 50.79683366298238], [64, 80.63494719327188, 101.59366732596476]]}
level 0: built anchors (589824, 6) / expected anchors 589824 ||| total build (589824, 6) / total expected 673920
level 1: built anchors (73728, 6) / expected anchors 73728 ||| total build (663552, 6) / total expected 673920
level 2: built anchors (9216, 6) / expected anchors 9216 ||| total build (672768, 6) / total expected 673920
level 3: built anchors (1152, 6) / expected anchors 1152 ||| total build (673920, 6) / total expected 673920
using default pytorch weight init
subset: selected 2 instances from df
data set loaded with: 2 test patients
tmp ensembling over rank_ix:0 epoch:LIDC-Retina-TrainTest/fold_0/48_best_params.pth
evaluating patient 0009a for fold 0
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
evaluating patient 0003a for fold 0
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
tmp ensembling over rank_ix:1 epoch:LIDC-Retina-TrainTest/fold_0/29_best_params.pth
evaluating patient 0009a for fold 0
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
evaluating patient 0003a for fold 0
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
tmp ensembling over rank_ix:2 epoch:LIDC-Retina-TrainTest/fold_0/32_best_params.pth
evaluating patient 0009a for fold 0
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
evaluating patient 0003a for fold 0
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
tmp ensembling over rank_ix:3 epoch:LIDC-Retina-TrainTest/fold_0/17_best_params.pth
evaluating patient 0009a for fold 0
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
evaluating patient 0003a for fold 0
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
tmp ensembling over rank_ix:4 epoch:LIDC-Retina-TrainTest/fold_0/34_best_params.pth
evaluating patient 0009a for fold 0
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
forwarding (patched) patient with shape: (180, 1, 128, 128, 64)
evaluating patient 0003a for fold 0
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
forwarding (patched) patient with shape: (216, 1, 128, 128, 64)
finished predicting test set.
starting post-processing of predictions.
applying wcs to test set predictions with iou = 1e-05 and n_ens = 20.
applying 2Dto3D merging to test set predictions with iou = 0.1.
evaluating in mode test
evaluating with match_iou: 0.1
/home/ivan/.virtualenvs/virtual-py3/lib/python3.5/site-packages/numpy/core/fromnumeric.py:2920: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/home/ivan/.virtualenvs/virtual-py3/lib/python3.5/site-packages/numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/ivan/.virtualenvs/virtual-py3/lib/python3.5/site-packages/matplotlib/axes/_base.py:3364: UserWarning: Attempting to set identical bottom==top results in singular transformations; automatically expanding. bottom=1.0, top=1.0
  self.set_ylim(upper, lower, auto=None)
Logging to LIDC-Retina-TrainTest/fold_1/exec.log
performing training in 3D over fold 1 on experiment LIDC-Retina-TrainTest with model retina_net
feature map shapes: [[32 32 64] [16 16 32] [ 8 8 16] [ 4 4 8]]
anchor scales: {'z': [[2, 2.5198420997897464, 3.1748021039363987], [4, 5.039684199579493, 6.3496042078727974], [8, 10.079368399158986, 12.699208415745595], [16, 20.15873679831797, 25.39841683149119]], 'xy': [[8, 10.079368399158986, 12.699208415745595], [16, 20.15873679831797, 25.39841683149119], [32, 40.31747359663594, 50.79683366298238], [64, 80.63494719327188, 101.59366732596476]]}
level 0: built anchors (589824, 6) / expected anchors 589824 ||| total build (589824, 6) / total expected 673920
level 1: built anchors (73728, 6) / expected anchors 73728 ||| total build (663552, 6) / total expected 673920
level 2: built anchors (9216, 6) / expected anchors 9216 ||| total build (672768, 6) / total expected 673920
level 3: built anchors (1152, 6) / expected anchors 1152 ||| total build (673920, 6) / total expected 673920
using default pytorch weight init
loading dataset and initializing batch generators...
data set loaded with: 6 train / 2 val / 2 test patients
starting training epoch 1
tr. batch 1/10 (ep. 1) fw 1.901s / bw 0.557s / total 2.458s || loss: 0.55, class: 0.55, bbox: 0.00
tr. batch 2/10 (ep. 1) fw 2.057s / bw 0.777s / total 2.834s || loss: 0.77, class: 0.69, bbox: 0.08
tr. batch 3/10 (ep. 1) fw 1.838s / bw 0.515s / total 2.353s || loss: 0.77, class: 0.77, bbox: 0.00
tr. batch 4/10 (ep. 1) fw 1.803s / bw 0.741s / total 2.544s || loss: 0.94, class: 0.83, bbox: 0.11
tr. batch 5/10 (ep. 1) fw 1.717s / bw 0.741s / total 2.458s || loss: 0.85, class: 0.76, bbox: 0.09
tr. batch 6/10 (ep. 1) fw 1.654s / bw 0.744s / total 2.398s || loss: 1.07, class: 0.90, bbox: 0.17
tr. batch 7/10 (ep. 1) fw 2.217s / bw 0.742s / total 2.959s || loss: 0.80, class: 0.69, bbox: 0.11
tr. batch 8/10 (ep. 1) fw 1.733s / bw 0.740s / total 2.473s || loss: 0.80, class: 0.69, bbox: 0.12
tr. batch 9/10 (ep. 1) fw 1.709s / bw 0.750s / total 2.459s || loss: 1.07, class: 0.89, bbox: 0.18
tr. batch 10/10 (ep. 1) fw 2.189s / bw 0.743s / total 2.932s || loss: 1.06, class: 0.89, bbox: 0.17
evaluating in mode train
evaluating with match_iou: 0.1
starting validation in mode val_sampling.
```

It has been stuck at "starting validation" for more than 4 hours. Please help me. Thank you in advance, Sir.

pfjaeger commented 5 years ago

Hi, I guess you are stuck in line 225 of the dataloader, because your validation set in fold 1 does not contain images of all classes. Right now, at least one image per class is required in every training and validation split.

I will catch this error in the next commit.

Why are you running the code on only 10 patients, though?
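One way to confirm this is the problem is to check, before training, that every split of every fold contains each class at least once. A minimal sketch of such a check, with hypothetical data structures (this is illustration only, not the toolkit's actual code):

```python
# Hypothetical sanity check (not part of the toolkit): verify that every
# train/val split of every fold contains each class at least once, so a
# class-balanced batch sampler cannot search forever for a class that is
# absent from the split.
def check_splits(fold_splits, patient_classes, all_classes):
    """fold_splits: {fold: {'train': [pids], 'val': [pids]}} (hypothetical layout).
    patient_classes: {pid: set of class ids present in that patient}."""
    for fold, splits in fold_splits.items():
        for name, pids in splits.items():
            # collect all classes present anywhere in this split
            present = set().union(*(patient_classes[p] for p in pids))
            missing = set(all_classes) - present
            if missing:
                print('fold {}: {} split is missing classes {}'.format(
                    fold, name, sorted(missing)))
```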


ivanwilliammd commented 5 years ago

Thank you for your answer, Sir.

I am medical science graduate and currently on my computer science postgraduate studies (master degree), and I am planning to create some nodule detection using private lung CT scan dataset given to me by my uni. However, the data given to me are in DCM format with x, y, z coordinate instead of metadata information like LIDC. May I ask for your recommendation how to do this datasets?
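For reference, handling such data usually means mapping the annotations' world (x, y, z) coordinates into voxel indices of the loaded volume. A minimal sketch, assuming SimpleITK, with a made-up path and annotation (hypothetical, not from this thread):

```python
# Hypothetical sketch: load a DICOM series with SimpleITK and convert a
# world-coordinate (x, y, z) annotation in mm into a voxel index, which is
# what a voxel-based detection pipeline needs.
import SimpleITK as sitk

reader = sitk.ImageSeriesReader()
reader.SetFileNames(reader.GetGDCMSeriesFileNames('/path/to/dicom_series'))  # hypothetical path
image = reader.Execute()

world_xyz = (-82.4, 31.7, -145.0)  # made-up annotation in scanner (physical) coordinates
voxel_idx = image.TransformPhysicalPointToIndex(world_xyz)
print(voxel_idx)  # (i, j, k) voxel coordinates within the loaded volume
```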

As for the reason why I used only 10 datasets and 10 epochs: firstly, I want to try combining my own private CT-scan dataset with the architecture, but before that I want to check what kind of output the 3D RetinaNet architecture will produce. Secondly, I got permission from my institution to use an NVIDIA Tesla P100 16GB, so I think that for 10 images I will need only 10 epochs.

Could you give me a hint about line 255 of the dataloader, since I am quite new to PyTorch?

```python
# if set to not None, add neighbouring slices to each selected slice in channel dimension.
if self.cf.n_3D_context is not None:
    # pad the z-axis by n_3D_context on each side so border slices also get full context
    padded_data = dutils.pad_nd_image(data[0], [(data.shape[-1] + (self.cf.n_3D_context*2))], mode='constant')
    padded_slice_id = slice_id + self.cf.n_3D_context
    # stack the selected slice and its n_3D_context neighbours on each side as channels
    data = (np.concatenate([padded_data[..., ii][np.newaxis] for ii in range(
        padded_slice_id - self.cf.n_3D_context, padded_slice_id + self.cf.n_3D_context + 1)], axis=0))
else:
    data = data[..., slice_id]
seg = seg[..., slice_id]
```
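For intuition, here is a toy reproduction of what that block does, with made-up shapes and `np.pad` standing in for `dutils.pad_nd_image`: with `n_3D_context = 1`, the selected slice and one neighbour on each side are stacked into the channel dimension.

```python
import numpy as np

# Toy volume: (channels, y, x, z) with 5 slices along z, as in the 2D loader.
data = np.random.rand(1, 8, 8, 5)
n_ctx, slice_id = 1, 0  # first slice: the pad supplies the missing left neighbour

# pad only the z-axis by n_ctx on each side (np.pad in place of dutils.pad_nd_image)
padded = np.pad(data[0], ((0, 0), (0, 0), (n_ctx, n_ctx)), mode='constant')
padded_slice_id = slice_id + n_ctx
stacked = np.concatenate([padded[..., ii][np.newaxis]
                          for ii in range(padded_slice_id - n_ctx,
                                          padded_slice_id + n_ctx + 1)], axis=0)
print(stacked.shape)  # (3, 8, 8): the neighbouring slices have become input channels
```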

Thank you for your answer, Sir Paul.

pfjaeger commented 5 years ago

It’s line 225 ...


ivanwilliammd commented 5 years ago

> It’s line 225 ...

Thank you, Sir. I will try increasing my patient data first.

ivanwilliammd commented 5 years ago

I'm sorry, Sir, may I ask whether this also affects the manual `--folds` command?

`python exec.py --mode train_test --folds 0 --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest` works smoothly, but

```
python exec.py --mode train_test --folds 1 --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest
python exec.py --mode train_test --folds 2 --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest
python exec.py --mode train_test --folds 3 --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest
python exec.py --mode train_test --folds 4 --exp_source experiments/lidc_exp/ --exp_dir LIDC-Retina-TrainTest
```

all get stuck after the first epoch. @pfjaeger, to solve this error, is it right that I just need to increase the patient dataset first? Thank you, Sir.
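For illustration of how fold 0 can pass while the other folds hang, here is a hypothetical sketch (scikit-learn's `KFold` stands in for the toolkit's own split generator): with only 10 patients, each validation split holds just 2, so a class that appears in few patients is easily absent from some folds' splits.

```python
from sklearn.model_selection import KFold

# Made-up cohort: 8 patients contain only class 1; 2 also contain the rare class 2.
patient_classes = {'pat%d' % i: {1} if i < 8 else {1, 2} for i in range(10)}
pids = sorted(patient_classes)

for fold, (_, val_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(pids)):
    # classes present anywhere in this fold's 2-patient validation split
    val_classes = set().union(*(patient_classes[pids[i]] for i in val_idx))
    print('fold %d: val classes %s' % (fold, sorted(val_classes)))  # some folds lack class 2
```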