Error during training - Githubissues

theodupuis commented 3 years ago

Bug: During the training phase

    File "/anaconda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 161, in scale
      assert outputs.is_cuda or outputs.device.type == 'xla'
    AssertionError
    Exception ignored in: <function tqdm.__del__ at 0x7f9ba338de50>
    Traceback (most recent call last):
      File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1145, in __del__
      File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1299, in close
      File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1492, in display
      File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1148, in __str__
      File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1450, in format_dict
    TypeError: cannot unpack non-iterable NoneType object

Environment: set up from source, not Docker. Command: nndet_train 1000 --sweep

It seems the issue is related to TensorMetric not being moved to the CUDA device. It is the same issue as the one addressed in https://github.com/PyTorchLightning/pytorch-lightning/issues/2274.

theodupuis commented 3 years ago

This has been addressed in recent commits.

mibaumgartner commented 3 years ago

Hi @theodupuis ,

I'll look into a better solution for this one.

The temporary workaround is to set move_metrics_to_cpu=False, but I'm not really happy with that. If you encounter any memory leaks, set it to True and downgrade PyTorch Lightning for now.

Best, Michael

theodupuis commented 3 years ago

Hi,

Thank you for your help!

Best regards, Théo

theodupuis commented 3 years ago

Hi, one last question, if I may: now that the training is running, I reached epoch 3 overnight (500 images of 512x512x100). Is it normal for it to take this long?

mibaumgartner commented 3 years ago

Hi,

The training time of nnDetection should be roughly equal for most data sets (there are some exceptions): about 2 days with the mixed-precision 3D speedup and about 4 days without it. Your training sounds quite slow, though. Generally speaking, there could be two reasons:

1. PyTorch < 1.9 did not provide the training speedup for mixed-precision 3D convolutions in its pip-installable version, so it was necessary to build PyTorch from source. I haven't tested PyTorch 1.9 yet. (The Docker build of nnDetection also provides the speedup.)
2. There is a bottleneck in your configuration or setup. This can be identified as follows:
   - Check the GPU utilization: it should be high most of the time. If it isn't, there is either a CPU or an IO bottleneck. If it is high, the cause is the missing PyTorch speedup.
   - Check the CPU utilization: if it is high (and the GPU utilization isn't), more CPU threads are needed for augmentation (this can be adjusted via det_num_threads and depends on your CPU).
   - If both GPU and CPU utilization are low, it is an IO bottleneck, which is quite hard to do anything about (a typical SSD with ~500 MB/s read speed worked fine for my experiments).
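
The checks above can be written down as a small decision function. This is only an illustrative sketch: the function name `diagnose_bottleneck`, the `ADVICE` table, and the 80% cut-off for "high" utilization are assumptions made for the example and are not part of nnDetection. The utilization numbers would come from whatever monitoring tool you already use.

```python
# Hypothetical helper, not part of nnDetection: maps average GPU/CPU
# utilization (in percent) to the likely bottleneck described above.
ADVICE = {
    "gpu": "GPU is busy: the slowdown is likely the missing mixed-precision 3D conv speedup.",
    "cpu": "CPU is saturated while the GPU waits: use more augmentation threads (det_num_threads).",
    "io": "GPU and CPU are both idle: data loading from disk is the likely bottleneck.",
}


def diagnose_bottleneck(gpu_util: float, cpu_util: float, high: float = 80.0) -> str:
    """Return "gpu", "cpu" or "io".

    `high` is an assumed threshold for "high utilization"; tune it for your setup.
    """
    if gpu_util >= high:
        return "gpu"
    if cpu_util >= high:
        return "cpu"
    return "io"


# Example readings (made-up numbers for illustration only).
for gpu, cpu in [(95.0, 40.0), (30.0, 98.0), (25.0, 20.0)]:
    kind = diagnose_bottleneck(gpu, cpu)
    print(f"GPU {gpu:.0f}% / CPU {cpu:.0f}% -> {ADVICE[kind]}")
```

The GPU check comes first because a busy GPU rules out both the CPU and the IO explanation, which mirrors the order of the checks in the list.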

Best, Michael

theodupuis commented 3 years ago

Hi,

Thank you for all your answers. One last question and then I'll stop bothering you: if I understand the algorithm correctly, you train several models with different parameters to choose the so-called "empirical parameters", right? Hence the 5 days of training. If so, how many models are created during the training phase?

Best regards, Théo

mibaumgartner commented 3 years ago

Hi @theodupuis ,

During the training process, only a single model is trained. The empirical parameters refer to several postprocessing parameters (i.e. IoU threshold for NMS, IoU threshold for Weighted Box Clustering) which do not require additional models (it is not a classical AutoML approach where models are trained several times). Those parameters are optimized by empirically trying them on the validation data.
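
To make the idea concrete, here is a toy sketch of such a sweep: the predictions of one already-trained model are fixed, and only the NMS IoU threshold is varied and scored on validation data. Everything in it is an assumption for illustration: it uses 2D boxes instead of 3D, a simple F1 score as the validation metric, and made-up boxes and candidate thresholds. It is not nnDetection's actual sweep code.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), 2D for brevity


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_thresh: float) -> List[int]:
    """Greedy NMS: keep a box unless it overlaps a higher-scoring kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


def f1_score(pred: Sequence[Box], gt: Sequence[Box], match_iou: float = 0.5) -> float:
    """Toy validation metric: F1 with greedy one-to-one matching (an assumption)."""
    unmatched = list(gt)
    tp = 0
    for p in pred:
        hit = next((g for g in unmatched if iou(p, g) >= match_iou), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp, fn = len(pred) - tp, len(unmatched)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def sweep_nms_threshold(boxes, scores, gt, candidates):
    """Try each candidate threshold on fixed predictions; no retraining involved."""
    best_t, best_score = None, -1.0
    for t in candidates:
        kept = [boxes[i] for i in nms(boxes, scores, t)]
        score = f1_score(kept, gt)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Made-up validation case: one duplicate detection and two slightly overlapping objects.
gt = [(0, 0, 10, 10), (20, 20, 30, 30), (26, 20, 36, 30)]
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (26, 20, 36, 30)]
scores = [0.9, 0.8, 0.7, 0.6]
best_t, best_score = sweep_nms_threshold(boxes, scores, gt, [0.1, 0.5, 0.9])
print(best_t, best_score)
```

In this toy case a threshold that is too low suppresses one of the two neighbouring objects, and one that is too high keeps the duplicate, so the middle candidate scores best. The point is only that the model's raw predictions stay fixed while the postprocessing parameter changes.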

Best, Michael

MIC-DKFZ / nnDetection

Error during training #25