Open DanielNajarian opened 1 year ago
Are you running the notebook inside a Docker container? This looks like a dependency issue (the notebook is running against different dependency versions than it expects). Please see https://ploomber.io/blog/notebook-to-docker/ for how to run Jupyter Notebook inside a container.
I'm running it from the command line, and I built the environment from their requirements files.
What versions for PyTorch and NVIDIA DALI are you using?
I am using torch 1.13.1+cu116 and nvidia-dali-cuda110 1.26.0. Looking at it now, DALI should be cuda116, correct? But there doesn't seem to be a cuda116 version of it.
The 22.11 container ships DALI 1.18.0 (see here). Did you reinstall it manually?
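For anyone comparing versions the same way, here is a minimal sketch that reports what is actually installed and pulls out the CUDA tags from the torch version string and the DALI package name. The helper names (`cuda_tag_from_torch`, `cuda_tag_from_dali_package`, `installed_version`) are hypothetical, not part of torch or DALI, and the package list is an assumption based on the versions quoted above. My understanding is that DALI wheels are published per CUDA major version, which would explain why no cuda116 package exists, but please verify that against the DALI installation docs.

```python
import re
from importlib import metadata
from typing import Optional


def cuda_tag_from_torch(version: str) -> Optional[str]:
    """Return the CUDA tag of a torch version string, e.g. '1.13.1+cu116' -> '116'."""
    match = re.search(r"\+cu(\d+)$", version)
    return match.group(1) if match else None


def cuda_tag_from_dali_package(name: str) -> Optional[str]:
    """Return the CUDA tag of a DALI package name, e.g. 'nvidia-dali-cuda110' -> '110'."""
    match = re.search(r"cuda(\d+)$", name)
    return match.group(1) if match else None


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names are assumptions taken from the versions mentioned in this thread.
    for package in ("torch", "torchvision", "nvidia-dali-cuda110"):
        print(f"{package}: {installed_version(package) or 'not installed'}")

    torch_version = installed_version("torch")
    if torch_version is not None:
        print("torch CUDA tag:", cuda_tag_from_torch(torch_version))
    print("DALI CUDA tag:", cuda_tag_from_dali_package("nvidia-dali-cuda110"))
```

Posting this output alongside the container tag makes it easier to see which packages were changed relative to the stock image.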
I had to manually reinstall a few packages since the torch and torchvision CUDA versions weren't lined up and I had trouble getting 117 to work on both, so I went down to 116 and changed some stuff as a result.
Should I focus on the 22.02 container instead, since it lines up with CUDA 11.6, which matches my torch build? That would be DALI 1.10.
You can experiment with different versions. I would start with DALI 1.18.0 (or simply not reinstall it inside the container).
What error log did you get when running the container without any modifications?
Hi, have you figured it out?
When running the BraTS 2021 notebook (located at PyTorch/Segmentation/nnUNet/notebooks/BraTS21.ipynb) training section, the model is not properly training even though it is going through the steps, as seen in the image below. The Dice is stuck at an extremely low value and neither that nor the loss changes at all over the epochs. The "DALI iterator does not support resetting while epoch is not finished" warning comes up on every epoch but that is not something that I have touched.
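To make the "nothing changes over the epochs" symptom easy to check from logged values rather than from a screenshot, here is a minimal sketch. `is_stalled` is a hypothetical helper, not part of the nnUNet code, and the tolerance is an arbitrary assumption.

```python
from typing import Sequence


def is_stalled(values: Sequence[float], tolerance: float = 1e-6) -> bool:
    """Return True if a per-epoch metric never moves by more than `tolerance`."""
    if len(values) < 2:
        # A single epoch cannot show whether the metric is changing.
        return False
    return max(values) - min(values) <= tolerance


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the run described above.
    print(is_stalled([0.02, 0.02, 0.02, 0.02, 0.02]))
    print(is_stalled([0.95, 0.71, 0.58, 0.49, 0.44]))
```

If both the loss and the Dice come back as stalled from the first epoch, that points at the data pipeline or the environment rather than at the hyperparameters.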
To Reproduce Steps to reproduce the behavior:
Expected behavior: I expected the model to train and reach a Dice of at least 70 after 5 epochs.
Environment Please provide at least: