Closed ecm200 closed 3 years ago
Hi @ecm200
This cuDNN error: CUDNN_STATUS_EXECUTION_FAILED seems like a CUDA / cuDNN mismatch issue.
From the log, it seems clearml-agent installed the correct PyTorch version (based on the auto-detected CUDA 11.1 version).
Is this the same setup that worked on your development machine?
(Basically I suspect this is not a direct issue of clearml but a CUDA/PyTorch thing)
BTW: Running the clearml-agent in docker mode would solve such issues, as you will have the ability to launch the code inside a container with the correct CUDA support.
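One way to check for the suspected CUDA / cuDNN mismatch is to print what the installed PyTorch build itself reports, once in the agent-built environment and once in the environment that works. The following is a minimal sketch; the function name describe_gpu_stack is made up for illustration, and it only reads attributes PyTorch exposes. It returns a placeholder if PyTorch is not importable.

```python
import importlib


def describe_gpu_stack():
    """Return the CUDA/cuDNN versions that the installed PyTorch build reports."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this interpreter; nothing to report.
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        # CUDA version the PyTorch binary was built against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        # cuDNN version PyTorch was built with (None if cuDNN is unavailable).
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_0"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_gpu_stack().items():
        print(f"{key}: {value}")
```

Running this in both environments and comparing the output line by line should show whether the two PyTorch builds differ in their CUDA or cuDNN versions.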
Hi @bmartinn,
I think you're right; however, I think this boils down to how you build environments on the remote machine.
When I set up my development environments on remote machines to work directly on them (i.e. not with ClearML) I tend to default to Conda for most packages and then use PIP when packages are not available on CondaCloud.
I have gone onto the remote compute resource, and executed the code locally, using the virtual environment created by the clearml-agent, and I get the same error.
I have also created a new conda environment on the same machine, using mostly conda to install the package dependencies as described above, and executed the exact same code as I did before, but it is now computing fine and iterating and logging as expected into clearml-server.
(py38_pytorch18) edmorris@ecm-clearml-compute-gpu-001:~/.clearml/venvs-builds/3.8/task_repository/caltech_birds.git/scripts$ python local_train_clearml_pytorch_ignite_caltech_birds.py --config configs/torchvision/resnet34_config.yaml
usage: local_train_clearml_pytorch_ignite_caltech_birds.py [-h] [--config FILE] [--opts ...]
PyTorch Image Classification Trainer - Ed Morris (c) 2021
optional arguments:
-h, --help show this help message and exit
--config FILE Path and name of configuration file for training. Should be a .yaml file.
--opts ... Modify config options using the command-line 'KEY VALUE' pairs
ClearML Task: overwriting (reusing) task id=ea8903a29bf443d5ab469f9c56c2a8b5
ClearML results page: http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8080/projects/30034b3199e24123896c8eff9bf16d29/experiments/ea8903a29bf443d5ab469f9c56c2a8b5/output/log
2021-05-20 09:29:55,856 - clearml.task - WARNING - Requirement ignored, Task.add_requirements() must be called before Task.init()
2021-05-20 09:29:55,861 - clearml.task - WARNING - Requirement ignored, Task.add_requirements() must be called before Task.init()
{'MODEL.MODEL_LIBRARY': 'torchvision', 'MODEL.MODEL_NAME': 'resnet34', 'MODEL.PRETRAINED': True, 'MODEL.WITH_AMP': False, 'MODEL.WITH_GRAD_SCALE': False, 'TRAIN.BATCH_SIZE': 16, 'TRAIN.NUM_WORKERS': 4, 'TRAIN.NUM_EPOCHS': 40, 'TRAIN.LOSS.CRITERION': 'CrossEntropy', 'TRAIN.OPTIMIZER.TYPE': 'SGD', 'TRAIN.OPTIMIZER.PARAMS.lr': 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum': 0.9, 'TRAIN.OPTIMIZER.PARAMS.nesterov': True, 'TRAIN.SCHEDULER.TYPE': 'StepLR', 'TRAIN.SCHEDULER.PARAMS.step_size': 7, 'TRAIN.SCHEDULER.PARAMS.gamma': 0.1, 'EARLY_STOPPING_PATIENCE': 5, 'DIRS.ROOT_DIR': '/home/edmorris/projects/image_classification/caltech_birds', 'DIRS.WORKING_DIR': 'models/classification', 'DIRS.CLEAN_UP': True, 'DATA.DATA_DIR': 'data/images', 'DATA.TRAIN_DIR': 'train', 'DATA.TEST_DIR': 'test', 'DATA.NUM_CLASSES': 200, 'DATA.TRANSFORMS.TYPE': 'default', 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_crop_size': 224, 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_resize': 256, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.type': 'all', 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.persp_distortion_scale': 0.25, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.rotation_range': (-10.0, 10.0), 'SYSTEM.LOG_HISTORY': True}
[INFO] Getting a local copy of the CUB200 birds datasets
[INFO] Default location of training dataset:: http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8081
[INFO] Default location of training dataset:: /home/edmorris/.clearml/cache/storage_manager/datasets/ds_0ccff21334e84b3d8e0618c5f1734cc8
[INFO] Default location of testing dataset:: http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8081
[INFO] Default location of testing dataset:: /home/edmorris/.clearml/cache/storage_manager/datasets/ds_b435c4ffda374bca83d9a746137dc3ca
[INFO] Task output destination::
[INFO] Final parameter list passed to Trainer object:: ['MODEL.MODEL_LIBRARY', 'torchvision', 'MODEL.MODEL_NAME', 'resnet34', 'MODEL.PRETRAINED', True, 'MODEL.WITH_AMP', False, 'MODEL.WITH_GRAD_SCALE', False, 'TRAIN.BATCH_SIZE', 16, 'TRAIN.NUM_WORKERS', 4, 'TRAIN.NUM_EPOCHS', 40, 'TRAIN.LOSS.CRITERION', 'CrossEntropy', 'TRAIN.OPTIMIZER.TYPE', 'SGD', 'TRAIN.OPTIMIZER.PARAMS.lr', 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum', 0.9, 'TRAIN.OPTIMIZER.PARAMS.nesterov', True, 'TRAIN.SCHEDULER.TYPE', 'StepLR', 'TRAIN.SCHEDULER.PARAMS.step_size', 7, 'TRAIN.SCHEDULER.PARAMS.gamma', 0.1, 'EARLY_STOPPING_PATIENCE', 5, 'DIRS.ROOT_DIR', '/home/edmorris/projects/image_classification/caltech_birds', 'DIRS.WORKING_DIR', 'models/classification', 'DIRS.CLEAN_UP', True, 'DATA.DATA_DIR', 'data/images', 'DATA.TRAIN_DIR', 'train', 'DATA.TEST_DIR', 'test', 'DATA.NUM_CLASSES', 200, 'DATA.TRANSFORMS.TYPE', 'default', 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_crop_size', 224, 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_resize', 256, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.type', 'all', 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.persp_distortion_scale', 0.25, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.rotation_range', (-10.0, 10.0), 'SYSTEM.LOG_HISTORY', True, 'DIRS.ROOT_DIR', '', 'DATA.DATA_DIR', '/home/edmorris/.clearml/cache/storage_manager/datasets', 'DATA.TRAIN_DIR', 'ds_0ccff21334e84b3d8e0618c5f1734cc8', 'DATA.TEST_DIR', 'ds_b435c4ffda374bca83d9a746137dc3ca', 'DIRS.WORKING_DIR', '/home/edmorris/.clearml/cache/ea8903a29bf443d5ab469f9c56c2a8b5']
[INFO] Parameters Override:: ['MODEL.MODEL_LIBRARY', 'torchvision', 'MODEL.MODEL_NAME', 'resnet34', 'MODEL.PRETRAINED', True, 'MODEL.WITH_AMP', False, 'MODEL.WITH_GRAD_SCALE', False, 'TRAIN.BATCH_SIZE', 16, 'TRAIN.NUM_WORKERS', 4, 'TRAIN.NUM_EPOCHS', 40, 'TRAIN.LOSS.CRITERION', 'CrossEntropy', 'TRAIN.OPTIMIZER.TYPE', 'SGD', 'TRAIN.OPTIMIZER.PARAMS.lr', 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum', 0.9, 'TRAIN.OPTIMIZER.PARAMS.nesterov', True, 'TRAIN.SCHEDULER.TYPE', 'StepLR', 'TRAIN.SCHEDULER.PARAMS.step_size', 7, 'TRAIN.SCHEDULER.PARAMS.gamma', 0.1, 'EARLY_STOPPING_PATIENCE', 5, 'DIRS.ROOT_DIR', '/home/edmorris/projects/image_classification/caltech_birds', 'DIRS.WORKING_DIR', 'models/classification', 'DIRS.CLEAN_UP', True, 'DATA.DATA_DIR', 'data/images', 'DATA.TRAIN_DIR', 'train', 'DATA.TEST_DIR', 'test', 'DATA.NUM_CLASSES', 200, 'DATA.TRANSFORMS.TYPE', 'default', 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_crop_size', 224, 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_resize', 256, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.type', 'all', 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.persp_distortion_scale', 0.25, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.rotation_range', (-10.0, 10.0), 'SYSTEM.LOG_HISTORY', True, 'DIRS.ROOT_DIR', '', 'DATA.DATA_DIR', '/home/edmorris/.clearml/cache/storage_manager/datasets', 'DATA.TRAIN_DIR', 'ds_0ccff21334e84b3d8e0618c5f1734cc8', 'DATA.TEST_DIR', 'ds_b435c4ffda374bca83d9a746137dc3ca', 'DIRS.WORKING_DIR', '/home/edmorris/.clearml/cache/ea8903a29bf443d5ab469f9c56c2a8b5']
DATA:
  DATA_DIR: /home/edmorris/.clearml/cache/storage_manager/datasets
  NUM_CLASSES: 200
  TEST_DIR: ds_b435c4ffda374bca83d9a746137dc3ca
  TRAIN_DIR: ds_0ccff21334e84b3d8e0618c5f1734cc8
  TRANSFORMS:
    PARAMS:
      AGGRESIVE:
        persp_distortion_scale: 0.25
        rotation_range: (-10.0, 10.0)
        type: all
      DEFAULT:
        img_crop_size: 224
        img_resize: 256
    TYPE: default
DIRS:
  CLEAN_UP: True
  ROOT_DIR:
  WORKING_DIR: /home/edmorris/.clearml/cache/ea8903a29bf443d5ab469f9c56c2a8b5/ignite_resnet34
EARLY_STOPPING_PATIENCE: 5
MODEL:
  MODEL_LIBRARY: torchvision
  MODEL_NAME: resnet34
  PRETRAINED: True
  WITH_AMP: False
  WITH_GRAD_SCALE: False
SYSTEM:
  LOG_HISTORY: True
TRAIN:
  BATCH_SIZE: 16
  LOSS:
    CRITERION: CrossEntropy
  NUM_EPOCHS: 40
  NUM_WORKERS: 4
  OPTIMIZER:
    PARAMS:
      lr: 0.001
      momentum: 0.9
      nesterov: True
    TYPE: SGD
  SCHEDULER:
    PARAMS:
      gamma: 0.1
      step_size: 7
    TYPE: StepLR
[INFO] Creating data transforms...
[INFO] Creating data loaders...
***********************************************
** DATASET SUMMARY **
***********************************************
train size:: 5994 images
test size:: 5794 images
Number of classes:: 200
***********************************************
[INFO] Created data loaders.
[INFO] Creating the model...
2021-05-20 09:29:57,271 - clearml.model - INFO - Selected model id: 8df52efca2684e5f8b727fa928623a82
[INFO] Successfully created model and pushed it to the device cuda:0
[INFO] Creating optimizer...
[INFO] Successfully created optimizer object.
[INFO] Successfully created learning rate scheduler object.
[INFO] Trainer pass OK for training.
Tensorboard Logging...done
[INFO] Creating callback functions for training loop...Early Stopping (5 epochs)...Model Checkpointing...Done
[INFO] Executing model training...
Epoch: 0001 TrAcc: 0.301 ValAcc: 0.298 TrPrec: 0.406 ValPrec: 0.380 TrRec: 0.301 ValRec: 0.302 TrF1: 0.267 ValF1: 0.258 TrTopK: 0.614 ValTopK: 0.647 TrLoss: 3.509 ValLoss: 3.293
Epoch: 0002 TrAcc: 0.480 ValAcc: 0.492 TrPrec: 0.573 ValPrec: 0.580 TrRec: 0.480 ValRec: 0.496 TrF1: 0.461 ValF1: 0.465 TrTopK: 0.786 ValTopK: 0.832 TrLoss: 2.349 ValLoss: 2.117
Epoch: 0003 TrAcc: 0.593 ValAcc: 0.606 TrPrec: 0.652 ValPrec: 0.656 TrRec: 0.593 ValRec: 0.609 TrF1: 0.583 ValF1: 0.590 TrTopK: 0.848 ValTopK: 0.889 TrLoss: 1.819 ValLoss: 1.569
Epoch: 0004 TrAcc: 0.661 ValAcc: 0.659 TrPrec: 0.705 ValPrec: 0.695 TrRec: 0.661 ValRec: 0.661 TrF1: 0.656 ValF1: 0.651 TrTopK: 0.869 ValTopK: 0.907 TrLoss: 1.493 ValLoss: 1.315
Epoch: 0005 TrAcc: 0.693 ValAcc: 0.688 TrPrec: 0.742 ValPrec: 0.726 TrRec: 0.693 ValRec: 0.692 TrF1: 0.692 ValF1: 0.682 TrTopK: 0.894 ValTopK: 0.926 TrLoss: 1.290 ValLoss: 1.173
Epoch: 0006 TrAcc: 0.726 ValAcc: 0.719 TrPrec: 0.763 ValPrec: 0.740 TrRec: 0.726 ValRec: 0.720 TrF1: 0.726 ValF1: 0.712 TrTopK: 0.901 ValTopK: 0.931 TrLoss: 1.176 ValLoss: 1.054
Epoch: 0007 TrAcc: 0.746 ValAcc: 0.728 TrPrec: 0.779 ValPrec: 0.743 TrRec: 0.746 ValRec: 0.730 TrF1: 0.747 ValF1: 0.722 TrTopK: 0.908 ValTopK: 0.938 TrLoss: 1.068 ValLoss: 0.975
Epoch: 0008 TrAcc: 0.781 ValAcc: 0.763 TrPrec: 0.796 ValPrec: 0.769 TrRec: 0.781 ValRec: 0.766 TrF1: 0.782 ValF1: 0.760 TrTopK: 0.924 ValTopK: 0.946 TrLoss: 0.944 ValLoss: 0.888
Epoch: 0009 TrAcc: 0.784 ValAcc: 0.770 TrPrec: 0.798 ValPrec: 0.774 TrRec: 0.784 ValRec: 0.772 TrF1: 0.786 ValF1: 0.768 TrTopK: 0.923 ValTopK: 0.948 TrLoss: 0.918 ValLoss: 0.864
So it does look like this is something to do with the PyTorch installation using PIP, as this is the difference between the clearml-agent derived environment and the conda environment I created manually.
So I am wondering: if you have an issue like this, where most of your packages come from Conda but a few are only available through PIP, how can this be handled?
I could create a YAML file of the Conda environment that is now successfully running the code, and that could be used to create a conda environment, but how can this be used in conjunction with clearml to do that automatically when an experiment is cloned and executed?
@bmartinn
Update, I created another environment manually on the compute server, using CONDA to create the environment object, but then I installed all packages, including PyTorch using PIP and NOT CONDA. I made sure the versions matched those picked up by the dependency map created by clearml.
Executing the code in this environment caused the same issue as the clearml-agent created environment, as these both installed PyTorch using PIP.
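To confirm that two environments really pin the same versions, the output of `pip freeze` from each can be compared. The following is a minimal sketch using only the standard library; the helper names parse_pins and diff_pins and the pins in the example are illustrative assumptions, not taken from the actual environments.

```python
def parse_pins(lines):
    """Map package name -> version for 'name==version' lines (pip freeze style)."""
    pins = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def diff_pins(left, right):
    """Return {name: (left_version, right_version)} for packages that differ."""
    names = sorted(set(left) | set(right))
    return {
        name: (left.get(name), right.get(name))
        for name in names
        if left.get(name) != right.get(name)
    }


if __name__ == "__main__":
    # Illustrative pins only; replace with the real `pip freeze` output.
    agent_env = parse_pins(["torch==1.8.1+cu111", "yacs==0.1.8"])
    manual_env = parse_pins(["torch==1.8.1", "yacs==0.1.8"])
    for name, (a, b) in diff_pins(agent_env, manual_env).items():
        print(f"{name}: {a} != {b}")
```

A package missing from one side shows up with None as its version, which makes it easy to spot dependencies that only one environment installed.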
So is there any way to use both CONDA and PIP, like I do when creating environments manually, so as to install most packages from CONDA and only what isn't available there using PIP?
The YAML file exported from a CONDA environment contains a package list that differentiates between package sources, either CONDA or PIP. The command is as follows, and should be run inside the environment you want to capture:
conda env export > environment_specs.yml
This results in a YAML file as follows:
name: py38_pytorch18
channels:
- pytorch
- nvidia
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- absl-py=0.12.0=py38h06a4308_0
- aiohttp=3.6.3=py38h7b6447c_0
- async-timeout=3.0.1=py38h06a4308_0
- attrs=21.2.0=pyhd3eb1b0_0
- blas=1.0=mkl
- blinker=1.4=py38h06a4308_0
- blosc=1.19.0=hd408876_0
- brotli=1.0.9=he6710b0_2
- brotlipy=0.7.0=py38h27cfd23_1003
- bzip2=1.0.8=h7b6447c_0
- c-ares=1.17.1=h27cfd23_0
- ca-certificates=2021.4.13=h06a4308_1
- cachetools=4.2.2=pyhd3eb1b0_0
- certifi=2020.12.5=py38h06a4308_0
- cffi=1.14.5=py38h261ae71_0
- chardet=3.0.4=py38h06a4308_1003
- charls=2.1.0=he6710b0_2
- click=8.0.0=pyhd3eb1b0_0
- cloudpickle=1.6.0=py_0
- coverage=5.5=py38h27cfd23_2
- cryptography=3.4.7=py38hd23ed53_0
- cudatoolkit=11.1.74=h6bb024c_0
- cycler=0.10.0=py38_0
- cython=0.29.23=py38h2531618_0
- cytoolz=0.11.0=py38h7b6447c_0
- dask-core=2021.5.0=pyhd3eb1b0_0
- dbus=1.13.18=hb2f20db_0
- decorator=5.0.9=pyhd3eb1b0_0
- expat=2.3.0=h2531618_2
- ffmpeg=4.3=hf484d3e_0
- fontconfig=2.13.1=h6c09931_0
- freetype=2.10.4=h5ab3b9f_0
- fsspec=0.9.0=pyhd3eb1b0_0
- giflib=5.1.4=h14c3975_1
- glib=2.68.2=h36276a3_0
- gmp=6.2.1=h2531618_2
- gnutls=3.6.15=he1e5248_0
- google-auth=1.30.0=pyhd3eb1b0_0
- google-auth-oauthlib=0.4.4=pyhd3eb1b0_0
- grpcio=1.36.1=py38h2157cd5_1
- gst-plugins-base=1.14.0=h8213a91_2
- gstreamer=1.14.0=h28cd5cc_2
- icu=58.2=he6710b0_3
- idna=2.10=pyhd3eb1b0_0
- ignite=0.4.4=py_0
- imagecodecs=2020.5.30=py38h567f118_1
- imageio=2.9.0=pyhd3eb1b0_0
- importlib-metadata=3.10.0=py38h06a4308_0
- intel-openmp=2021.2.0=h06a4308_610
- joblib=1.0.1=pyhd3eb1b0_0
- jpeg=9b=h024ee3a_2
- jxrlib=1.1=h7b6447c_2
- kiwisolver=1.3.1=py38h2531618_0
- lame=3.100=h7b6447c_0
- lcms2=2.12=h3be6417_0
- ld_impl_linux-64=2.33.1=h53a641e_7
- libaec=1.0.4=he6710b0_1
- libffi=3.3=he6710b0_2
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran-ng=7.3.0=hdf63c60_0
- libiconv=1.15=h63c8f33_5
- libidn2=2.3.1=h27cfd23_0
- libpng=1.6.37=hbc83047_0
- libprotobuf=3.14.0=h8c45485_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- libtasn1=4.16.0=h27cfd23_0
- libtiff=4.1.0=h2733197_1
- libunistring=0.9.10=h27cfd23_0
- libuuid=1.0.3=h1bed415_2
- libuv=1.40.0=h7b6447c_0
- libwebp=1.0.1=h8e7db2f_0
- libxcb=1.14=h7b6447c_0
- libxml2=2.9.10=hb55368b_3
- libzopfli=1.0.3=he6710b0_0
- locket=0.2.1=py38h06a4308_1
- lz4-c=1.9.3=h2531618_0
- markdown=3.3.4=py38h06a4308_0
- matplotlib=3.3.4=py38h06a4308_0
- matplotlib-base=3.3.4=py38h62a2d02_0
- mkl=2021.2.0=h06a4308_296
- mkl-service=2.3.0=py38h27cfd23_1
- mkl_fft=1.3.0=py38h42c9631_2
- mkl_random=1.2.1=py38ha9443f7_2
- multidict=4.7.6=py38h7b6447c_1
- ncurses=6.2=he6710b0_1
- nettle=3.7.2=hbbd107a_1
- networkx=2.5=py_0
- ninja=1.10.2=hff7bd54_1
- numpy=1.20.2=py38h2d18471_0
- numpy-base=1.20.2=py38hfae3a4d_0
- oauthlib=3.1.0=py_0
- olefile=0.46=py_0
- openh264=2.1.0=hd408876_0
- openjpeg=2.3.0=h05c96fa_1
- openssl=1.1.1k=h27cfd23_0
- pandas=1.2.4=py38h2531618_0
- partd=1.2.0=pyhd3eb1b0_0
- pcre=8.44=he6710b0_0
- pillow=8.2.0=py38he98fc37_0
- pip=21.0.1=py38h06a4308_0
- protobuf=3.14.0=py38h2531618_1
- pyasn1=0.4.8=py_0
- pyasn1-modules=0.2.8=py_0
- pycparser=2.20=py_2
- pyjwt=1.7.1=py38_0
- pyopenssl=20.0.1=pyhd3eb1b0_1
- pyparsing=2.4.7=pyhd3eb1b0_0
- pyqt=5.9.2=py38h05f1152_4
- pysocks=1.7.1=py38h06a4308_0
- python=3.8.10=hdb3f193_7
- python-dateutil=2.8.1=pyhd3eb1b0_0
- pytorch=1.8.1=py3.8_cuda11.1_cudnn8.0.5_0
- pytz=2021.1=pyhd3eb1b0_0
- pywavelets=1.1.1=py38h7b6447c_2
- pyyaml=5.4.1=py38h27cfd23_1
- qt=5.9.7=h5867ecd_1
- readline=8.1=h27cfd23_0
- requests=2.25.1=pyhd3eb1b0_0
- requests-oauthlib=1.3.0=py_0
- rsa=4.7.2=pyhd3eb1b0_1
- scikit-image=0.18.1=py38ha9443f7_0
- scikit-learn=0.24.2=py38ha9443f7_0
- scipy=1.6.2=py38had2a1c9_1
- setuptools=52.0.0=py38h06a4308_0
- sip=4.19.13=py38he6710b0_0
- six=1.15.0=py38h06a4308_0
- snappy=1.1.8=he6710b0_0
- sqlite=3.35.4=hdfb4753_0
- tensorboard=2.4.0=pyhc547734_0
- tensorboard-plugin-wit=1.6.0=py_0
- threadpoolctl=2.1.0=pyh5ca1d4c_0
- tifffile=2021.3.31=pyhd3eb1b0_1
- tk=8.6.10=hbc83047_0
- toolz=0.11.1=pyhd3eb1b0_0
- torchaudio=0.8.1=py38
- torchvision=0.9.1=py38_cu111
- tornado=6.1=py38h27cfd23_0
- typing_extensions=3.7.4.3=pyha847dfd_0
- urllib3=1.26.4=pyhd3eb1b0_0
- werkzeug=1.0.1=pyhd3eb1b0_0
- wheel=0.36.2=pyhd3eb1b0_0
- xz=5.2.5=h7b6447c_0
- yaml=0.2.5=h7b6447c_0
- yarl=1.6.3=py38h27cfd23_0
- zipp=3.4.1=pyhd3eb1b0_0
- zlib=1.2.11=h7b6447c_3
- zstd=1.4.9=haebb681_0
- pip:
  - backcall==0.2.0
  - clearml==1.0.2
  - coveralls==3.0.1
  - cub-tools==1.0.0
  - docopt==0.6.2
  - furl==2.1.2
  - humanfriendly==9.1
  - imutils==0.5.4
  - iniconfig==1.1.1
  - ipython==7.23.1
  - ipython-genutils==0.2.0
  - jedi==0.18.0
  - jsonschema==3.2.0
  - kornia==0.4.1
  - matplotlib-inline==0.1.2
  - orderedmultidict==1.0.1
  - packaging==20.9
  - parso==0.8.2
  - pathlib2==2.3.5
  - pexpect==4.8.0
  - pickleshare==0.7.5
  - pluggy==0.13.1
  - prompt-toolkit==3.0.18
  - ptyprocess==0.7.0
  - py==1.10.0
  - pygments==2.9.0
  - pyrsistent==0.17.3
  - pytest==6.2.4
  - pytest-mock==3.6.1
  - pytorchcv==0.0.65
  - timm==0.4.9
  - toml==0.10.2
  - torch-lucent==0.1.8
  - tqdm==4.60.0
  - traitlets==5.0.5
  - wcwidth==0.2.5
  - yacs==0.1.8
prefix: /home/edmorris/.conda/envs/py38_pytorch18
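The split between CONDA and PIP sources in an export like the one above can also be read back programmatically. The following is a rough sketch that avoids a YAML library and relies only on the line layout `conda env export` produces; the function name split_conda_export is hypothetical, and SAMPLE is a shortened excerpt of the file above.

```python
SAMPLE = """\
name: py38_pytorch18
channels:
  - pytorch
  - nvidia
  - defaults
dependencies:
  - cudatoolkit=11.1.74=h6bb024c_0
  - pytorch=1.8.1=py3.8_cuda11.1_cudnn8.0.5_0
  - torchvision=0.9.1=py38_cu111
  - pip:
    - clearml==1.0.2
    - timm==0.4.9
prefix: /home/edmorris/.conda/envs/py38_pytorch18
"""


def split_conda_export(text):
    """Split a `conda env export` file into conda-managed and pip-managed specs."""
    conda_specs, pip_specs = [], []
    section = None  # None until the "dependencies:" key is reached
    for raw in text.splitlines():
        line = raw.strip()
        if line == "dependencies:":
            section = "conda"
        elif section is None or not line:
            continue
        elif line == "- pip:":
            section = "pip"
        elif line.startswith("- "):
            target = pip_specs if section == "pip" else conda_specs
            target.append(line[2:].strip())
        else:
            # Any other key (e.g. "prefix:") ends the dependency list.
            section = None
    return conda_specs, pip_specs


if __name__ == "__main__":
    conda_specs, pip_specs = split_conda_export(SAMPLE)
    print("conda:", conda_specs)
    print("pip:", pip_specs)
```

Because it only looks at stripped lines, the sketch assumes the pip entries are listed last under dependencies, which is how `conda env export` writes them.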
@bmartinn
Confirmed, this is an issue with the PyTorch installation using PIP as the package manager. It's been open 27 days and it doesn't look like there has been a resolution other than to use CONDA to install PyTorch into a virtual environment.
@bmartinn,
So my question here is how best to control environment creation on the remote compute end by clearml-agents, when there is a combined requirement for using both Conda and PIP.
@bmartinn
Thanks for all your tips and help.
After a lot of experimenting with various clearml-agent options, including running in docker mode, I started again using the conda package manager as the virtual environment creator on the remote compute node. This time, I could see that the clearml-agent was able to use PIP to install additional packages if they could not be resolved using CONDA. This means that PyTorch was installed using the recommended CONDA method, which circumvented the issues found with using PIP to install PyTorch, as detailed above.
This has led to a successful creation of a training environment on the remote compute and a successful training of PyTorch models.
Description of issue
I am having an issue getting a PyTorch model to train on a remote compute server using clearml, and I am wondering if it is something to do with the virtual environment. I have tried all three methods: pip venv, conda venv and docker. The closest I have come to something working is with the default pip package venv method. The model runs for an iteration and then crashes with a cuDNN error. I can’t help thinking this is something to do with the pip installation of PyTorch, as they recommend using the conda channel to install it. When I set up my local virtual environment I use a combination of conda and pip: conda as my environment manager, and then pip for packages that are not in the conda repositories.
I am running this on a bespoke azure vm image that I created, with 10.1, 10.2, 11.1 and 11.2 CUDA versions installed, and the correct CUDNN libraries supported by the latest nvidia driver. I have verified that the CUDA drivers are working as I have been able to train models in conda environments directly on the machine, but not using clearml. I use the same trainer classes that I have written using Ignite, that I have used with the clearml experiments.
I have made sure that the CUDA version is correctly specified for the version being used by the PyTorch installation, and I have been running the clearml-agent in its own conda environment, with the environment variables pointing towards the correct version of CUDA, in this case 11.1. I have verified that the different versions of CUDA work properly by setting up conda environments with the different versions of PyTorch and successfully training models using 10.2, 11.1 and 11.2, outside of clearml. However, I run into the same cuDNN error on the forward calculation of the first iteration when I run it through clearml.
I have also tried training a variety of network architectures from a number of libraries (Torchvision, pytorchcv, TIMM), as well as a simple VGG implementation from scratch, and come across the same issues.
Is this potentially an issue with having multiple CUDA versions installed on the server?
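With several CUDA versions side by side, it can help to list which toolkits are installed and which one the current shell's environment variables point to. The following is a small sketch under the assumption that the toolkits live in versioned directories such as /usr/local/cuda-11.1 (the conventional Linux location, not confirmed for this VM image); the function names are illustrative.

```python
import os
from pathlib import Path


def find_cuda_toolkits(root="/usr/local"):
    """List versioned CUDA toolkit directories such as cuda-11.1 under root."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.glob("cuda-*") if p.is_dir())


def cuda_related_env(environ=None):
    """Collect the environment settings that commonly select a CUDA toolkit."""
    environ = os.environ if environ is None else environ
    summary = {name: environ.get(name) for name in ("CUDA_HOME", "CUDA_PATH")}
    for name in ("PATH", "LD_LIBRARY_PATH"):
        entries = environ.get(name, "").split(os.pathsep)
        # Keep only the search-path entries that mention CUDA.
        summary[name] = [e for e in entries if "cuda" in e.lower()]
    return summary


if __name__ == "__main__":
    print("installed toolkits:", find_cuda_toolkits())
    for key, value in cuda_related_env().items():
        print(f"{key}: {value}")
```

Running it from the same shell (or service environment) that starts the clearml-agent shows what the agent's child processes would inherit.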
The CUDNN error on execution by clearml-agent
Following execution of the experiment on the remote compute resource, the model trains for an iteration, and then fails with the following error.
clearml-agent environment setup logs
The following shows the terminal logging of the environment setup on execution of the experiment by the clearml-server.
PyTorch Ignite training script
Below is the trainer script, which has been modified to run with clearml. A version of this script without the clearml interface has successfully trained these models on the compute server in a conda environment.