Issue executing PyTorch training on agent remote compute resource

ecm200 commented 3 years ago

Description of issue

I am having an issue getting a PyTorch model to train on a remote compute server using clearml, and I am wondering whether it has something to do with the virtual environment. I have tried all three methods: pip venv, conda venv and docker. The closest I have come to something working is the default pip venv method. The model runs for an iteration and then crashes with a cuDNN error. I can’t help thinking this has something to do with the pip installation of PyTorch, as they recommend using the conda channel to install it. When I set up my local virtual environment I use a combination of conda and pip: conda as my environment manager, and pip for packages that are not in the conda repositories.

I am running this on a bespoke Azure VM image that I created, with CUDA versions 10.1, 10.2, 11.1 and 11.2 installed, along with the cuDNN libraries supported by the latest NVIDIA driver. I have verified that the CUDA drivers are working, as I have been able to train models in conda environments directly on the machine, but not through clearml. In both cases I use the same trainer classes, which I have written using Ignite.

I have made sure that the CUDA version specified matches the one used by the PyTorch installation, and I have been running the clearml-agent in its own conda environment, with the environment variables pointing to the correct CUDA version, in this case 11.1. I have verified that the different versions of CUDA work properly by setting up conda environments with the different versions of PyTorch, and I have successfully trained models using 10.2, 11.1 and 11.2 outside of clearml. However, I run into the same cuDNN error on the forward calculation of the first iteration when I run it through clearml.

I have also tried training a variety of network architectures from a number of libraries (Torchvision, pytorchcv, TIMM), as well as a simple VGG implementation from scratch, and come across the same issues.

Is this potentially an issue with having multiple CUDA versions installed on the server?
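
One way to narrow this down would be to print what the Python process actually sees, once in a working conda environment and once inside the venv built by the agent, and compare the two outputs. The following is only a minimal sketch: the function name `cuda_environment_report` is made up for illustration, and the choice of environment variables to inspect is an assumption about what is relevant on this machine.

```python
import os
import shutil


def cuda_environment_report():
    """Collect the CUDA-related settings visible to the current process."""
    report = {
        # Environment variables commonly used to select a CUDA toolkit
        # (assumed relevant here; adjust to the machine's own setup).
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # First nvcc found on PATH, or None if there is none.
        "nvcc": shutil.which("nvcc"),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    # CUDA version the PyTorch build was compiled against (None for CPU builds).
    report["torch_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    # cuDNN version PyTorch reports, queried only when CUDA is usable.
    report["cudnn"] = (
        torch.backends.cudnn.version() if report["cuda_available"] else None
    )
    return report


if __name__ == "__main__":
    for key, value in cuda_environment_report().items():
        print(f"{key}: {value}")
```

If the two reports differ (for example in `LD_LIBRARY_PATH` or in the reported cuDNN version), that would point towards the environment rather than the model code.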

The CUDNN error on execution by clearml-agent

Following execution of the experiment on the remote compute resource, the model trains for an iteration, and then fails with the following error.

Current run is terminating due to exception: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x564675040c30
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 16, 128, 28, 28, 
    strideA = 100352, 784, 28, 1, 
output: TensorDescriptor 0x564674fa4210
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 16, 128, 28, 28, 
    strideA = 100352, 784, 28, 1, 
weight: FilterDescriptor 0x564674fa1b60
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 128, 128, 3, 3, 
Pointer addresses: 
    input: 0x7f151ec40000
    output: 0x7f1518000000
    weight: 0x7f154cd2e400
Forward algorithm: 7

Engine run is terminating due to exception: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x564675040c30
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 16, 128, 28, 28, 
    strideA = 100352, 784, 28, 1, 
output: TensorDescriptor 0x564674fa4210
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 16, 128, 28, 28, 
    strideA = 100352, 784, 28, 1, 
weight: FilterDescriptor 0x564674fa1b60
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 128, 128, 3, 3, 
Pointer addresses: 
    input: 0x7f151ec40000
    output: 0x7f1518000000
    weight: 0x7f154cd2e400
Forward algorithm: 7

Traceback (most recent call last):
  File "train_clearml_pytorch_ignite_caltech_birds.py", line 104, in <module>
    trainer.run()
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/cub_tools/trainer.py", line 640, in run
    self.train_engine.run(self.train_loader, max_epochs=self.config.TRAIN.NUM_EPOCHS)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 702, in run
    return self._internal_run()
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 775, in _internal_run
    self._handle_exception(e)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
    raise e
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 745, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 850, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
    raise e
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 833, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/cub_tools/trainer.py", line 448, in train_step
    y_pred = self.model(x)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torchvision/models/resnet.py", line 249, in forward
    return self._forward_impl(x)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torchvision/models/resnet.py", line 238, in _forward_impl
    x = self.layer2(x)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torchvision/models/resnet.py", line 74, in forward
    out = self.conv2(out)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x564675040c30
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 16, 128, 28, 28, 
    strideA = 100352, 784, 28, 1, 
output: TensorDescriptor 0x564674fa4210
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 16, 128, 28, 28, 
    strideA = 100352, 784, 28, 1, 
weight: FilterDescriptor 0x564674fa1b60
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 128, 128, 3, 3, 
Pointer addresses: 
    input: 0x7f151ec40000
    output: 0x7f1518000000
    weight: 0x7f154cd2e400
Forward algorithm: 7

1621437626467 ecm-clearml-compute-gpu-001:0 DEBUG Process failed, exit code 1
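
The error message suggests trying the repro snippet on its own. A possible way to do that against the exact environment the agent built is to run the snippet with that venv's interpreter. This is a sketch under assumptions: the interpreter path is inferred from the traceback paths and the standard venv layout (`bin/python`), and `run_snippet` is a hypothetical helper, not part of clearml.

```python
import os
import subprocess

# Assumption: the agent-built venv uses the standard layout, so its
# interpreter is at <venv>/bin/python under the directory seen in the traceback.
AGENT_PYTHON = "/home/edmorris/.clearml/venvs-builds/3.8/bin/python"

# The repro snippet as printed in the cuDNN error, plus a final print.
REPRO_SOURCE = """\
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
print("repro finished without error")
"""


def run_snippet(python_executable, source, extra_env=None):
    """Run `source` with the given interpreter; return (exit code, stdout, stderr)."""
    env = dict(os.environ)
    env.update(extra_env or {})
    result = subprocess.run(
        [python_executable, "-c", source],
        capture_output=True,
        text=True,
        env=env,
    )
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    if os.path.exists(AGENT_PYTHON):
        # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, which
        # can make the failing call easier to locate.
        code, out, err = run_snippet(
            AGENT_PYTHON, REPRO_SOURCE, {"CUDA_LAUNCH_BLOCKING": "1"}
        )
        print("exit code:", code)
        print(out or err)
    else:
        print("agent venv interpreter not found:", AGENT_PYTHON)
```

If the snippet fails in the agent venv but succeeds in a conda environment on the same machine, the difference would lie in the environment rather than in the trainer code.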

clearml-agent environment setup logs

The log below shows the environment setup on the remote machine when the experiment is executed.

1621437371563 ecm-clearml-compute-gpu-001:0 INFO task 1fc27ce7118542ea8845d8963a71bff7 pulled from b6da68d36d9944d89d05bc778bfcb196 by worker ecm-clearml-compute-gpu-001:0

1621437376755 ecm-clearml-compute-gpu-001:0 DEBUG Current configuration (clearml_agent v1.0.0, location: /tmp/.clearml_agent.oieq9_34.cfg):
----------------------
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server = http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8008
api.web_server = http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8080
api.files_server = http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8081
api.credentials.access_key = A6NUAZ2MQ87B25LYQT89
api.host = http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8008
agent.worker_id = ecm-clearml-compute-gpu-001:0
agent.worker_name = ecm-clearml-compute-gpu-001
agent.force_git_ssh_protocol = false
agent.python_binary = 
agent.package_manager.type = pip
agent.package_manager.pip_version = <20.2
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.torch_nightly = false
agent.venvs_dir = /home/edmorris/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/edmorris/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/edmorris/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /home/edmorris/.clearml/pip-cache
agent.docker_apt_cache = /home/edmorris/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04
agent.enable_task_env = false
agent.git_user = ecm200@gmail.com
agent.cuda_version = 111
agent.cudnn_version = 81
agent.default_python = 3.8
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key = 
sdk.aws.s3.region = 
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri = 
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
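
As a quick sanity check, the `key = value` lines of a configuration dump like the one above can be parsed and compared with the CUDA tag of the PyTorch wheel the agent selects. The sketch below uses two hypothetical helpers (`parse_config_dump`, `wheel_cuda_tag`) and sample values copied from this log. It only confirms that the configured version and the wheel tag agree, not which CUDA or cuDNN libraries are loaded at run time.

```python
import re


def parse_config_dump(text):
    """Parse 'key = value' lines from a configuration dump into a dict."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            # Split on the first '=' only, so URLs and version specifiers
            # in the value are preserved.
            config[key.strip()] = value.strip()
    return config


def wheel_cuda_tag(requirement):
    """Return the digits of a '+cuNNN' local version tag, or None if absent."""
    match = re.search(r"\+cu(\d+)", requirement)
    return match.group(1) if match else None


# Sample values taken from the log in this issue.
sample_dump = """\
agent.package_manager.type = pip
agent.cuda_version = 111
agent.cudnn_version = 81
"""
config = parse_config_dump(sample_dump)
tag = wheel_cuda_tag("torch==1.8.1+cu111")

print("configured CUDA:", config["agent.cuda_version"])
print("wheel CUDA tag:", tag)
print("match:", config["agent.cuda_version"] == tag)
```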

Executing task id [1fc27ce7118542ea8845d8963a71bff7]:
repository = https://github.com/ecm200/caltech_birds.git
branch = clearml_integrations
version_num = b2e6741c9cc3c869fe36490ec39d165c5fba8c6c
tag = 
docker_cmd = None
entry_point = train_clearml_pytorch_ignite_caltech_birds.py
working_dir = scripts

created virtual environment CPython3.8.8.final.0-64 in 136ms
  creator CPython3Posix(dest=/home/edmorris/.clearml/venvs-builds/3.8, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/edmorris/.local/share/virtualenv)
    added seed packages: pip==21.1.1, setuptools==56.0.0, wheel==0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator

Using cached repository in "/home/edmorris/.clearml/vcs-cache/caltech_birds.git.c0f811d2350a4cd7faa8a3d3d8453cca/caltech_birds.git"
Note: checking out 'b2e6741c9cc3c869fe36490ec39d165c5fba8c6c'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at b2e6741 More updates for ClearML execution
type: git
url: https://github.com/ecm200/caltech_birds.git
branch: HEAD
commit: b2e6741c9cc3c869fe36490ec39d165c5fba8c6c
root: /home/edmorris/.clearml/venvs-builds/3.8/task_repository/caltech_birds.git
Applying uncommitted changes
Executing: ('git', 'apply', '--unidiff-zero'): b'<stdin>:60: trailing whitespace.\n    \nwarning: 1 line adds whitespace errors.\n'

1621437381702 ecm-clearml-compute-gpu-001:0 DEBUG Collecting pip<20.2
  Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.1
    Uninstalling pip-21.1.1:
      Successfully uninstalled pip-21.1.1
Successfully installed pip-20.1.1
Collecting Cython
  Using cached Cython-0.29.23-cp38-cp38-manylinux1_x86_64.whl (1.9 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.23
Collecting numpy==1.20.2
  Using cached numpy-1.20.2-cp38-cp38-manylinux2010_x86_64.whl (15.4 MB)
Installing collected packages: numpy

1621437386811 ecm-clearml-compute-gpu-001:0 DEBUG Successfully installed numpy-1.20.2
Torch CUDA 111 download page found
Found PyTorch version torch==1.8.1 matching CUDA version 111
Found PyTorch version torchvision==0.9.1 matching CUDA version 111
Collecting torch==1.8.1+cu111
  File was already downloaded /home/edmorris/.clearml/pip-download-cache/cu111/torch-1.8.1+cu111-cp38-cp38-linux_x86_64.whl
Successfully downloaded torch
Collecting torchvision==0.9.1+cu111
  File was already downloaded /home/edmorris/.clearml/pip-download-cache/cu111/torchvision-0.9.1+cu111-cp38-cp38-linux_x86_64.whl
Successfully downloaded torchvision
Collecting Pillow==8.2.0
  Using cached Pillow-8.2.0-cp38-cp38-manylinux1_x86_64.whl (3.0 MB)
Collecting captum==0.3.1
  Using cached captum-0.3.1-py3-none-any.whl (4.4 MB)
Processing /home/edmorris/.cache/pip/wheels/59/1b/52/0dea905f8278d5514dc4d0be5e251967f8681670cadd3dca89/imutils-0.5.4-py3-none-any.whl

1621437391868 ecm-clearml-compute-gpu-001:0 DEBUG Collecting matplotlib==3.3.4
  Using cached matplotlib-3.3.4-cp38-cp38-manylinux1_x86_64.whl (11.6 MB)
Requirement already satisfied: numpy==1.20.2 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from -r /tmp/cached-reqsypd2ylb1.txt (line 11)) (1.20.2)
Collecting pandas==1.2.4
  Using cached pandas-1.2.4-cp38-cp38-manylinux1_x86_64.whl (9.7 MB)
Collecting pytorch_ignite==0.4.4
  Using cached pytorch_ignite-0.4.4-py3-none-any.whl (200 kB)
Collecting pytorchcv==0.0.65
  Using cached pytorchcv-0.0.65-py2.py3-none-any.whl (527 kB)
Collecting scikit_image==0.18.1
  Using cached scikit_image-0.18.1-cp38-cp38-manylinux1_x86_64.whl (30.2 MB)
Collecting tensorboard==2.4.1
  Using cached tensorboard-2.4.1-py3-none-any.whl (10.6 MB)
Processing /home/edmorris/.clearml/pip-download-cache/cu111/torch-1.8.1+cu111-cp38-cp38-linux_x86_64.whl

1621437422073 ecm-clearml-compute-gpu-001:0 DEBUG Collecting torch_lucent==0.1.7
  Using cached torch_lucent-0.1.7-py3-none-any.whl (46 kB)
Processing /home/edmorris/.clearml/pip-download-cache/cu111/torchvision-0.9.1+cu111-cp38-cp38-linux_x86_64.whl
Collecting yacs==0.1.8
  Using cached yacs-0.1.8-py3-none-any.whl (14 kB)
Collecting clearml==1.0.2
  Using cached clearml-1.0.2-py2.py3-none-any.whl (990 kB)
Collecting cycler>=0.10
  Using cached cycler-0.10.0-py2.py3-none-any.whl (6.5 kB)
Collecting kiwisolver>=1.0.1
  Using cached kiwisolver-1.3.1-cp38-cp38-manylinux1_x86_64.whl (1.2 MB)
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3
  Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Collecting python-dateutil>=2.1
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting pytz>=2017.3
  Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Collecting requests
  Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting imageio>=2.3.0
  Using cached imageio-2.9.0-py3-none-any.whl (3.3 MB)
Collecting tifffile>=2019.7.26
  Using cached tifffile-2021.4.8-py3-none-any.whl (165 kB)
Collecting scipy>=1.0.1
  Using cached scipy-1.6.3-cp38-cp38-manylinux1_x86_64.whl (27.2 MB)
Collecting networkx>=2.0
  Using cached networkx-2.5.1-py3-none-any.whl (1.6 MB)
Collecting PyWavelets>=1.1.1
  Using cached PyWavelets-1.1.1-cp38-cp38-manylinux1_x86_64.whl (4.4 MB)
Collecting google-auth<2,>=1.6.3
  Using cached google_auth-1.30.0-py2.py3-none-any.whl (146 kB)

1621437427113 ecm-clearml-compute-gpu-001:0 DEBUG Collecting google-auth-oauthlib<0.5,>=0.4.1
  Using cached google_auth_oauthlib-0.4.4-py2.py3-none-any.whl (18 kB)

1621437437254 ecm-clearml-compute-gpu-001:0 DEBUG Collecting protobuf>=3.6.0
  Using cached protobuf-3.17.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)

1621437442312 ecm-clearml-compute-gpu-001:0 DEBUG Collecting six>=1.10.0
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)

1621437457446 ecm-clearml-compute-gpu-001:0 DEBUG Collecting grpcio>=1.24.3
  Using cached grpcio-1.37.1-cp38-cp38-manylinux2014_x86_64.whl (4.2 MB)
Collecting tensorboard-plugin-wit>=1.6.0
  Using cached tensorboard_plugin_wit-1.8.0-py3-none-any.whl (781 kB)

1621437462533 ecm-clearml-compute-gpu-001:0 DEBUG Collecting werkzeug>=0.11.15
  Using cached Werkzeug-2.0.1-py3-none-any.whl (288 kB)
Requirement already satisfied: wheel>=0.26; python_version >= "3" in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from tensorboard==2.4.1->-r /tmp/cached-reqsypd2ylb1.txt (line 16)) (0.36.2)
Requirement already satisfied: setuptools>=41.0.0 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from tensorboard==2.4.1->-r /tmp/cached-reqsypd2ylb1.txt (line 16)) (56.0.0)
Collecting markdown>=2.6.8
  Using cached Markdown-3.3.4-py3-none-any.whl (97 kB)
Collecting absl-py>=0.4
  Using cached absl_py-0.12.0-py3-none-any.whl (129 kB)
Collecting typing-extensions
  Using cached typing_extensions-3.10.0.0-py3-none-any.whl (26 kB)
Processing /home/edmorris/.cache/pip/wheels/8e/70/28/3d6ccd6e315f65f245da085482a2e1c7d14b90b30f239e2cf4/future-0.18.2-py3-none-any.whl
Collecting pytest
  Using cached pytest-6.2.4-py3-none-any.whl (280 kB)
Collecting scikit-learn
  Using cached scikit_learn-0.24.2-cp38-cp38-manylinux2010_x86_64.whl (24.9 MB)
Collecting coveralls
  Using cached coveralls-3.0.1-py2.py3-none-any.whl (13 kB)
Collecting decorator
  Using cached decorator-5.0.9-py3-none-any.whl (8.9 kB)
Collecting pytest-mock
  Using cached pytest_mock-3.6.1-py3-none-any.whl (12 kB)
Collecting coverage
  Using cached coverage-5.5-cp38-cp38-manylinux2010_x86_64.whl (245 kB)
Collecting kornia<=0.4.1
  Using cached kornia-0.4.1-py2.py3-none-any.whl (225 kB)
Collecting tqdm
  Using cached tqdm-4.60.0-py2.py3-none-any.whl (75 kB)
Collecting ipython
  Using cached ipython-7.23.1-py3-none-any.whl (785 kB)
Collecting PyYAML
  Using cached PyYAML-5.4.1-cp38-cp38-manylinux1_x86_64.whl (662 kB)
Collecting humanfriendly>=2.1
  Using cached humanfriendly-9.1-py2.py3-none-any.whl (86 kB)

1621437467650 ecm-clearml-compute-gpu-001:0 DEBUG Collecting psutil>=3.4.2
  Using cached psutil-5.8.0-cp38-cp38-manylinux2010_x86_64.whl (296 kB)
Collecting urllib3>=1.21.1
  Using cached urllib3-1.26.4-py2.py3-none-any.whl (153 kB)
Collecting furl>=2.0.0
  Using cached furl-2.1.2-py2.py3-none-any.whl (20 kB)
Collecting pathlib2>=2.3.0
  Using cached pathlib2-2.3.5-py2.py3-none-any.whl (18 kB)
Collecting jsonschema>=2.6.0
  Using cached jsonschema-3.2.0-py2.py3-none-any.whl (56 kB)
Collecting attrs>=18.0
  Using cached attrs-21.2.0-py2.py3-none-any.whl (53 kB)
Collecting pyjwt<3.0.0,>=1.6.4
  Using cached PyJWT-2.1.0-py3-none-any.whl (16 kB)
Collecting chardet<5,>=3.0.2
  Using cached chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2020.12.5-py2.py3-none-any.whl (147 kB)
Collecting idna<3,>=2.5
  Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting rsa<5,>=3.1.4; python_version >= "3.6"
  Using cached rsa-4.7.2-py3-none-any.whl (34 kB)
Collecting pyasn1-modules>=0.2.1
  Using cached pyasn1_modules-0.2.8-py2.py3-none-any.whl (155 kB)
Collecting cachetools<5.0,>=2.0.0
  Using cached cachetools-4.2.2-py3-none-any.whl (11 kB)
Collecting requests-oauthlib>=0.7.0
  Using cached requests_oauthlib-1.3.0-py2.py3-none-any.whl (23 kB)
Collecting iniconfig
  Using cached iniconfig-1.1.1-py2.py3-none-any.whl (5.0 kB)
Collecting pluggy<1.0.0a1,>=0.12
  Using cached pluggy-0.13.1-py2.py3-none-any.whl (18 kB)
Collecting py>=1.8.2
  Using cached py-1.10.0-py2.py3-none-any.whl (97 kB)
Collecting toml
  Using cached toml-0.10.2-py2.py3-none-any.whl (16 kB)
Collecting packaging
  Using cached packaging-20.9-py2.py3-none-any.whl (40 kB)
Collecting joblib>=0.11
  Using cached joblib-1.0.1-py3-none-any.whl (303 kB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Processing /home/edmorris/.cache/pip/wheels/56/ea/58/ead137b087d9e326852a851351d1debf4ada529b6ac0ec4e8c/docopt-0.6.2-py2.py3-none-any.whl
Collecting pygments
  Using cached Pygments-2.9.0-py3-none-any.whl (1.0 MB)
Collecting backcall
  Using cached backcall-0.2.0-py2.py3-none-any.whl (11 kB)
Collecting pickleshare
  Using cached pickleshare-0.7.5-py2.py3-none-any.whl (6.9 kB)
Collecting matplotlib-inline
  Using cached matplotlib_inline-0.1.2-py3-none-any.whl (8.2 kB)
Collecting jedi>=0.16
  Using cached jedi-0.18.0-py2.py3-none-any.whl (1.4 MB)
Collecting pexpect>4.3; sys_platform != "win32"
  Using cached pexpect-4.8.0-py2.py3-none-any.whl (59 kB)
Collecting prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0
  Using cached prompt_toolkit-3.0.18-py3-none-any.whl (367 kB)
Collecting traitlets>=4.2
  Using cached traitlets-5.0.5-py3-none-any.whl (100 kB)
Collecting orderedmultidict>=1.0.1
  Using cached orderedmultidict-1.0.1-py2.py3-none-any.whl (11 kB)
Processing /home/edmorris/.cache/pip/wheels/3d/22/08/7042eb6309c650c7b53615d5df5cc61f1ea9680e7edd3a08d2/pyrsistent-0.17.3-cp38-cp38-linux_x86_64.whl
Collecting pyasn1>=0.1.3
  Using cached pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
Collecting oauthlib>=3.0.0
  Using cached oauthlib-3.1.0-py2.py3-none-any.whl (147 kB)
Collecting parso<0.9.0,>=0.8.0
  Using cached parso-0.8.2-py2.py3-none-any.whl (94 kB)
Collecting ptyprocess>=0.5
  Using cached ptyprocess-0.7.0-py2.py3-none-any.whl (13 kB)
Collecting wcwidth
  Using cached wcwidth-0.2.5-py2.py3-none-any.whl (30 kB)
Collecting ipython-genutils
  Using cached ipython_genutils-0.2.0-py2.py3-none-any.whl (26 kB)
ERROR: networkx 2.5.1 has requirement decorator<5,>=4.3, but you'll have decorator 5.0.9 which is incompatible.
Installing collected packages: Pillow, six, cycler, kiwisolver, pyparsing, python-dateutil, matplotlib, typing-extensions, torch, captum, imutils, pytz, pandas, pytorch-ignite, urllib3, chardet, certifi, idna, requests, pytorchcv, imageio, tifffile, scipy, decorator, networkx, PyWavelets, scikit-image, pyasn1, rsa, pyasn1-modules, cachetools, google-auth, oauthlib, requests-oauthlib, google-auth-oauthlib, protobuf, grpcio, tensorboard-plugin-wit, werkzeug, markdown, absl-py, tensorboard, future, iniconfig, pluggy, py, toml, attrs, packaging, pytest, joblib, threadpoolctl, scikit-learn, docopt, coverage, coveralls, pytest-mock, kornia, tqdm, torchvision, pygments, backcall, pickleshare, ipython-genutils, traitlets, matplotlib-inline, parso, jedi, ptyprocess, pexpect, wcwidth, prompt-toolkit, ipython, torch-lucent, PyYAML, yacs, humanfriendly, psutil, orderedmultidict, furl, pathlib2, pyrsistent, jsonschema, pyjwt, clearml

1621437522917 ecm-clearml-compute-gpu-001:0 DEBUG Successfully installed Pillow-8.2.0 PyWavelets-1.1.1 PyYAML-5.4.1 absl-py-0.12.0 attrs-21.2.0 backcall-0.2.0 cachetools-4.2.2 captum-0.3.1 certifi-2020.12.5 chardet-4.0.0 clearml-1.0.2 coverage-5.5 coveralls-3.0.1 cycler-0.10.0 decorator-5.0.9 docopt-0.6.2 furl-2.1.2 future-0.18.2 google-auth-1.30.0 google-auth-oauthlib-0.4.4 grpcio-1.37.1 humanfriendly-9.1 idna-2.10 imageio-2.9.0 imutils-0.5.4 iniconfig-1.1.1 ipython-7.23.1 ipython-genutils-0.2.0 jedi-0.18.0 joblib-1.0.1 jsonschema-3.2.0 kiwisolver-1.3.1 kornia-0.4.1 markdown-3.3.4 matplotlib-3.3.4 matplotlib-inline-0.1.2 networkx-2.5.1 oauthlib-3.1.0 orderedmultidict-1.0.1 packaging-20.9 pandas-1.2.4 parso-0.8.2 pathlib2-2.3.5 pexpect-4.8.0 pickleshare-0.7.5 pluggy-0.13.1 prompt-toolkit-3.0.18 protobuf-3.17.0 psutil-5.8.0 ptyprocess-0.7.0 py-1.10.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 pygments-2.9.0 pyjwt-2.1.0 pyparsing-2.4.7 pyrsistent-0.17.3 pytest-6.2.4 pytest-mock-3.6.1 python-dateutil-2.8.1 pytorch-ignite-0.4.4 pytorchcv-0.0.65 pytz-2021.1 requests-2.25.1 requests-oauthlib-1.3.0 rsa-4.7.2 scikit-image-0.18.1 scikit-learn-0.24.2 scipy-1.6.3 six-1.16.0 tensorboard-2.4.1 tensorboard-plugin-wit-1.8.0 threadpoolctl-2.1.0 tifffile-2021.4.8 toml-0.10.2 torch-1.8.1+cu111 torch-lucent-0.1.7 torchvision-0.9.1+cu111 tqdm-4.60.0 traitlets-5.0.5 typing-extensions-3.10.0.0 urllib3-1.26.4 wcwidth-0.2.5 werkzeug-2.0.1 yacs-0.1.8
Processing ./cub_tools
Requirement already satisfied: imutils in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from cub-tools==1.0.0) (0.5.4)
Requirement already satisfied: pandas in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from cub-tools==1.0.0) (1.2.4)
Requirement already satisfied: matplotlib in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from cub-tools==1.0.0) (3.3.4)
Requirement already satisfied: numpy in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from cub-tools==1.0.0) (1.20.2)
Requirement already satisfied: torch-lucent in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from cub-tools==1.0.0) (0.1.7)
Requirement already satisfied: pytorchcv in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from cub-tools==1.0.0) (0.0.65)
Requirement already satisfied: scikit-image in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from cub-tools==1.0.0) (0.18.1)
Requirement already satisfied: Pillow in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from cub-tools==1.0.0) (8.2.0)
Requirement already satisfied: pytz>=2017.3 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from pandas->cub-tools==1.0.0) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from pandas->cub-tools==1.0.0) (2.8.1)
Requirement already satisfied: cycler>=0.10 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from matplotlib->cub-tools==1.0.0) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from matplotlib->cub-tools==1.0.0) (2.4.7)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from matplotlib->cub-tools==1.0.0) (1.3.1)
Requirement already satisfied: future in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (0.18.2)
Requirement already satisfied: coveralls in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (3.0.1)
Requirement already satisfied: coverage in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (5.5)
Requirement already satisfied: kornia<=0.4.1 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (0.4.1)
Requirement already satisfied: pytest in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (6.2.4)

1621437528026 ecm-clearml-compute-gpu-001:0 DEBUG Requirement already satisfied: pytest-mock in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (3.6.1)
Requirement already satisfied: scikit-learn in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (0.24.2)
Requirement already satisfied: torchvision in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (0.9.1+cu111)
Requirement already satisfied: decorator in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (5.0.9)
Requirement already satisfied: tqdm in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (4.60.0)
Requirement already satisfied: ipython in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (7.23.1)
Requirement already satisfied: torch>=1.5.0 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch-lucent->cub-tools==1.0.0) (1.8.1+cu111)
Requirement already satisfied: requests in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from pytorchcv->cub-tools==1.0.0) (2.25.1)
Requirement already satisfied: imageio>=2.3.0 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from scikit-image->cub-tools==1.0.0) (2.9.0)
Requirement already satisfied: PyWavelets>=1.1.1 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from scikit-image->cub-tools==1.0.0) (1.1.1)
Requirement already satisfied: tifffile>=2019.7.26 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from scikit-image->cub-tools==1.0.0) (2021.4.8)
Requirement already satisfied: networkx>=2.0 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from scikit-image->cub-tools==1.0.0) (2.5.1)
Requirement already satisfied: scipy>=1.0.1 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from scikit-image->cub-tools==1.0.0) (1.6.3)
Requirement already satisfied: six>=1.5 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->cub-tools==1.0.0) (1.16.0)
Requirement already satisfied: docopt>=0.6.1 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from coveralls->torch-lucent->cub-tools==1.0.0) (0.6.2)
Requirement already satisfied: attrs>=19.2.0 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from pytest->torch-lucent->cub-tools==1.0.0) (21.2.0)
Requirement already satisfied: toml in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from pytest->torch-lucent->cub-tools==1.0.0) (0.10.2)
Requirement already satisfied: pluggy<1.0.0a1,>=0.12 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from pytest->torch-lucent->cub-tools==1.0.0) (0.13.1)
Requirement already satisfied: iniconfig in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from pytest->torch-lucent->cub-tools==1.0.0) (1.1.1)
Requirement already satisfied: py>=1.8.2 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from pytest->torch-lucent->cub-tools==1.0.0) (1.10.0)
Requirement already satisfied: packaging in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from pytest->torch-lucent->cub-tools==1.0.0) (20.9)
Requirement already satisfied: joblib>=0.11 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from scikit-learn->torch-lucent->cub-tools==1.0.0) (1.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from scikit-learn->torch-lucent->cub-tools==1.0.0) (2.1.0)
Requirement already satisfied: pexpect>4.3; sys_platform != "win32" in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from ipython->torch-lucent->cub-tools==1.0.0) (4.8.0)
Requirement already satisfied: pickleshare in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from ipython->torch-lucent->cub-tools==1.0.0) (0.7.5)
Requirement already satisfied: matplotlib-inline in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from ipython->torch-lucent->cub-tools==1.0.0) (0.1.2)
Requirement already satisfied: pygments in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from ipython->torch-lucent->cub-tools==1.0.0) (2.9.0)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from ipython->torch-lucent->cub-tools==1.0.0) (3.0.18)
Requirement already satisfied: backcall in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from ipython->torch-lucent->cub-tools==1.0.0) (0.2.0)
Requirement already satisfied: traitlets>=4.2 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from ipython->torch-lucent->cub-tools==1.0.0) (5.0.5)
Requirement already satisfied: setuptools>=18.5 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from ipython->torch-lucent->cub-tools==1.0.0) (56.0.0)
Requirement already satisfied: jedi>=0.16 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from ipython->torch-lucent->cub-tools==1.0.0) (0.18.0)
Requirement already satisfied: typing-extensions in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch>=1.5.0->torch-lucent->cub-tools==1.0.0) (3.10.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from requests->pytorchcv->cub-tools==1.0.0) (2020.12.5)
Requirement already satisfied: chardet<5,>=3.0.2 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from requests->pytorchcv->cub-tools==1.0.0) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from requests->pytorchcv->cub-tools==1.0.0) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from requests->pytorchcv->cub-tools==1.0.0) (1.26.4)
Requirement already satisfied: ptyprocess>=0.5 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from pexpect>4.3; sys_platform != "win32"->ipython->torch-lucent->cub-tools==1.0.0) (0.7.0)
Requirement already satisfied: wcwidth in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython->torch-lucent->cub-tools==1.0.0) (0.2.5)
Requirement already satisfied: ipython-genutils in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from traitlets>=4.2->ipython->torch-lucent->cub-tools==1.0.0) (0.2.0)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from jedi>=0.16->ipython->torch-lucent->cub-tools==1.0.0) (0.8.2)
Building wheels for collected packages: cub-tools
  Building wheel for cub-tools (setup.py): started
  Building wheel for cub-tools (setup.py): finished with status 'done'
  Created wheel for cub-tools: filename=cub_tools-1.0.0-py3-none-any.whl size=23095 sha256=6b03842a7c470cab541b2ce3fba09ac1e45d0caf8c626e27ecc467c261809152
  Stored in directory: /tmp/pip-ephem-wheel-cache-jrjdo6km/wheels/57/1d/f4/bf063c8201c0f9be0ad6c491d4653eec58dbde40f0961fc339
Successfully built cub-tools
Installing collected packages: cub-tools
Successfully installed cub-tools-1.0.0
Replacing original pip vcs 'git+https://github.com/rwightman/pytorch-image-models.git' with 'git+https://ecm200%40gmail.com:xxxxxx@github.com/rwightman/pytorch-image-models.git'
Collecting git+https://ecm200%40gmail.com:****@github.com/rwightman/pytorch-image-models.git
  Cloning https://ecm200%40gmail.com:****@github.com/rwightman/pytorch-image-models.git to /tmp/pip-req-build-jcs909cs
  Running command git clone -q 'https://ecm200%40gmail.com:****@github.com/rwightman/pytorch-image-models.git' /tmp/pip-req-build-jcs909cs
Requirement already satisfied: torch>=1.4 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from timm==0.4.9) (1.8.1+cu111)
Requirement already satisfied: torchvision in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from timm==0.4.9) (0.9.1+cu111)
Requirement already satisfied: typing-extensions in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch>=1.4->timm==0.4.9) (3.10.0.0)
Requirement already satisfied: numpy in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torch>=1.4->timm==0.4.9) (1.20.2)
Requirement already satisfied: pillow>=4.1.1 in /home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (from torchvision->timm==0.4.9) (8.2.0)
Building wheels for collected packages: timm
  Building wheel for timm (setup.py): started

1621437535859 ecm-clearml-compute-gpu-001:0 DEBUG   Building wheel for timm (setup.py): finished with status 'done'
  Created wheel for timm: filename=timm-0.4.9-py3-none-any.whl size=346137 sha256=f487f72a8c6bcee4a149cfd73ed2ade3fd1b326d78d0b6874d867575c223fee9
  Stored in directory: /tmp/pip-ephem-wheel-cache-ev14df7u/wheels/91/92/6f/483882dce58f372566b25305bc184b4e8235968437833cff0d
Successfully built timm
Installing collected packages: timm

1621437546070 ecm-clearml-compute-gpu-001:0 DEBUG Successfully installed timm-0.4.9
Adding venv into cache: /home/edmorris/.clearml/venvs-builds/3.8
Running task id [1fc27ce7118542ea8845d8963a71bff7]:
[scripts]$ /home/edmorris/.clearml/venvs-builds/3.8/bin/python -u train_clearml_pytorch_ignite_caltech_birds.py
Summary - installed python packages:
pip:
- absl-py==0.12.0
- attrs==21.2.0
- backcall==0.2.0
- cachetools==4.2.2
- captum==0.3.1
- certifi==2020.12.5
- chardet==4.0.0
- clearml==1.0.2
- coverage==5.5
- coveralls==3.0.1
- cycler==0.10.0
- Cython==0.29.23
- decorator==5.0.9
- docopt==0.6.2
- furl==2.1.2
- future==0.18.2
- google-auth==1.30.0
- google-auth-oauthlib==0.4.4
- grpcio==1.37.1
- humanfriendly==9.1
- idna==2.10
- imageio==2.9.0
- imutils==0.5.4
- iniconfig==1.1.1
- ipython==7.23.1
- ipython-genutils==0.2.0
- jedi==0.18.0
- joblib==1.0.1
- jsonschema==3.2.0
- kiwisolver==1.3.1
- kornia==0.4.1
- Markdown==3.3.4
- matplotlib==3.3.4
- matplotlib-inline==0.1.2
- networkx==2.5.1
- numpy==1.20.2
- oauthlib==3.1.0
- orderedmultidict==1.0.1
- packaging==20.9
- pandas==1.2.4
- parso==0.8.2
- pathlib2==2.3.5
- pexpect==4.8.0
- pickleshare==0.7.5
- Pillow==8.2.0
- pluggy==0.13.1
- prompt-toolkit==3.0.18
- protobuf==3.17.0
- psutil==5.8.0
- ptyprocess==0.7.0
- py==1.10.0
- pyasn1==0.4.8
- pyasn1-modules==0.2.8
- Pygments==2.9.0
- PyJWT==2.1.0
- pyparsing==2.4.7
- pyrsistent==0.17.3
- pytest==6.2.4
- pytest-mock==3.6.1
- python-dateutil==2.8.1
- pytorch-ignite==0.4.4
- pytorchcv==0.0.65
- pytz==2021.1
- PyWavelets==1.1.1
- PyYAML==5.4.1
- requests==2.25.1
- requests-oauthlib==1.3.0
- rsa==4.7.2
- scikit-image==0.18.1
- scikit-learn==0.24.2
- scipy==1.6.3
- six==1.16.0
- tensorboard==2.4.1
- tensorboard-plugin-wit==1.8.0
- threadpoolctl==2.1.0
- tifffile==2021.4.8
- timm @ git+https://github.com/rwightman/pytorch-image-models.git@e7f0db866412b9ae61332c205270c9fc0ef5083c
- toml==0.10.2
- 'torch==1.8.1 # https://download.pytorch.org/whl/cu111/torch-1.8.1%2Bcu111-cp38-cp38-linux_x86_64.whl'
- torch-lucent==0.1.7
- 'torchvision==0.9.1 # https://download.pytorch.org/whl/cu111/torchvision-0.9.1%2Bcu111-cp38-cp38-linux_x86_64.whl'
- tqdm==4.60.0
- traitlets==5.0.5
- typing-extensions==3.10.0.0
- urllib3==1.26.4
- wcwidth==0.2.5
- Werkzeug==2.0.1
- yacs==0.1.8
- ./cub_tools

Environment setup completed successfully

Starting Task Execution:

usage: train_clearml_pytorch_ignite_caltech_birds.py [-h] [--config FILE]
                                                     [--opts ...]

PyTorch Image Classification Trainer - Ed Morris (c) 2021

optional arguments:
  -h, --help     show this help message and exit
  --config FILE  Path and name of configuration file for training. Should be a
                 .yaml file.
  --opts ...     Modify config options using the command-line 'KEY VALUE'
                 pairs

1621437551101 ecm-clearml-compute-gpu-001:0 DEBUG ClearML results page: http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8080/projects/30034b3199e24123896c8eff9bf16d29/experiments/1fc27ce7118542ea8845d8963a71bff7/output/log
{'MODEL.MODEL_LIBRARY': 'torchvision', 'MODEL.MODEL_NAME': 'resnet34', 'MODEL.PRETRAINED': True, 'MODEL.WITH_AMP': False, 'MODEL.WITH_GRAD_SCALE': False, 'TRAIN.BATCH_SIZE': 16, 'TRAIN.NUM_WORKERS': 4, 'TRAIN.NUM_EPOCHS': 40, 'TRAIN.LOSS.CRITERION': 'CrossEntropy', 'TRAIN.OPTIMIZER.TYPE': 'SGD', 'TRAIN.OPTIMIZER.PARAMS.lr': 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum': 0.9, 'TRAIN.OPTIMIZER.PARAMS.nesterov': True, 'TRAIN.SCHEDULER.TYPE': 'StepLR', 'TRAIN.SCHEDULER.PARAMS.step_size': 7, 'TRAIN.SCHEDULER.PARAMS.gamma': 0.1, 'EARLY_STOPPING_PATIENCE': 5, 'DIRS.ROOT_DIR': '/home/edmorris/projects/image_classification/caltech_birds', 'DIRS.WORKING_DIR': 'models/classification', 'DIRS.CLEAN_UP': True, 'DATA.DATA_DIR': 'data/images', 'DATA.TRAIN_DIR': 'train', 'DATA.TEST_DIR': 'test', 'DATA.NUM_CLASSES': 200, 'DATA.TRANSFORMS.TYPE': 'default', 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_crop_size': 224, 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_resize': 256, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.type': 'all', 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.persp_distortion_scale': 0.25, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.rotation_range': (-10.0, 10.0), 'SYSTEM.LOG_HISTORY': True}
[INFO] Getting a local copy of the CUB200 birds datasets
[INFO] Default location of training dataset:: http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8081
[INFO] Default location of training dataset:: /home/edmorris/.clearml/cache/storage_manager/datasets/ds_0ccff21334e84b3d8e0618c5f1734cc8
[INFO] Default location of testing dataset:: http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8081
[INFO] Default location of testing dataset:: /home/edmorris/.clearml/cache/storage_manager/datasets/ds_b435c4ffda374bca83d9a746137dc3ca
[INFO] Task output destination:: None
[INFO] Final parameter list passed to Trainer object:: ['MODEL.MODEL_LIBRARY', 'torchvision', 'MODEL.MODEL_NAME', 'resnet34', 'MODEL.PRETRAINED', True, 'MODEL.WITH_AMP', False, 'MODEL.WITH_GRAD_SCALE', False, 'TRAIN.BATCH_SIZE', 16, 'TRAIN.NUM_WORKERS', 4, 'TRAIN.NUM_EPOCHS', 40, 'TRAIN.LOSS.CRITERION', 'CrossEntropy', 'TRAIN.OPTIMIZER.TYPE', 'SGD', 'TRAIN.OPTIMIZER.PARAMS.lr', 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum', 0.9, 'TRAIN.OPTIMIZER.PARAMS.nesterov', True, 'TRAIN.SCHEDULER.TYPE', 'StepLR', 'TRAIN.SCHEDULER.PARAMS.step_size', 7, 'TRAIN.SCHEDULER.PARAMS.gamma', 0.1, 'EARLY_STOPPING_PATIENCE', 5, 'DIRS.ROOT_DIR', '/home/edmorris/projects/image_classification/caltech_birds', 'DIRS.WORKING_DIR', 'models/classification', 'DIRS.CLEAN_UP', True, 'DATA.DATA_DIR', 'data/images', 'DATA.TRAIN_DIR', 'train', 'DATA.TEST_DIR', 'test', 'DATA.NUM_CLASSES', 200, 'DATA.TRANSFORMS.TYPE', 'default', 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_crop_size', 224, 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_resize', 256, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.type', 'all', 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.persp_distortion_scale', 0.25, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.rotation_range', (-10.0, 10.0), 'SYSTEM.LOG_HISTORY', True, 'DIRS.ROOT_DIR', '', 'DATA.DATA_DIR', '/home/edmorris/.clearml/cache/storage_manager/datasets', 'DATA.TRAIN_DIR', 'ds_0ccff21334e84b3d8e0618c5f1734cc8', 'DATA.TEST_DIR', 'ds_b435c4ffda374bca83d9a746137dc3ca', 'DIRS.WORKING_DIR', '/home/edmorris/.clearml/cache/1fc27ce7118542ea8845d8963a71bff7']
[INFO] Parameters Override:: ['MODEL.MODEL_LIBRARY', 'torchvision', 'MODEL.MODEL_NAME', 'resnet34', 'MODEL.PRETRAINED', True, 'MODEL.WITH_AMP', False, 'MODEL.WITH_GRAD_SCALE', False, 'TRAIN.BATCH_SIZE', 16, 'TRAIN.NUM_WORKERS', 4, 'TRAIN.NUM_EPOCHS', 40, 'TRAIN.LOSS.CRITERION', 'CrossEntropy', 'TRAIN.OPTIMIZER.TYPE', 'SGD', 'TRAIN.OPTIMIZER.PARAMS.lr', 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum', 0.9, 'TRAIN.OPTIMIZER.PARAMS.nesterov', True, 'TRAIN.SCHEDULER.TYPE', 'StepLR', 'TRAIN.SCHEDULER.PARAMS.step_size', 7, 'TRAIN.SCHEDULER.PARAMS.gamma', 0.1, 'EARLY_STOPPING_PATIENCE', 5, 'DIRS.ROOT_DIR', '/home/edmorris/projects/image_classification/caltech_birds', 'DIRS.WORKING_DIR', 'models/classification', 'DIRS.CLEAN_UP', True, 'DATA.DATA_DIR', 'data/images', 'DATA.TRAIN_DIR', 'train', 'DATA.TEST_DIR', 'test', 'DATA.NUM_CLASSES', 200, 'DATA.TRANSFORMS.TYPE', 'default', 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_crop_size', 224, 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_resize', 256, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.type', 'all', 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.persp_distortion_scale', 0.25, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.rotation_range', (-10.0, 10.0), 'SYSTEM.LOG_HISTORY', True, 'DIRS.ROOT_DIR', '', 'DATA.DATA_DIR', '/home/edmorris/.clearml/cache/storage_manager/datasets', 'DATA.TRAIN_DIR', 'ds_0ccff21334e84b3d8e0618c5f1734cc8', 'DATA.TEST_DIR', 'ds_b435c4ffda374bca83d9a746137dc3ca', 'DIRS.WORKING_DIR', '/home/edmorris/.clearml/cache/1fc27ce7118542ea8845d8963a71bff7']
DATA:
  DATA_DIR: /home/edmorris/.clearml/cache/storage_manager/datasets
  NUM_CLASSES: 200
  TEST_DIR: ds_b435c4ffda374bca83d9a746137dc3ca
  TRAIN_DIR: ds_0ccff21334e84b3d8e0618c5f1734cc8
  TRANSFORMS:
    PARAMS:
      AGGRESIVE:
        persp_distortion_scale: 0.25
        rotation_range: (-10.0, 10.0)
        type: all
      DEFAULT:
        img_crop_size: 224
        img_resize: 256
    TYPE: default
DIRS:
  CLEAN_UP: True
  ROOT_DIR: 
  WORKING_DIR: /home/edmorris/.clearml/cache/1fc27ce7118542ea8845d8963a71bff7/ignite_resnet34
EARLY_STOPPING_PATIENCE: 5
MODEL:
  MODEL_LIBRARY: torchvision
  MODEL_NAME: resnet34
  PRETRAINED: True
  WITH_AMP: False
  WITH_GRAD_SCALE: False
SYSTEM:
  LOG_HISTORY: True
TRAIN:
  BATCH_SIZE: 16
  LOSS:
    CRITERION: CrossEntropy
  NUM_EPOCHS: 40
  NUM_WORKERS: 4
  OPTIMIZER:
    PARAMS:
      lr: 0.001
      momentum: 0.9
      nesterov: True
    TYPE: SGD
  SCHEDULER:
    PARAMS:
      gamma: 0.1
      step_size: 7
    TYPE: StepLR
[INFO] Creating data transforms...
[INFO] Creating data loaders...
***********************************************
**            DATASET SUMMARY                **
***********************************************
train  size::  5994  images
test  size::  5794  images
Number of classes::  200
***********************************************
[INFO] Created data loaders.
[INFO] Creating the model...
2021-05-19 15:19:07,332 - clearml.model - INFO - Selected model id: 8df52efca2684e5f8b727fa928623a82
[INFO] Successfully created model and pushed it to the device cuda:0
[INFO] Creating optimizer...
[INFO] Successfully created optimizer object.
[INFO] Successfully created learning rate scheduler object.
[INFO] Trainer pass OK for training.
Tensorboard Logging...done
[INFO] Creating callback functions for training loop...Early Stopping (5 epochs)...Model Checkpointing...Done
[INFO] Executing model training...

1621437621593 ecm-clearml-compute-gpu-001:0 DEBUG Epoch: 0001  TrAcc: 0.296 ValAcc: 0.005 TrPrec: 0.393 ValPrec: 0.000 TrRec: 0.296 ValRec: 0.005 TrF1: 0.262 ValF1: 0.000 TrTopK: 0.613 ValTopK: 0.026 TrLoss: 3.506 ValLoss: 5.299
Current run is terminating due to exception: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

PyTorch Ignite training script

Below is the trainer script, which has been modified to run with clearml. A version of this script without the clearml interface has successfully trained these models on the compute server in a conda environment.

from __future__ import print_function, division
import os, pathlib

# Clear ML experiment
from clearml import Task, StorageManager, Dataset

# Local modules
from cub_tools.trainer import Ignite_Trainer
from cub_tools.args import get_parser
from cub_tools.config import get_cfg_defaults

# Get the arguments from the command line, including configuration file and any overrides.
parser = get_parser()
parser.print_help()
args = parser.parse_args()

#print('[INFO] Optional Arguments from CLI:: {}'.format(args.opts))
#if args.opts == '[]':
#    args.opts = list()
#    print('[INFO] Setting empty CLI args to an explicit empty list')

## CLEAR ML
# Tmp config load for network name
cfg = get_cfg_defaults()
cfg.merge_from_file(args.config)
# Add the local python package and timm as requirements.
# NOTE: Task.add_requirements() must be called before Task.init(), otherwise the requirement is ignored.
Task.add_requirements('./cub_tools')
Task.add_requirements('git+https://github.com/rwightman/pytorch-image-models.git')
# Connecting with the ClearML process
task = Task.init(project_name='Caltech Birds', task_name='Train PyTorch CNN on CUB200 using Ignite [Library: '+cfg.MODEL.MODEL_LIBRARY+', Network: '+cfg.MODEL.MODEL_NAME+']', task_type=Task.TaskTypes.training)
# Setup ability to add configuration parameters control.
params = {'TRAIN.NUM_EPOCHS': 20, 'TRAIN.BATCH_SIZE': 32, 'TRAIN.OPTIMIZER.PARAMS.lr': 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum': 0.9}
params = task.connect(params)  # enabling configuration override by clearml
print(params)  # printing actual configuration (after override in remote mode)
# Convert Params dictionary into a set of key value pairs in a list
params_list = []
for key in params:
    params_list.extend([key,params[key]])

# Execute task remotely
task.execute_remotely()

# Get the dataset from the clearml-server and cache locally.
print('[INFO] Getting a local copy of the CUB200 birds datasets')
# Train
train_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_train_dataset')
#train_dataset.get_mutable_local_copy(target_folder='./data/images/train')
print('[INFO] Default location of training dataset:: {}'.format(train_dataset.get_default_storage()))
train_dataset_base = train_dataset.get_local_copy()
print('[INFO] Default location of training dataset:: {}'.format(train_dataset_base))

# Test
test_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_test_dataset')
#test_dataset.get_mutable_local_copy(target_folder='./data/images/test')
print('[INFO] Default location of testing dataset:: {}'.format(test_dataset.get_default_storage()))
test_dataset_base = test_dataset.get_local_copy()
print('[INFO] Default location of testing dataset:: {}'.format(test_dataset_base))

# Amend the input data directories and output directories for remote execution
# Modify experiment root dir
params_list = params_list + ['DIRS.ROOT_DIR', '']
# Add data root dir
params_list = params_list + ['DATA.DATA_DIR', str(pathlib.PurePath(train_dataset_base).parent)]
# Add data train dir
params_list = params_list + ['DATA.TRAIN_DIR', str(pathlib.PurePath(train_dataset_base).name)]
# Add data test dir
params_list = params_list + ['DATA.TEST_DIR', str(pathlib.PurePath(test_dataset_base).name)]
# Add working dir
params_list = params_list + ['DIRS.WORKING_DIR', str(task.cache_dir)]
print('[INFO] Task output destination:: {}'.format(task.get_output_destination()))

print('[INFO] Final parameter list passed to Trainer object:: {}'.format(params_list))

# Create the trainer object
trainer = Ignite_Trainer(config=args.config, cmd_args=params_list) # NOTE: disabled cmd line argument passing but using it to pass ClearML configs.

# Setup the data transformers
print('[INFO] Creating data transforms...')
trainer.create_datatransforms()

# Setup the dataloaders
print('[INFO] Creating data loaders...')
trainer.create_dataloaders()

# Setup the model
print('[INFO] Creating the model...')
trainer.create_model()

# Setup the optimizer
print('[INFO] Creating optimizer...')
trainer.create_optimizer()

# Setup the scheduler
trainer.create_scheduler()

# Train the model
trainer.run()

bmartinn commented 3 years ago

Hi @ecm200, the error "cuDNN error: CUDNN_STATUS_EXECUTION_FAILED" looks like a CUDA/cuDNN mismatch. From the log, clearml-agent installed the correct PyTorch version (based on the auto-detected CUDA 11.1). Is this the same setup that worked on your development machine? (Basically, I suspect this is not a direct clearml issue but a CUDA/PyTorch one.)

BTW: Running the clearml-agent in docker mode would solve such issues, as you would be able to launch the code inside a container with the correct CUDA support.
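
To check for such a mismatch, it may help to print what PyTorch itself reports in each environment (the agent-built venv and the working one) and compare the two. The snippet below is only a minimal sketch: the helper name collect_gpu_env is made up for illustration, and it assumes nothing beyond the standard torch attributes it reads.

```python
import importlib


def collect_gpu_env():
    """Return a dict describing the CUDA/cuDNN stack that PyTorch reports.

    Hypothetical helper for illustration; run it with the interpreter of
    each environment you want to compare.
    """
    info = {"torch": None, "cuda_build": None, "cudnn": None,
            "cuda_available": False, "device": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info  # torch is not installed in this interpreter

    info["torch"] = torch.__version__
    # CUDA toolkit version the installed build was compiled against (None for CPU builds).
    info["cuda_build"] = torch.version.cuda
    # cuDNN version PyTorch loads, or None if cuDNN is unavailable.
    info["cudnn"] = torch.backends.cudnn.version()
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_gpu_env().items():
        print(f"{key:>15}: {value}")
```

If the two environments report different cuda_build or cudnn values, that would point at the installed PyTorch build rather than at clearml.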

ecm200 commented 3 years ago

Hi @bmartinn,

I think you're right; however, I think this boils down to how environments are built on the remote machine.

When I set up development environments on remote machines to work on them directly (i.e. not through ClearML), I default to conda for most packages and fall back to pip when a package is not available in the conda repositories.

Local execution of venv created by clearml-agent

I logged onto the remote compute resource and executed the code there directly, using the virtual environment created by the clearml-agent, and I get the same error.

Local execution in conda environment

I have also created a new conda environment on the same machine, installing most of the package dependencies with conda as described above, and executed the exact same code as before. It now runs fine, iterating and logging into clearml-server as expected.
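
Since the same code fails in one environment and works in the other, one way to narrow this down might be to diff the package lists of the two. The following is a rough sketch, not something taken from clearml: parse_freeze and diff_envs are hypothetical helper names, and the input is assumed to be `pip freeze` style text (it also tolerates the "- name==version" lines the agent prints in its package summary).

```python
def parse_freeze(text):
    """Map lower-cased package name to version from `pip freeze`-style lines."""
    packages = {}
    for line in text.splitlines():
        # Drop the "- " list marker and quotes used in the agent's summary.
        line = line.strip().lstrip("- ").strip("'")
        if "==" not in line:
            continue  # skip blank lines and VCS/path installs
        name, version = line.split("==", 1)
        # Discard any trailing "# <wheel url>" comment after the version.
        packages[name.strip().lower()] = version.split("#", 1)[0].strip()
    return packages


def diff_envs(a_text, b_text):
    """Return {package: (version_in_a, version_in_b)} for every difference."""
    a, b = parse_freeze(a_text), parse_freeze(b_text)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Made-up sample input; replace with the real output of each environment.
    agent_venv = "- numpy==1.20.2\n- tqdm==4.60.0\n"
    conda_env = "numpy==1.19.2\n"
    for name, (left, right) in diff_envs(agent_venv, conda_env).items():
        print(f"{name}: agent venv={left} conda env={right}")
```

A package missing from one side shows up with None for that side. Note that pip only sees Python packages, so libraries that conda installs outside of pip (for example a CUDA toolkit package) would need `conda list` to be compared as well.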

(py38_pytorch18) edmorris@ecm-clearml-compute-gpu-001:~/.clearml/venvs-builds/3.8/task_repository/caltech_birds.git/scripts$ python local_train_clearml_pytorch_ignite_caltech_birds.py --config configs/torchvision/resnet34_config.yaml
usage: local_train_clearml_pytorch_ignite_caltech_birds.py [-h] [--config FILE] [--opts ...]

PyTorch Image Classification Trainer - Ed Morris (c) 2021

optional arguments:
  -h, --help     show this help message and exit
  --config FILE  Path and name of configuration file for training. Should be a .yaml file.
  --opts ...     Modify config options using the command-line 'KEY VALUE' pairs
ClearML Task: overwriting (reusing) task id=ea8903a29bf443d5ab469f9c56c2a8b5
ClearML results page: http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8080/projects/30034b3199e24123896c8eff9bf16d29/experiments/ea8903a29bf443d5ab469f9c56c2a8b5/output/log
2021-05-20 09:29:55,856 - clearml.task - WARNING - Requirement ignored, Task.add_requirements() must be called before Task.init()
2021-05-20 09:29:55,861 - clearml.task - WARNING - Requirement ignored, Task.add_requirements() must be called before Task.init()
{'MODEL.MODEL_LIBRARY': 'torchvision', 'MODEL.MODEL_NAME': 'resnet34', 'MODEL.PRETRAINED': True, 'MODEL.WITH_AMP': False, 'MODEL.WITH_GRAD_SCALE': False, 'TRAIN.BATCH_SIZE': 16, 'TRAIN.NUM_WORKERS': 4, 'TRAIN.NUM_EPOCHS': 40, 'TRAIN.LOSS.CRITERION': 'CrossEntropy', 'TRAIN.OPTIMIZER.TYPE': 'SGD', 'TRAIN.OPTIMIZER.PARAMS.lr': 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum': 0.9, 'TRAIN.OPTIMIZER.PARAMS.nesterov': True, 'TRAIN.SCHEDULER.TYPE': 'StepLR', 'TRAIN.SCHEDULER.PARAMS.step_size': 7, 'TRAIN.SCHEDULER.PARAMS.gamma': 0.1, 'EARLY_STOPPING_PATIENCE': 5, 'DIRS.ROOT_DIR': '/home/edmorris/projects/image_classification/caltech_birds', 'DIRS.WORKING_DIR': 'models/classification', 'DIRS.CLEAN_UP': True, 'DATA.DATA_DIR': 'data/images', 'DATA.TRAIN_DIR': 'train', 'DATA.TEST_DIR': 'test', 'DATA.NUM_CLASSES': 200, 'DATA.TRANSFORMS.TYPE': 'default', 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_crop_size': 224, 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_resize': 256, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.type': 'all', 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.persp_distortion_scale': 0.25, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.rotation_range': (-10.0, 10.0), 'SYSTEM.LOG_HISTORY': True}
[INFO] Getting a local copy of the CUB200 birds datasets
[INFO] Default location of training dataset:: http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8081
[INFO] Default location of training dataset:: /home/edmorris/.clearml/cache/storage_manager/datasets/ds_0ccff21334e84b3d8e0618c5f1734cc8
[INFO] Default location of testing dataset:: http://ecm-clearml-server-001.westeurope.cloudapp.azure.com:8081
[INFO] Default location of testing dataset:: /home/edmorris/.clearml/cache/storage_manager/datasets/ds_b435c4ffda374bca83d9a746137dc3ca
[INFO] Task output destination::
[INFO] Final parameter list passed to Trainer object:: ['MODEL.MODEL_LIBRARY', 'torchvision', 'MODEL.MODEL_NAME', 'resnet34', 'MODEL.PRETRAINED', True, 'MODEL.WITH_AMP', False, 'MODEL.WITH_GRAD_SCALE', False, 'TRAIN.BATCH_SIZE', 16, 'TRAIN.NUM_WORKERS', 4, 'TRAIN.NUM_EPOCHS', 40, 'TRAIN.LOSS.CRITERION', 'CrossEntropy', 'TRAIN.OPTIMIZER.TYPE', 'SGD', 'TRAIN.OPTIMIZER.PARAMS.lr', 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum', 0.9, 'TRAIN.OPTIMIZER.PARAMS.nesterov', True, 'TRAIN.SCHEDULER.TYPE', 'StepLR', 'TRAIN.SCHEDULER.PARAMS.step_size', 7, 'TRAIN.SCHEDULER.PARAMS.gamma', 0.1, 'EARLY_STOPPING_PATIENCE', 5, 'DIRS.ROOT_DIR', '/home/edmorris/projects/image_classification/caltech_birds', 'DIRS.WORKING_DIR', 'models/classification', 'DIRS.CLEAN_UP', True, 'DATA.DATA_DIR', 'data/images', 'DATA.TRAIN_DIR', 'train', 'DATA.TEST_DIR', 'test', 'DATA.NUM_CLASSES', 200, 'DATA.TRANSFORMS.TYPE', 'default', 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_crop_size', 224, 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_resize', 256, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.type', 'all', 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.persp_distortion_scale', 0.25, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.rotation_range', (-10.0, 10.0), 'SYSTEM.LOG_HISTORY', True, 'DIRS.ROOT_DIR', '', 'DATA.DATA_DIR', '/home/edmorris/.clearml/cache/storage_manager/datasets', 'DATA.TRAIN_DIR', 'ds_0ccff21334e84b3d8e0618c5f1734cc8', 'DATA.TEST_DIR', 'ds_b435c4ffda374bca83d9a746137dc3ca', 'DIRS.WORKING_DIR', '/home/edmorris/.clearml/cache/ea8903a29bf443d5ab469f9c56c2a8b5']
[INFO] Parameters Override:: ['MODEL.MODEL_LIBRARY', 'torchvision', 'MODEL.MODEL_NAME', 'resnet34', 'MODEL.PRETRAINED', True, 'MODEL.WITH_AMP', False, 'MODEL.WITH_GRAD_SCALE', False, 'TRAIN.BATCH_SIZE', 16, 'TRAIN.NUM_WORKERS', 4, 'TRAIN.NUM_EPOCHS', 40, 'TRAIN.LOSS.CRITERION', 'CrossEntropy', 'TRAIN.OPTIMIZER.TYPE', 'SGD', 'TRAIN.OPTIMIZER.PARAMS.lr', 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum', 0.9, 'TRAIN.OPTIMIZER.PARAMS.nesterov', True, 'TRAIN.SCHEDULER.TYPE', 'StepLR', 'TRAIN.SCHEDULER.PARAMS.step_size', 7, 'TRAIN.SCHEDULER.PARAMS.gamma', 0.1, 'EARLY_STOPPING_PATIENCE', 5, 'DIRS.ROOT_DIR', '/home/edmorris/projects/image_classification/caltech_birds', 'DIRS.WORKING_DIR', 'models/classification', 'DIRS.CLEAN_UP', True, 'DATA.DATA_DIR', 'data/images', 'DATA.TRAIN_DIR', 'train', 'DATA.TEST_DIR', 'test', 'DATA.NUM_CLASSES', 200, 'DATA.TRANSFORMS.TYPE', 'default', 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_crop_size', 224, 'DATA.TRANSFORMS.PARAMS.DEFAULT.img_resize', 256, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.type', 'all', 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.persp_distortion_scale', 0.25, 'DATA.TRANSFORMS.PARAMS.AGGRESIVE.rotation_range', (-10.0, 10.0), 'SYSTEM.LOG_HISTORY', True, 'DIRS.ROOT_DIR', '', 'DATA.DATA_DIR', '/home/edmorris/.clearml/cache/storage_manager/datasets', 'DATA.TRAIN_DIR', 'ds_0ccff21334e84b3d8e0618c5f1734cc8', 'DATA.TEST_DIR', 'ds_b435c4ffda374bca83d9a746137dc3ca', 'DIRS.WORKING_DIR', '/home/edmorris/.clearml/cache/ea8903a29bf443d5ab469f9c56c2a8b5']
DATA:
  DATA_DIR: /home/edmorris/.clearml/cache/storage_manager/datasets
  NUM_CLASSES: 200
  TEST_DIR: ds_b435c4ffda374bca83d9a746137dc3ca
  TRAIN_DIR: ds_0ccff21334e84b3d8e0618c5f1734cc8
  TRANSFORMS:
    PARAMS:
      AGGRESIVE:
        persp_distortion_scale: 0.25
        rotation_range: (-10.0, 10.0)
        type: all
      DEFAULT:
        img_crop_size: 224
        img_resize: 256
    TYPE: default
DIRS:
  CLEAN_UP: True
  ROOT_DIR:
  WORKING_DIR: /home/edmorris/.clearml/cache/ea8903a29bf443d5ab469f9c56c2a8b5/ignite_resnet34
EARLY_STOPPING_PATIENCE: 5
MODEL:
  MODEL_LIBRARY: torchvision
  MODEL_NAME: resnet34
  PRETRAINED: True
  WITH_AMP: False
  WITH_GRAD_SCALE: False
SYSTEM:
  LOG_HISTORY: True
TRAIN:
  BATCH_SIZE: 16
  LOSS:
    CRITERION: CrossEntropy
  NUM_EPOCHS: 40
  NUM_WORKERS: 4
  OPTIMIZER:
    PARAMS:
      lr: 0.001
      momentum: 0.9
      nesterov: True
    TYPE: SGD
  SCHEDULER:
    PARAMS:
      gamma: 0.1
      step_size: 7
    TYPE: StepLR
[INFO] Creating data transforms...
[INFO] Creating data loaders...
***********************************************
**            DATASET SUMMARY                **
***********************************************
train  size::  5994  images
test  size::  5794  images
Number of classes::  200
***********************************************
[INFO] Created data loaders.
[INFO] Creating the model...
2021-05-20 09:29:57,271 - clearml.model - INFO - Selected model id: 8df52efca2684e5f8b727fa928623a82
[INFO] Successfully created model and pushed it to the device cuda:0
[INFO] Creating optimizer...
[INFO] Successfully created optimizer object.
[INFO] Successfully created learning rate scheduler object.
[INFO] Trainer pass OK for training.
Tensorboard Logging...done
[INFO] Creating callback functions for training loop...Early Stopping (5 epochs)...Model Checkpointing...Done
[INFO] Executing model training...
Epoch: 0001  TrAcc: 0.301 ValAcc: 0.298 TrPrec: 0.406 ValPrec: 0.380 TrRec: 0.301 ValRec: 0.302 TrF1: 0.267 ValF1: 0.258 TrTopK: 0.614 ValTopK: 0.647 TrLoss: 3.509 ValLoss: 3.293
Epoch: 0002  TrAcc: 0.480 ValAcc: 0.492 TrPrec: 0.573 ValPrec: 0.580 TrRec: 0.480 ValRec: 0.496 TrF1: 0.461 ValF1: 0.465 TrTopK: 0.786 ValTopK: 0.832 TrLoss: 2.349 ValLoss: 2.117
Epoch: 0003  TrAcc: 0.593 ValAcc: 0.606 TrPrec: 0.652 ValPrec: 0.656 TrRec: 0.593 ValRec: 0.609 TrF1: 0.583 ValF1: 0.590 TrTopK: 0.848 ValTopK: 0.889 TrLoss: 1.819 ValLoss: 1.569
Epoch: 0004  TrAcc: 0.661 ValAcc: 0.659 TrPrec: 0.705 ValPrec: 0.695 TrRec: 0.661 ValRec: 0.661 TrF1: 0.656 ValF1: 0.651 TrTopK: 0.869 ValTopK: 0.907 TrLoss: 1.493 ValLoss: 1.315
Epoch: 0005  TrAcc: 0.693 ValAcc: 0.688 TrPrec: 0.742 ValPrec: 0.726 TrRec: 0.693 ValRec: 0.692 TrF1: 0.692 ValF1: 0.682 TrTopK: 0.894 ValTopK: 0.926 TrLoss: 1.290 ValLoss: 1.173
Epoch: 0006  TrAcc: 0.726 ValAcc: 0.719 TrPrec: 0.763 ValPrec: 0.740 TrRec: 0.726 ValRec: 0.720 TrF1: 0.726 ValF1: 0.712 TrTopK: 0.901 ValTopK: 0.931 TrLoss: 1.176 ValLoss: 1.054
Epoch: 0007  TrAcc: 0.746 ValAcc: 0.728 TrPrec: 0.779 ValPrec: 0.743 TrRec: 0.746 ValRec: 0.730 TrF1: 0.747 ValF1: 0.722 TrTopK: 0.908 ValTopK: 0.938 TrLoss: 1.068 ValLoss: 0.975
Epoch: 0008  TrAcc: 0.781 ValAcc: 0.763 TrPrec: 0.796 ValPrec: 0.769 TrRec: 0.781 ValRec: 0.766 TrF1: 0.782 ValF1: 0.760 TrTopK: 0.924 ValTopK: 0.946 TrLoss: 0.944 ValLoss: 0.888
Epoch: 0009  TrAcc: 0.784 ValAcc: 0.770 TrPrec: 0.798 ValPrec: 0.774 TrRec: 0.784 ValRec: 0.772 TrF1: 0.786 ValF1: 0.768 TrTopK: 0.923 ValTopK: 0.948 TrLoss: 0.918 ValLoss: 0.864

So it does look like the problem lies with the PyTorch installation using PIP, as this is the difference between the environment derived by clearml-agent and the conda environment I created manually.

Questions about environment creation

So I am wondering: if you have a package selection that is mainly in Conda, with a few packages only available from PIP, how can this be handled?

I could export a YAML file from the Conda environment that now runs the code successfully and use it to recreate the environment, but how can clearml do that automatically when an experiment is cloned and executed?

ecm200 commented 3 years ago

@bmartinn

Update: I created another environment manually on the compute server, using CONDA to create the environment itself, but then installed all packages, including PyTorch, using PIP and NOT CONDA. I made sure the versions matched those picked up by the dependency map created by clearml.

Executing the code in this environment caused the same issue as in the environment created by clearml-agent, as both installed PyTorch using PIP.
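A quick way to compare the two environments is to print the build details of the PyTorch that each one actually imports. The following is a minimal sketch, not something taken from the thread: the helper name `torch_build_report` is hypothetical, and it only reads standard PyTorch attributes.

```python
import importlib.util


def torch_build_report():
    """Collect version details of the installed PyTorch build, if any.

    Hypothetical helper for comparing a pip-built and a conda-built environment.
    """
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this interpreter's environment.
        return {"torch": None}

    import torch

    return {
        "torch": torch.__version__,
        # CUDA version the binary was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # cuDNN version seen by PyTorch (None when cuDNN is unavailable).
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_build_report().items():
        print(f"{key}: {value}")
```

Running this once in the manually created conda environment and once in the environment built by the agent should show whether the two differ in the CUDA or cuDNN version bundled with PyTorch.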

So is there any way to use both CONDA and PIP, as I do when creating environments manually, installing most packages from CONDA and using PIP for whatever isn't available there?

The YAML file exported from a CONDA environment lists packages by source, either CONDA or PIP. The call is as follows, and should be run inside the environment you want to export:

conda env export > environment_specs.yml

This results in a YAML file as follows:

name: py38_pytorch18
channels:
  - pytorch
  - nvidia
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - absl-py=0.12.0=py38h06a4308_0
  - aiohttp=3.6.3=py38h7b6447c_0
  - async-timeout=3.0.1=py38h06a4308_0
  - attrs=21.2.0=pyhd3eb1b0_0
  - blas=1.0=mkl
  - blinker=1.4=py38h06a4308_0
  - blosc=1.19.0=hd408876_0
  - brotli=1.0.9=he6710b0_2
  - brotlipy=0.7.0=py38h27cfd23_1003
  - bzip2=1.0.8=h7b6447c_0
  - c-ares=1.17.1=h27cfd23_0
  - ca-certificates=2021.4.13=h06a4308_1
  - cachetools=4.2.2=pyhd3eb1b0_0
  - certifi=2020.12.5=py38h06a4308_0
  - cffi=1.14.5=py38h261ae71_0
  - chardet=3.0.4=py38h06a4308_1003
  - charls=2.1.0=he6710b0_2
  - click=8.0.0=pyhd3eb1b0_0
  - cloudpickle=1.6.0=py_0
  - coverage=5.5=py38h27cfd23_2
  - cryptography=3.4.7=py38hd23ed53_0
  - cudatoolkit=11.1.74=h6bb024c_0
  - cycler=0.10.0=py38_0
  - cython=0.29.23=py38h2531618_0
  - cytoolz=0.11.0=py38h7b6447c_0
  - dask-core=2021.5.0=pyhd3eb1b0_0
  - dbus=1.13.18=hb2f20db_0
  - decorator=5.0.9=pyhd3eb1b0_0
  - expat=2.3.0=h2531618_2
  - ffmpeg=4.3=hf484d3e_0
  - fontconfig=2.13.1=h6c09931_0
  - freetype=2.10.4=h5ab3b9f_0
  - fsspec=0.9.0=pyhd3eb1b0_0
  - giflib=5.1.4=h14c3975_1
  - glib=2.68.2=h36276a3_0
  - gmp=6.2.1=h2531618_2
  - gnutls=3.6.15=he1e5248_0
  - google-auth=1.30.0=pyhd3eb1b0_0
  - google-auth-oauthlib=0.4.4=pyhd3eb1b0_0
  - grpcio=1.36.1=py38h2157cd5_1
  - gst-plugins-base=1.14.0=h8213a91_2
  - gstreamer=1.14.0=h28cd5cc_2
  - icu=58.2=he6710b0_3
  - idna=2.10=pyhd3eb1b0_0
  - ignite=0.4.4=py_0
  - imagecodecs=2020.5.30=py38h567f118_1
  - imageio=2.9.0=pyhd3eb1b0_0
  - importlib-metadata=3.10.0=py38h06a4308_0
  - intel-openmp=2021.2.0=h06a4308_610
  - joblib=1.0.1=pyhd3eb1b0_0
  - jpeg=9b=h024ee3a_2
  - jxrlib=1.1=h7b6447c_2
  - kiwisolver=1.3.1=py38h2531618_0
  - lame=3.100=h7b6447c_0
  - lcms2=2.12=h3be6417_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libaec=1.0.4=he6710b0_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libiconv=1.15=h63c8f33_5
  - libidn2=2.3.1=h27cfd23_0
  - libpng=1.6.37=hbc83047_0
  - libprotobuf=3.14.0=h8c45485_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtasn1=4.16.0=h27cfd23_0
  - libtiff=4.1.0=h2733197_1
  - libunistring=0.9.10=h27cfd23_0
  - libuuid=1.0.3=h1bed415_2
  - libuv=1.40.0=h7b6447c_0
  - libwebp=1.0.1=h8e7db2f_0
  - libxcb=1.14=h7b6447c_0
  - libxml2=2.9.10=hb55368b_3
  - libzopfli=1.0.3=he6710b0_0
  - locket=0.2.1=py38h06a4308_1
  - lz4-c=1.9.3=h2531618_0
  - markdown=3.3.4=py38h06a4308_0
  - matplotlib=3.3.4=py38h06a4308_0
  - matplotlib-base=3.3.4=py38h62a2d02_0
  - mkl=2021.2.0=h06a4308_296
  - mkl-service=2.3.0=py38h27cfd23_1
  - mkl_fft=1.3.0=py38h42c9631_2
  - mkl_random=1.2.1=py38ha9443f7_2
  - multidict=4.7.6=py38h7b6447c_1
  - ncurses=6.2=he6710b0_1
  - nettle=3.7.2=hbbd107a_1
  - networkx=2.5=py_0
  - ninja=1.10.2=hff7bd54_1
  - numpy=1.20.2=py38h2d18471_0
  - numpy-base=1.20.2=py38hfae3a4d_0
  - oauthlib=3.1.0=py_0
  - olefile=0.46=py_0
  - openh264=2.1.0=hd408876_0
  - openjpeg=2.3.0=h05c96fa_1
  - openssl=1.1.1k=h27cfd23_0
  - pandas=1.2.4=py38h2531618_0
  - partd=1.2.0=pyhd3eb1b0_0
  - pcre=8.44=he6710b0_0
  - pillow=8.2.0=py38he98fc37_0
  - pip=21.0.1=py38h06a4308_0
  - protobuf=3.14.0=py38h2531618_1
  - pyasn1=0.4.8=py_0
  - pyasn1-modules=0.2.8=py_0
  - pycparser=2.20=py_2
  - pyjwt=1.7.1=py38_0
  - pyopenssl=20.0.1=pyhd3eb1b0_1
  - pyparsing=2.4.7=pyhd3eb1b0_0
  - pyqt=5.9.2=py38h05f1152_4
  - pysocks=1.7.1=py38h06a4308_0
  - python=3.8.10=hdb3f193_7
  - python-dateutil=2.8.1=pyhd3eb1b0_0
  - pytorch=1.8.1=py3.8_cuda11.1_cudnn8.0.5_0
  - pytz=2021.1=pyhd3eb1b0_0
  - pywavelets=1.1.1=py38h7b6447c_2
  - pyyaml=5.4.1=py38h27cfd23_1
  - qt=5.9.7=h5867ecd_1
  - readline=8.1=h27cfd23_0
  - requests=2.25.1=pyhd3eb1b0_0
  - requests-oauthlib=1.3.0=py_0
  - rsa=4.7.2=pyhd3eb1b0_1
  - scikit-image=0.18.1=py38ha9443f7_0
  - scikit-learn=0.24.2=py38ha9443f7_0
  - scipy=1.6.2=py38had2a1c9_1
  - setuptools=52.0.0=py38h06a4308_0
  - sip=4.19.13=py38he6710b0_0
  - six=1.15.0=py38h06a4308_0
  - snappy=1.1.8=he6710b0_0
  - sqlite=3.35.4=hdfb4753_0
  - tensorboard=2.4.0=pyhc547734_0
  - tensorboard-plugin-wit=1.6.0=py_0
  - threadpoolctl=2.1.0=pyh5ca1d4c_0
  - tifffile=2021.3.31=pyhd3eb1b0_1
  - tk=8.6.10=hbc83047_0
  - toolz=0.11.1=pyhd3eb1b0_0
  - torchaudio=0.8.1=py38
  - torchvision=0.9.1=py38_cu111
  - tornado=6.1=py38h27cfd23_0
  - typing_extensions=3.7.4.3=pyha847dfd_0
  - urllib3=1.26.4=pyhd3eb1b0_0
  - werkzeug=1.0.1=pyhd3eb1b0_0
  - wheel=0.36.2=pyhd3eb1b0_0
  - xz=5.2.5=h7b6447c_0
  - yaml=0.2.5=h7b6447c_0
  - yarl=1.6.3=py38h27cfd23_0
  - zipp=3.4.1=pyhd3eb1b0_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.4.9=haebb681_0
  - pip:
    - backcall==0.2.0
    - clearml==1.0.2
    - coveralls==3.0.1
    - cub-tools==1.0.0
    - docopt==0.6.2
    - furl==2.1.2
    - humanfriendly==9.1
    - imutils==0.5.4
    - iniconfig==1.1.1
    - ipython==7.23.1
    - ipython-genutils==0.2.0
    - jedi==0.18.0
    - jsonschema==3.2.0
    - kornia==0.4.1
    - matplotlib-inline==0.1.2
    - orderedmultidict==1.0.1
    - packaging==20.9
    - parso==0.8.2
    - pathlib2==2.3.5
    - pexpect==4.8.0
    - pickleshare==0.7.5
    - pluggy==0.13.1
    - prompt-toolkit==3.0.18
    - ptyprocess==0.7.0
    - py==1.10.0
    - pygments==2.9.0
    - pyrsistent==0.17.3
    - pytest==6.2.4
    - pytest-mock==3.6.1
    - pytorchcv==0.0.65
    - timm==0.4.9
    - toml==0.10.2
    - torch-lucent==0.1.8
    - tqdm==4.60.0
    - traitlets==5.0.5
    - wcwidth==0.2.5
    - yacs==0.1.8
prefix: /home/edmorris/.conda/envs/py38_pytorch18
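To illustrate how the exported file separates the two package sources, here is a small sketch that splits the `dependencies` section into a conda list and a pip list. It is a line-based reader written for the layout shown above, not a general YAML parser, and it assumes the nested `pip:` block comes last in `dependencies`, as it does in this export. The function name `split_dependencies` and the shortened sample are illustrative only.

```python
def split_dependencies(env_yaml_text):
    """Split a `conda env export` file into conda and pip package lists.

    Assumes the nested `pip:` block is the last entry under `dependencies:`.
    """
    conda_pkgs, pip_pkgs = [], []
    section = None  # None, "conda" or "pip"
    for raw in env_yaml_text.splitlines():
        line = raw.strip()
        if not line.startswith("- "):
            # An unindented key such as "channels:" or "prefix:" starts a new section.
            if raw and not raw[0].isspace():
                section = "conda" if line == "dependencies:" else None
            continue
        if section is None:
            continue  # list items outside `dependencies`, e.g. channels
        item = line[2:].strip()
        if item == "pip:":
            section = "pip"
        elif section == "pip":
            pip_pkgs.append(item)
        else:
            conda_pkgs.append(item)
    return conda_pkgs, pip_pkgs


# Shortened sample taken from the export above.
SAMPLE = """\
name: py38_pytorch18
channels:
  - pytorch
  - defaults
dependencies:
  - cudatoolkit=11.1.74=h6bb024c_0
  - pytorch=1.8.1=py3.8_cuda11.1_cudnn8.0.5_0
  - pip:
    - clearml==1.0.2
    - yacs==0.1.8
prefix: /home/edmorris/.conda/envs/py38_pytorch18
"""

conda_pkgs, pip_pkgs = split_dependencies(SAMPLE)
print("conda:", conda_pkgs)
print("pip:", pip_pkgs)
```

With the full export, PyTorch, torchvision and cudatoolkit land in the conda list, while packages such as clearml, timm and yacs land in the pip list.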

ecm200 commented 3 years ago

@bmartinn

Confirmed, this is an issue with the PyTorch installation using PIP as the package manager. The upstream PyTorch issue has been open for 27 days, and it doesn't look like there is a resolution other than using CONDA to install PyTorch into a virtual environment.

PyTorch Issue 56747

ecm200 commented 3 years ago

@bmartinn,

So my question here is how best to control environment creation by clearml-agent on the remote compute end when there is a combined requirement for both Conda and PIP.

ecm200 commented 3 years ago

@bmartinn

Thanks for all your tips and help.

After a lot of experimenting with various clearml-agent options, including running in docker mode, I started again using the conda package manager as the virtual environment creator on the remote compute node. This time, I could see that the clearml-agent used PIP to install additional packages if they could not be resolved using CONDA. This means that PyTorch was installed using the recommended CONDA method, which circumvented the issues found with using PIP to install PyTorch, as detailed above.

This has led to a successful creation of a training environment on the remote compute and a successful training of PyTorch models.

bmartinn commented 3 years ago

Great work!

BTW: If you are running the local code with conda, you can set the agent to use conda as well [see here]. Notice that if you are running locally with pip, the agent's conda environment will use pip to install the packages, to avoid version mismatch.

allegroai / clearml-agent