Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

PyTorch on Azure ML: problem accessing GPU with ACPT environment image #1917

Open movingabout opened 1 year ago

movingabout commented 1 year ago

In Azure Machine Learning, I am trying to set up a PyTorch 2.0.1 run on a curated ACPT image: mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:latest (based on this and this documentation)

I use the image above on a GPU cluster of size STANDARD_NC6 and this conda environment yaml:

name: compass-environment-simple
channels:
  - anaconda
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python==3.10.9
  - pip
  - pytorch==2.0.1
  - torchvision
  - torchaudio  
  - pytorch-cuda=11.7
  - cudatoolkit=11.7
  - pip:
      - pandas==2.0.1
      - numpy==1.24.3
      - azure-ai-ml==1.7.2
      - azureml-mlflow==1.51.0
      - mlflow==2.3.2
      - tensorboard==2.13.0
      - tensorboardX==2.6
      - matplotlib==3.7.1
      - matplotlib-inline==0.1.6
      - mizani==0.9.1
      - plotnine==0.12.1
      - seaborn==0.12.2
      - scikit-learn==1.2.2
      - scipy==1.10.1
      - numpy==1.24.3
      - plotly==5.14.1
      - statsmodels==0.14.0
      - ipython==8.14.0
      - tqdm==4.65.0
      - umap-learn==0.5.3  
      - imbalanced-learn==0.10.1
      - gensim==4.3.1
      - nltk==3.8.1
      - python-dotenv==1.0.0
      - openpyxl==3.1.2
      - xlwt==1.3.0
      - xlrd==2.0.1
      - XlsxWriter==3.1.0
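
The environment and job are wired up roughly like this with the azure-ai-ml v2 SDK (a minimal sketch, not my exact code; the file, folder, and cluster names below are placeholders):

# Sketch of the job submission (azure-ai-ml v2 SDK).
# Placeholders: "environment.yml" is the conda file above, "src/" holds the
# training script, and "gpu-cluster" is the STANDARD_NC6 compute cluster.
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Curated ACPT image as the Docker base, with the conda yaml layered on top.
env = Environment(
    name="compass-environment-simple",
    image="mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:latest",
    conda_file="environment.yml",
)

job = command(
    code="src/",
    command="python training_script.py",
    environment=env,
    compute="gpu-cluster",
)
ml_client.jobs.create_or_update(job)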

But I keep getting a couple of error messages and I cannot access the GPUs when running the PyTorch Python scripts.

1) While building the image, I get a warning in 20_image_build_log.txt that the NVIDIA driver can't be detected: WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available. (see line 3408 of the attached 20_image_build_log.txt)

2) When executing the Python script during a run, I get the following error message:

/bin/bash: /azureml-envs/azureml_8536d86a92a3687bb89e1cff176aedf4/lib/libtinfo.so.6: no version information available (required by /bin/bash)
['/mnt/azureml/cr/j/7db89bab2dab4c479669caa5a4af805b/exe/wd', '/azureml-envs/azureml_8536d86a92a3687bb89e1cff176aedf4/lib/python310.zip', '/azureml-envs/azureml_8536d86a92a3687bb89e1cff176aedf4/lib/python3.10', '/azureml-envs/azureml_8536d86a92a3687bb89e1cff176aedf4/lib/python3.10/lib-dynload', '/azureml-envs/azureml_8536d86a92a3687bb89e1cff176aedf4/lib/python3.10/site-packages', '/azureml-envs/azureml_8536d86a92a3687bb89e1cff176aedf4/lib/python3.10/site-packages/mpmath-1.2.1-py3.10.egg', '/mnt/azureml/cr/j/7db89bab2dab4c479669caa5a4af805b/exe/wd', '/mnt/azureml/cr/j/7db89bab2dab4c479669caa5a4af805b/exe', '/mnt/azureml/cr/j/7db89bab2dab4c479669caa5a4af805b']
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/7db89bab2dab4c479669caa5a4af805b/exe/wd/training_script.py", line 22, in <module>
    import torch
  File "/azureml-envs/azureml_8536d86a92a3687bb89e1cff176aedf4/lib/python3.10/site-packages/torch/__init__.py", line 229, in <module>
    from torch._C import *  # noqa: F403
ImportError: /azureml-envs/azureml_8536d86a92a3687bb89e1cff176aedf4/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: cudaGraphInstantiateWithFlags, version libcudart.so.11.0

What am I doing wrong? Thanks for your help.

vizhur commented 1 year ago

If you want to install your dependencies into an isolated environment, I'd suggest using a bare NVIDIA base image; there is not much value in using the ACPT image or a curated environment image built on top of it. If you want to extend the ACPT or curated environment, you should install the additional packages into the same (active) environment. Something like:

FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7
RUN pip install my-favorite-package
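
If you go the extended-image route, the image can be registered from a Docker build context with the v2 SDK, for example (a rough sketch; the folder and environment names are placeholders, and it assumes the Dockerfile above is saved as docker/Dockerfile):

# Sketch: register an environment built from a local Docker build context.
# Assumes the Dockerfile shown above lives in a local "docker/" folder.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import BuildContext, Environment
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

env = Environment(
    name="acpt-pytorch-2.0-extended",
    build=BuildContext(path="docker/"),  # folder containing the Dockerfile
    description="ACPT PyTorch 2.0 image with extra pip packages",
)
ml_client.environments.create_or_update(env)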

prashantguleria commented 1 year ago

I am facing the same issue. After this, I removed all my code and just submitted the following job:

from azure.ai.ml import command, Input

job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            path="https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv",
        ),
        test_train_ratio=0.2,
        learning_rate=0.25,
        registered_model_name=registered_model_name,
    ),
    code="../src/",  # location of source code
    command="python ./biobert-pytorch-master/relation-extraction/test.py",
    environment="acpt-pytorch-1.11-cuda11.3:1",
    description="this is a job to train an adverse event model",
    compute="very-costy-compute-cluster-new",
    experiment_name="env_name",
    tags={"tensorflow-gpu": "1.15.2"},
    # docker_args="--gpus all",
    display_name="adverse_event_prediction_display_name",
    environment_variables=dict(
        AZUREML_ARTIFACTS_DEFAULT_TIMEOUT=1200,
        CUDA_VISIBLE_DEVICES=1,
        NVIDIA_VISIBLE_DEVICES="all",
    ),
)

My test.py only prints whether the device is GPU or CPU:

import torch 

print(torch.__version__)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# Additional info when using CUDA
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

The host tools log shows that there is a GPU on the host machine:

2023-08-26T12:46:27.488065Z  INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:27 file LS_root/jobs/wd/.tmp does not exist command="/usr/local/bin/hosttools" line="2023/08/26 12:46:27 file LS_root/jobs/wd/.tmp does not exist"
2023-08-26T12:46:27.510219Z  INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:27 Not exporting to RunHistory as the exporter is either stopped or there is no data.Stopped: false; OriginalData: 10; FilteredData: 0. command="/usr/local/bin/hosttools" line="2023/08/26 12:46:27 Not exporting to RunHistory as the exporter is either stopped or there is no data.Stopped: false; OriginalData: 10; FilteredData: 0."
2023-08-26T12:46:57.474991Z  INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 Top Metrics command="/usr/local/bin/hosttools" line="2023/08/26 12:46:57 Top Metrics"
2023-08-26T12:46:57.475036Z  INFO hosttools_capability::hosttools: stderr: bash: top: command not found command="/usr/local/bin/hosttools" line="bash: top: command not found"
2023-08-26T12:46:57.475962Z  INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 Free Memory Metrics command="/usr/local/bin/hosttools" line="2023/08/26 12:46:57 Free Memory Metrics"
2023-08-26T12:46:57.475978Z  INFO hosttools_capability::hosttools: stderr: bash: free: command not found command="/usr/local/bin/hosttools" line="bash: free: command not found"
2023-08-26T12:46:57.476873Z  INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 Docker Stats command="/usr/local/bin/hosttools" line="2023/08/26 12:46:57 Docker Stats"
2023-08-26T12:46:57.476887Z  INFO hosttools_capability::hosttools: stderr: bash: docker: command not found command="/usr/local/bin/hosttools" line="bash: docker: command not found"
2023-08-26T12:46:57.476898Z  INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 The vmsize STANDARD_NC4AS_T4_V3 is a GPU VM, running nvidia-smi command. command="/usr/local/bin/hosttools" line="2023/08/26 12:46:57 The vmsize STANDARD_NC4AS_T4_V3 is a GPU VM, running nvidia-smi command."
2023-08-26T12:46:57.476910Z  INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 The vmsize STANDARD_NC4AS_T4_V3 is a GPU VM, running nvidia-smi command. command="/usr/local/bin/hosttools" line="2023/08/26 12:46:57 The vmsize STANDARD_NC4AS_T4_V3 is a GPU VM, running nvidia-smi command."
2023-08-26T12:46:57.488696Z  INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 file LS_root/jobs/wd/.tmp does not exist command="/usr/local/bin/hosttools" line="2023/08/26 12:46:57 file LS_root/jobs/wd/.tmp does not exist"
2023-08-26T12:46:57.514861Z  INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 Not exporting to RunHistory as the exporter is either stopped or there is no data.Stopped: false; OriginalData: 10; FilteredData: 0. command="/usr/local/bin/hosttools" line="2023/08/26 12:46:57 Not exporting to RunHistory as the exporter is either stopped or there is no data.Stopped: false; OriginalData: 10; FilteredData: 0."
2023-08-26T12:46:57.601170Z  INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 Nvidia-Smi command="/usr/local/bin/hosttools" line="2023/08/26 12:46:57 Nvidia-Smi"
2023-08-26T12:46:57.601203Z  INFO hosttools_capability::hosttools: stderr: Sat Aug 26 12:46:57 2023        command="/usr/local/bin/hosttools" line="Sat Aug 26 12:46:57 2023       "
2023-08-26T12:46:57.601218Z  INFO hosttools_capability::hosttools: stderr: +-----------------------------------------------------------------------------+ command="/usr/local/bin/hosttools" line="+-----------------------------------------------------------------------------+"
2023-08-26T12:46:57.601229Z  INFO hosttools_capability::hosttools: stderr: | NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     | command="/usr/local/bin/hosttools" line="| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |"
2023-08-26T12:46:57.601240Z  INFO hosttools_capability::hosttools: stderr: |-------------------------------+----------------------+----------------------+ command="/usr/local/bin/hosttools" line="|-------------------------------+----------------------+----------------------+"
2023-08-26T12:46:57.601250Z  INFO hosttools_capability::hosttools: stderr: | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC | command="/usr/local/bin/hosttools" line="| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |"
2023-08-26T12:46:57.601260Z  INFO hosttools_capability::hosttools: stderr: | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. | command="/usr/local/bin/hosttools" line="| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |"
2023-08-26T12:46:57.601271Z  INFO hosttools_capability::hosttools: stderr: |                               |                      |               MIG M. | command="/usr/local/bin/hosttools" line="|                               |                      |               MIG M. |"
2023-08-26T12:46:57.601281Z  INFO hosttools_capability::hosttools: stderr: |===============================+======================+======================| command="/usr/local/bin/hosttools" line="|===============================+======================+======================|"
2023-08-26T12:46:57.601301Z  INFO hosttools_capability::hosttools: stderr: |   0  Tesla T4            On   | 00000001:00:00.0 Off |                  Off | command="/usr/local/bin/hosttools" line="|   0  Tesla T4            On   | 00000001:00:00.0 Off |                  Off |"
2023-08-26T12:46:57.601319Z  INFO hosttools_capability::hosttools: stderr: | N/A   31C    P8    11W /  70W |      0MiB / 16127MiB |      0%      Default | command="/usr/local/bin/hosttools" line="| N/A   31C    P8    11W /  70W |      0MiB / 16127MiB |      0%      Default |"
2023-08-26T12:46:57.601334Z  INFO hosttools_capability::hosttools: stderr: |                               |                      |                  N/A | command="/usr/local/bin/hosttools" line="|                               |                      |                  N/A |"
2023-08-26T12:46:57.601345Z  INFO hosttools_capability::hosttools: stderr: +-------------------------------+----------------------+----------------------+ command="/usr/local/bin/hosttools" line="+-------------------------------+----------------------+----------------------+"
2023-08-26T12:46:57.601357Z  INFO hosttools_capability::hosttools: stderr:                                                                                 command="/usr/local/bin/hosttools" line="                                                                               "
2023-08-26T12:46:57.601369Z  INFO hosttools_capability::hosttools: stderr: +-----------------------------------------------------------------------------+ command="/usr/local/bin/hosttools" line="+-----------------------------------------------------------------------------+"
2023-08-26T12:46:57.601380Z  INFO hosttools_capability::hosttools: stderr: | Processes:                                                                  | command="/usr/local/bin/hosttools" line="| Processes:                                                                  |"
2023-08-26T12:46:57.601395Z  INFO hosttools_capability::hosttools: stderr: |  GPU   GI   CI        PID   Type   Process name                  GPU Memory | command="/usr/local/bin/hosttools" line="|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |"
2023-08-26T12:46:57.601405Z  INFO hosttools_capability::hosttools: stderr: |        ID   ID                                                   Usage      | command="/usr/local/bin/hosttools" line="|        ID   ID                                                   Usage      |"
2023-08-26T12:46:57.601416Z  INFO hosttools_capability::hosttools: stderr: |=============================================================================| command="/usr/local/bin/hosttools" line="|=============================================================================|"
2023-08-26T12:46:57.601428Z  INFO hosttools_capability::hosttools: stderr: |  No running processes found                                                 | command="/usr/local/bin/hosttools" line="|  No running processes found                                                 |"
2023-08-26T12:46:57.601439Z  INFO hosttools_capability::hosttools: stderr: +-----------------------------------------------------------------------------+ command="/usr/local/bin/hosttools" line="+-----------------------------------------------------------------------------+"

and the run is successful, with device type: cpu

1.11.0
Using device: cpu
prashantguleria commented 1 year ago

@movingabout Were you able to find a workaround for this? Please let me know.

prashantguleria commented 1 year ago

Got it working. It was a problem with Azure ML studio's base machines.

I built my environment from the UI, extending this image:

acpt-pytorch-2.0-cuda11.7:latest

Also, the warning about the missing NVIDIA toolkit is misleading and does not affect GPU accessibility inside the container.

movingabout commented 1 year ago

I could not make it work, unfortunately.

@prashantguleria, could you maybe describe what you did? Especially which Dockerfile and/or conda environment you specified?

Also, since you built it from the UI, which options did you choose for environment source, type, etc.? I hope I understood you correctly regarding the UI; I'm talking about this form (see the attached screenshot).

Thank you!

prashantguleria commented 1 year ago

@movingabout Hi,

You can select any one from the list of environments shown in your screenshot.

After choosing, just change the image path to the one you want to use.

In my case I used the following:

# This environment is an alias for 
# https://github.com/Azure/azureml-assets/tree/main/assets/training/general/environments/acpt-pytorch-1.13-cuda11.7
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:latest
# Install pip dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir

Just make sure that you don't install any dependency that affects "pytorch" or "torch" in any way.

From your question I can see that you are installing these dependencies:

dependencies:
  - python==3.10.9
  - pip
  - pytorch==2.0.1
  - torchvision
  - torchaudio  
  - pytorch-cuda=11.7
  - cudatoolkit=11.7

You can skip the following, as these might change the environment settings and your code might fail:

dependencies:
  - python==3.10.9
  - pip
  - pytorch==2.0.1
  - pytorch-cuda=11.7
  - cudatoolkit=11.7

Just use the version provided by the base image. For example, in this case I am using PyTorch 2.0:

(screenshot)
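
To confirm at runtime which versions the base image actually provides, a quick check like the following can be run inside the job (plain PyTorch introspection; just a sketch):

# Sketch: print the torch / CUDA versions shipped by the environment, to
# confirm that nothing in an extra pip/conda layer has replaced them.
import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
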
movingabout commented 1 year ago

Hi @prashantguleria! Thanks for the details; however, I must bother you again :-)

I created an environment with this bare-bones Dockerfile:

# This environment is an alias for 
# https://github.com/Azure/azureml-assets/tree/main/assets/training/general/environments/acpt-pytorch-1.13-cuda11.7
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:latest

However, accessing the GPU does not seem to work:

1) in the build log I still get the warning:

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.

2) nvidia-smi seems to be accessible though:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000001:00:00.0 Off |                    0 |
| N/A   48C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

3) but the GPU is not detected by PyTorch:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
--> returns `cpu` 

I'm using a GPU cluster with STANDARD_NC6 VMs (which I assume should not be a problem).
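
For reference, a slightly more verbose check can help distinguish a driver that isn't visible in the container from a CUDA build mismatch (a sketch using only PyTorch and the standard library):

# Sketch: compare what PyTorch was built against with what the driver exposes
# inside the job container.
import shutil
import subprocess

import torch

print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

# nvidia-smi only exists inside the container if the host driver is mounted in.
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
else:
    print("nvidia-smi not found inside the container")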

Thanks!

prashantguleria commented 1 year ago

You can ignore this warning:

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.

I am still getting it and have already highlighted it to the MS team. They will look into it if they feel it's necessary.

Can you please try using Standard_NC6s_v3? The size you mentioned in the question is deprecated as of this month:

https://azure.microsoft.com/en-us/updates/ncseries-azure-virtual-machines-will-be-retired-by-31-august-2022/

Remove all the dependencies if you are using any, and run this script:

import torch 

print(torch.__version__)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# Additional info when using CUDA
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')
movingabout commented 1 year ago

Hi @prashantguleria,

Thanks! Switching to another compute (Standard_NC24s_v3) did it!

For reference, this is the output I got based on your last post:

pytorch version = 2.0.1
Using device: cuda

Tesla V100-PCIE-16GB
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB
prashantguleria commented 1 year ago

Glad it worked!

patcharees commented 1 year ago

Hi @prashantguleria , @movingabout,

I followed your tests and suggestions here. Thanks!

However, when I run my Docker image, which is built on top of acpt-pytorch-2.0-cuda11.7:latest, I get this warning:


DEPRECATION NOTICE!

THIS IMAGE IS DEPRECATED and is scheduled for DELETION. https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

I am a bit confused. Does this mean that we need to migrate to a new Docker base image soon?

Thanks in advance.

r8420 commented 10 months ago

This is probably the same issue I'm facing. I've tried this image: mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu117-py38-torch201:biweekly.202310.3. When running the script that was posted in this issue, I also get this output:

2.0.1
Using device: cpu

The GPU is not getting recognized in the Docker environment, but there's no problem when running nvidia-smi on the compute. The compute I'm using is Standard_NC24ads_A100_v4 (switching is not an option). I have used the older mcr.microsoft.com/azureml/aifx/stable-ubuntu2004-cu113-py38-torch1110 image in the past without any problems (on the same compute). Maybe the Docker configuration is outdated on Standard_NC24ads_A100_v4, so newer images don't run properly? If someone finds a solution, please let me know.

Edit: could also be related to https://github.com/Azure/MachineLearningNotebooks/issues/1839

2nd edit: running the same image and compute in a cluster solves the problem.