movingabout opened this issue 1 year ago
If you want to install your dependencies into an isolated environment, I'd suggest using a bare NVIDIA base image. There is not much value in using the ACPT image or a curated environment image built on top of it. If you want to extend an ACPT or curated environment, you should install the additional packages into the same (active) environment. Something like:
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7
RUN pip install my-favorite-package
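For the isolated-environment route, a minimal sketch with the v2 Python SDK could look like this (the image tag, environment name, and conda file path are illustrative assumptions, not values from this thread):

```python
# Sketch: an Azure ML environment on a bare CUDA base image with your own
# conda file, so all Python dependencies live in one isolated environment.
from azure.ai.ml.entities import Environment

env = Environment(
    name="isolated-cuda-env",  # hypothetical name
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.6-cudnn8-ubuntu20.04",  # assumed tag
    conda_file="./conda.yaml",  # pin every Python dependency here
)
```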
I am facing the same issue. After this, I removed all my code and just added the following:
from azure.ai.ml import command, Input

job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            path="https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv",
        ),
        test_train_ratio=0.2,
        learning_rate=0.25,
        registered_model_name=registered_model_name,
    ),
    code="../src/",  # location of source code
    command="python ./biobert-pytorch-master/relation-extraction/test.py",
    environment="acpt-pytorch-1.11-cuda11.3:1",
    description="this is a job to train an adverse event model",
    compute="very-costy-compute-cluster-new",
    experiment_name="env_name",
    tags={"tensorflow-gpu": "1.15.2"},
    # docker_args="--gpus all",
    display_name="adverse_event_prediction_display_name",
    environment_variables=dict(
        AZUREML_ARTIFACTS_DEFAULT_TIMEOUT=1200,
        # Note: CUDA_VISIBLE_DEVICES=1 exposes only device index 1; on a
        # single-GPU VM the only device is index 0, so this hides the GPU.
        CUDA_VISIBLE_DEVICES=1,
        NVIDIA_VISIBLE_DEVICES="all",  # must be the string "all", not the builtin all
    ),
)
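For context, a job object like this is typically submitted through an MLClient; a minimal sketch, assuming an already-authenticated client named ml_client:

```python
# Assumes ml_client is an authenticated azure.ai.ml.MLClient.
returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)  # link for monitoring the run in the studio UI
```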
My test.py only prints whether the device is GPU or CPU:
import torch

print(torch.__version__)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# Additional info when using CUDA
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB')
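For debugging cases like this, a few extra diagnostic lines (a sketch, not part of the original test.py) help distinguish a CPU-only torch build from a device-visibility problem:

```python
import os
import torch

print('torch CUDA build:', torch.version.cuda)     # None means a CPU-only wheel
print('device count:', torch.cuda.device_count())  # 0 means no visible devices
print('CUDA_VISIBLE_DEVICES:', os.environ.get('CUDA_VISIBLE_DEVICES'))
```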
The host tools log shows that there is a GPU on the host machine:
2023-08-26T12:46:27.488065Z INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:27 file LS_root/jobs/wd/.tmp does not exist
2023-08-26T12:46:27.510219Z INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:27 Not exporting to RunHistory as the exporter is either stopped or there is no data. Stopped: false; OriginalData: 10; FilteredData: 0.
2023-08-26T12:46:57.474991Z INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 Top Metrics
2023-08-26T12:46:57.475036Z INFO hosttools_capability::hosttools: stderr: bash: top: command not found
2023-08-26T12:46:57.475962Z INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 Free Memory Metrics
2023-08-26T12:46:57.475978Z INFO hosttools_capability::hosttools: stderr: bash: free: command not found
2023-08-26T12:46:57.476873Z INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 Docker Stats
2023-08-26T12:46:57.476887Z INFO hosttools_capability::hosttools: stderr: bash: docker: command not found
2023-08-26T12:46:57.476898Z INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 The vmsize STANDARD_NC4AS_T4_V3 is a GPU VM, running nvidia-smi command.
2023-08-26T12:46:57.476910Z INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 The vmsize STANDARD_NC4AS_T4_V3 is a GPU VM, running nvidia-smi command.
2023-08-26T12:46:57.488696Z INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 file LS_root/jobs/wd/.tmp does not exist
2023-08-26T12:46:57.514861Z INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 Not exporting to RunHistory as the exporter is either stopped or there is no data. Stopped: false; OriginalData: 10; FilteredData: 0.
2023-08-26T12:46:57.601170Z INFO hosttools_capability::hosttools: stderr: 2023/08/26 12:46:57 Nvidia-Smi
2023-08-26T12:46:57.601203Z INFO hosttools_capability::hosttools: stderr: Sat Aug 26 12:46:57 2023
2023-08-26T12:46:57.601218Z INFO hosttools_capability::hosttools: stderr: +-----------------------------------------------------------------------------+
2023-08-26T12:46:57.601229Z INFO hosttools_capability::hosttools: stderr: | NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4 |
2023-08-26T12:46:57.601240Z INFO hosttools_capability::hosttools: stderr: |-------------------------------+----------------------+----------------------+
2023-08-26T12:46:57.601250Z INFO hosttools_capability::hosttools: stderr: | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
2023-08-26T12:46:57.601260Z INFO hosttools_capability::hosttools: stderr: | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
2023-08-26T12:46:57.601271Z INFO hosttools_capability::hosttools: stderr: | | | MIG M. |
2023-08-26T12:46:57.601281Z INFO hosttools_capability::hosttools: stderr: |===============================+======================+======================|
2023-08-26T12:46:57.601301Z INFO hosttools_capability::hosttools: stderr: | 0 Tesla T4 On | 00000001:00:00.0 Off | Off |
2023-08-26T12:46:57.601319Z INFO hosttools_capability::hosttools: stderr: | N/A 31C P8 11W / 70W | 0MiB / 16127MiB | 0% Default |
2023-08-26T12:46:57.601334Z INFO hosttools_capability::hosttools: stderr: | | | N/A |
2023-08-26T12:46:57.601345Z INFO hosttools_capability::hosttools: stderr: +-------------------------------+----------------------+----------------------+
2023-08-26T12:46:57.601369Z INFO hosttools_capability::hosttools: stderr: +-----------------------------------------------------------------------------+
2023-08-26T12:46:57.601380Z INFO hosttools_capability::hosttools: stderr: | Processes: |
2023-08-26T12:46:57.601395Z INFO hosttools_capability::hosttools: stderr: | GPU GI CI PID Type Process name GPU Memory |
2023-08-26T12:46:57.601405Z INFO hosttools_capability::hosttools: stderr: | ID ID Usage |
2023-08-26T12:46:57.601416Z INFO hosttools_capability::hosttools: stderr: |=============================================================================|
2023-08-26T12:46:57.601428Z INFO hosttools_capability::hosttools: stderr: | No running processes found |
2023-08-26T12:46:57.601439Z INFO hosttools_capability::hosttools: stderr: +-----------------------------------------------------------------------------+
and the run completes successfully, but with device type cpu:
1.11.0
Using device: cpu
@movingabout Were you able to find a workaround for this? Please let me know.
Got it working. It was a problem with Azure ML studio's base machines.
I built my environment from the UI, extending this image:
acpt-pytorch-2.0-cuda11.7:latest
Also, the warning about the missing NVIDIA toolkit is misleading and does not affect GPU accessibility inside the container.
I could not make it work unfortunately.
@prashantguleria, could you maybe describe what you did? Especially which Dockerfile and/or conda environment you specified?
Also, since you built it from the UI, which options did you choose for environment source, type, etc.? I hope I understood you correctly regarding the UI, but I'm talking about this form:
Thank you!
@movingabout Hi,
You can select any one from the list of environments shown in your screenshot.
After choosing, just change the image path to the one you want to use.
In my case I used the following:
# This environment is an alias for
# https://github.com/Azure/azureml-assets/tree/main/assets/training/general/environments/acpt-pytorch-1.13-cuda11.7
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:latest
# Install pip dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir
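If you prefer scripting over the UI form, a rough SDK equivalent might look like the sketch below (the context folder and environment name are hypothetical):

```python
# Assumes ./docker_context/ holds the Dockerfile above plus requirements.txt,
# and ml_client is an authenticated azure.ai.ml.MLClient.
from azure.ai.ml.entities import BuildContext, Environment

env = Environment(
    name="acpt-pytorch-2.0-extended",  # hypothetical name
    build=BuildContext(path="./docker_context"),
)
ml_client.environments.create_or_update(env)
```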
Just make sure that you don't install any dependency that affects "pytorch" or "torch" in any way.
From your question I can see that you are installing these dependencies:
dependencies:
- python==3.10.9
- pip
- pytorch==2.0.1
- torchvision
- torchaudio
- pytorch-cuda=11.7
- cudatoolkit=11.7
You can skip the following, as these might change the environment settings and your code might fail:
dependencies:
- python==3.10.9
- pip
- pytorch==2.0.1
- pytorch-cuda=11.7
- cudatoolkit=11.7
Just use the versions provided by the base image. For example, in this case I am using PyTorch 2.0.
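One way to catch an accidental override early (a sketch, not from this thread) is a fail-fast check at the top of the training script:

```python
import torch

# torch.version.cuda is None when a CPU-only torch wheel is installed, which
# is what happens if a pip/conda dependency silently re-resolves torch.
print(torch.__version__, 'CUDA build:', torch.version.cuda)
if torch.version.cuda is None:
    raise RuntimeError(
        'torch was replaced by a CPU-only build; remove torch/pytorch pins '
        'and rely on the version shipped with the base image'
    )
```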
Hi @prashantguleria! Thanks for the details; however, I must bother you again :-)
I created an environment with this bare-bones Dockerfile:
# This environment is an alias for
# https://github.com/Azure/azureml-assets/tree/main/assets/training/general/environments/acpt-pytorch-1.13-cuda11.7
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:latest
However, accessing the GPU does not seem to work:
1) in the build log I still get the warning:
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
2) nvidia-smi seems to be accessible though:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000001:00:00.0 Off | 0 |
| N/A 48C P8 28W / 149W | 0MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
3) but the GPU is not detected by PyTorch:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
--> returns `cpu`
I'm using a GPU cluster with STANDARD_NC6 VMs (which I assume should not be a problem).
Thanks!
You can ignore this warning:
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
I am still getting it and have already highlighted it to the MS team. They will look into it if they feel it's necessary.
Can you please try using Standard_NC6s_v3? The size you mentioned in the question is deprecated starting from this month.
Remove all the dependencies if you are using any, and run this script:
import torch

print(torch.__version__)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# Additional info when using CUDA
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB')
Hi @prashantguleria,
Thanks!
Switching to another compute (Standard_NC24s_v3) did it!
For reference, this is the output I got based on your last post:
pytorch version = 2.0.1
Using device: cuda
Tesla V100-PCIE-16GB
Memory Usage:
Allocated: 0.0 GB
Cached: 0.0 GB
Glad it worked.
Hi @prashantguleria, @movingabout,
I followed your tests and suggestions here. Thanks!
However, when I run my Docker image, which is built on top of acpt-pytorch-2.0-cuda11.7:latest, I get this warning:
DEPRECATION NOTICE!
THIS IMAGE IS DEPRECATED and is scheduled for DELETION. https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
I am a bit confused. Does this mean we need to migrate to a new Docker base soon?
Thanks in advance.
This is probably the same issue I'm facing. I've tried this image: mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu117-py38-torch201:biweekly.202310.3. When running the script posted in this issue, I also get this output:
2.0.1
Using device: cpu
The GPU is not getting recognized in the Docker environment, but there's no problem when running nvidia-smi on the compute. The compute I'm using is Standard_NC24ads_A100_v4 (switching is not an option). I have used the older mcr.microsoft.com/azureml/aifx/stable-ubuntu2004-cu113-py38-torch1110 image in the past without any problems (with the same compute). Maybe the Docker configuration is outdated on Standard_NC24ads_A100_v4, so newer images don't run properly? If someone finds a solution, please let me know.
Edit: this could also be related to https://github.com/Azure/MachineLearningNotebooks/issues/1839. Second edit: running the same image and compute in a cluster solves the problem.
In Azure Machine Learning, I am trying to set up a PyTorch 2.0.1 run on a curated ACPT image:
mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:latest
(based on this and this documentation). I use the image above on a GPU cluster with size STANDARD_NC6 and this conda environment yaml:
But I keep getting a couple of error messages, and I cannot access the GPUs when running the PyTorch Python scripts.
1) while building the image, in 20_image_build_log.txt I get a warning that the NVIDIA driver can't be detected:
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
(see the attached 20_image_build_log.txt at line 3408)
2) when executing the Python script during a run, I get the following error message:
What am I doing wrong? Thanks for your help.