I'm trying to learn how to train a vision model in Azure Machine Learning workspace notebooks.
I am trying to create an environment where I can run both the Azure AI ML SDK v2 and PyTorch to train a vision model, with access to data assets from both the notebook and the remote compute.
When I run my environment, I can see that the package versions are all correct.
The problem is that a notebook using my environment and kernel won't submit the job, yet it shows no errors. If I switch to the built-in Python 3.10 - SDK V2 kernel, the job submits.
from azure.ai.ml import Input, Output, command
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# ml_client, dataset, environment and compute_cluster_name are defined earlier in the notebook.

# Define the command job
job = command(
    code="./",  # Folder containing the training script
    command="python trainV2.py",  # Adjust to your script name
    inputs={
        "train_data": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}train_val_list_v2.txt"),
        "test_data": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}test_list_v2.txt"),
        "labels": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}Data_Entry_2017.csv"),
        "images": Input(type=AssetTypes.URI_FOLDER, path=f"{dataset.path}images"),
    },
    outputs={
        "outputFolder": Output(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.RW_MOUNT),
    },
    environment=environment,
    compute=compute_cluster_name,
    instance_count=1,
    display_name="exp",
    experiment_name="exp",
)

# Submit the job
results = ml_client.jobs.create_or_update(job)
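One way to tell a silent failure apart from a job that was created without printing the upload line would be to look at what the call returned. The following is only a minimal sketch: `summarize_job` is a hypothetical helper (not part of the SDK), and it assumes the returned job object exposes `name`, `status` and `studio_url` attributes. It uses `getattr` with a default so a missing attribute shows up as `None` rather than raising.

```python
from types import SimpleNamespace


def summarize_job(job):
    """Return the identifying fields of a submitted job as a dict.

    Hypothetical helper: getattr with a default of None means a missing
    attribute is reported as None instead of raising AttributeError.
    """
    fields = ("name", "status", "studio_url")
    return {field: getattr(job, field, None) for field in fields}


# Stand-in object for illustration only; in the notebook, pass `results`.
example = SimpleNamespace(name="example-job", status="Starting")
print(summarize_job(example))
```

In the notebook this would be called as `summarize_job(results)` right after the submission. A populated `name` and `studio_url` would suggest the job was in fact created.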
The output I get with my environment's kernel:
Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Warning: the provided asset name 'ENV-Torch2_2-Cuda12_1_SDK2' will not be used for anonymous registration
Warning: the provided asset name 'ENV-Torch2_2-Cuda12_1_SDK2' will not be used for anonymous registration
But if I run the same code with the default Python 3.10 - SDK V2 kernel, I get the same output plus one additional line:
Uploading Exp (0.11 MBs): 100%|██████████| 107858/107858 [00:00<00:00, 970196.92it/s]
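Since the only visible difference is the missing upload line, turning up logging before submitting might show where the two kernels diverge. This is a sketch under assumptions: the Azure SDK for Python logs through the standard `logging` module under the `azure` logger hierarchy, but whether the relevant upload step emits anything there is not something I have confirmed. `enable_azure_debug_logging` is a hypothetical helper name.

```python
import logging
import sys


def enable_azure_debug_logging(stream=sys.stdout):
    """Send DEBUG-level records from the 'azure' logger hierarchy to `stream`."""
    logger = logging.getLogger("azure")
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


# Call once per kernel session; calling it again adds a second handler.
azure_logger = enable_azure_debug_logging()
```

Running this in both kernels before `ml_client.jobs.create_or_update(job)` and comparing the output side by side would be the idea.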
My environment configuration uses a standard base image and installs the packages listed in requirements.txt. I've tried hundreds of variations of this, but this is the latest one:
# Azure ML SDK v2 packages
azure-ai-ml==1.16.1
azure-core==1.30.2
azure-identity==1.17.1
azure-storage-blob==12.22.0
azure-storage-file-datalake==12.16.0
# PyTorch and related packages
torch==2.2.2 # Match the internal version if necessary
torch-nebula==0.16.13 # If needed, otherwise omit
torch-ort==1.17.0 # If needed, otherwise omit
torchaudio==2.2.2+cu121
torchdata==0.7.1
torchmetrics==1.2.0
torch-tb-profiler==0.4.3
torchvision==0.17.2+cu121
# Core scientific packages
numpy>=1.23.0,<2.0 # ==1.23.0
pandas==1.5.0
#scikit-image>=0.21.0
#SimpleITK==2.1.0
matplotlib==3.5.0
pydicom==2.3.0
pybind11==2.13.4
regex==2024.7.24
# Data handling and serialization
pyarrow==14.0.2 # Match the version in the successful environment
fsspec # Match the successful environment's version ==2024.10.0
# Additional dependencies
albumentations==1.4.14 # As per your original list
mltable==1.6.1
tqdm==4.66.5
urllib3==2.2.2
cryptography==43.0.0
aiohttp==3.10.1
py-spy==0.3.12
debugpy==1.6.7.post1
ipykernel==6.29.5
tensorboard==2.17.1
psutil==5.8.0
Pillow==10.4.0
plotly==5.23.0
dcmstack==0.9.0
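To check that the custom kernel really resolves to the same versions as the built-in one, the same cell can be run under both kernels and the output compared. A minimal sketch using only the standard library: `installed_versions` is a hypothetical helper, and the package list is just a subset of the requirements above chosen for illustration.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the requirements above; extend as needed.
packages = [
    "azure-ai-ml",
    "azure-core",
    "azure-identity",
    "azure-storage-blob",
    "tqdm",
    "fsspec",
    "pyarrow",
]
for name, version in installed_versions(packages).items():
    print(f"{name}=={version}")
```

A package that prints as `None` in one kernel but not the other, or with a different version, would be the first thing to look at.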