Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Job submission in the notebook doesn't work and no errors are given. #1969

Open WilliamDoman opened 3 months ago

WilliamDoman commented 3 months ago

Question.

I'm trying to learn how to train a vision model using Azure Machine Learning workspace notebooks.

I am trying to create an environment where I can run both the Azure AI SDK v2 and PyTorch to train a vision model, with access to data assets from both the notebook and the remote compute.

When I run my environment, I can see that the package versions are all correct.

The problem is that the notebook using my environment and kernel won't submit the job, and no errors are shown. If I switch to the built-in Python 3.10 - SDK V2 kernel, it submits.
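
Since the same code behaves differently under the two kernels, one way to narrow this down is to snapshot the installed packages in each kernel and compare them. The following is only a sketch using the standard library; `snapshot_environment` and `diff_snapshots` are hypothetical helper names, not part of the Azure SDK.

```python
import sys
from importlib import metadata


def snapshot_environment():
    """Return the interpreter path and installed package versions for this kernel."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            packages[name.lower()] = dist.version
    return {"executable": sys.executable, "packages": packages}


def diff_snapshots(working, failing):
    """Map each differing package to (version in working, version in failing)."""
    names = set(working["packages"]) | set(failing["packages"])
    differences = {}
    for name in sorted(names):
        in_working = working["packages"].get(name)
        in_failing = failing["packages"].get(name)
        if in_working != in_failing:
            differences[name] = (in_working, in_failing)
    return differences
```

Running `snapshot_environment()` in each kernel, saving both results (for example as JSON), and passing them to `diff_snapshots` lists every package that is missing or at a different version in one of the two kernels.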

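# Imports from the azure-ai-ml (SDK v2) package that the snippet relies on
from azure.ai.ml import Input, Output, command
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# Define the command job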
job = command(
    code="./",  # Path to your training script
    command="python trainV2.py",  # Adjust to your script name
    inputs={
        "train_data": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}train_val_list_v2.txt"),
        "test_data": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}test_list_v2.txt"),
        "labels": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}Data_Entry_2017.csv"),
        "images": Input(type=AssetTypes.URI_FOLDER, path=f"{dataset.path}images")
    },
    outputs={
        "outputFolder": Output(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.RW_MOUNT)
    },
    environment=environment,
    compute=compute_cluster_name,
    instance_count=1,
    display_name="exp",
    experiment_name="exp"
)

# Submit the job
results = ml_client.jobs.create_or_update(job)

The output I get in my environment:

Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Warning: the provided asset name 'ENV-Torch2_2-Cuda12_1_SDK2' will not be used for anonymous registration
Warning: the provided asset name 'ENV-Torch2_2-Cuda12_1_SDK2' will not be used for anonymous registration

But if I run the same code with the default Python 3.10 - SDK V2 kernel, I get the same output plus one additional line:

Uploading Exp (0.11 MBs): 100%|██████████| 107858/107858 [00:00<00:00, 970196.92it/s]
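
Because the only visible difference is the missing upload line, it may help to inspect what `create_or_update` actually returned and to turn on more verbose logging before submitting. Below is a minimal sketch, assuming the SDK logs under the `azure` logger name as Azure SDK for Python packages generally do; `enable_azure_debug_logging` and `submit_and_report` are hypothetical helper names, and `ml_client` and `job` are the objects from the snippet above.

```python
import logging
import sys


def enable_azure_debug_logging():
    """Send DEBUG-level records from the "azure" logger hierarchy to stdout."""
    logger = logging.getLogger("azure")  # assumption: SDK loggers live under "azure"
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger


def submit_and_report(ml_client, job):
    """Submit the job and print what the service returned, re-raising any error."""
    try:
        returned_job = ml_client.jobs.create_or_update(job)
    except Exception as exc:
        # Print the exception so it is visible even if the notebook hides tracebacks
        print(f"Submission failed: {type(exc).__name__}: {exc}")
        raise
    print(f"name:   {returned_job.name}")
    print(f"status: {returned_job.status}")
    print(f"studio: {returned_job.studio_url}")
    return returned_job
```

If `submit_and_report(ml_client, job)` prints a name and a studio URL under the custom kernel, the job was created and only the upload progress line is missing; if it raises or hangs, the debug log output should show the last request that was attempted.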

My environment is built from a standard image, with the packages added through requirements.txt. I've tried hundreds of versions of this, but this is basically the latest rendition.

FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch22x:biweekly.202408.2

# Install pip dependencies
COPY requirements.txt .

#RUN pip install scikit-build==0.16.7 --no-cache-dir
RUN pip install -r requirements.txt --no-cache-dir

# Inference requirements
COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=500
EXPOSE 5001 8883 8888

# support Deepspeed launcher requirement of passwordless ssh login
RUN apt-get update
RUN apt-get install -y openssh-server openssh-client

With this in requirements.txt:

# Azure ML SDK v2 packages
azure-ai-ml==1.16.1
azure-core==1.30.2
azure-identity==1.17.1
azure-storage-blob==12.22.0
azure-storage-file-datalake==12.16.0

# PyTorch and related packages
torch==2.2.2  # Match the internal version if necessary
torch-nebula==0.16.13  # If needed, otherwise omit
torch-ort==1.17.0  # If needed, otherwise omit
torchaudio==2.2.2+cu121
torchdata==0.7.1
torchmetrics==1.2.0
torch-tb-profiler==0.4.3
torchvision==0.17.2+cu121

# Core scientific packages
numpy>=1.23.0,<2.0    # ==1.23.0
pandas==1.5.0
#scikit-image>=0.21.0
#SimpleITK==2.1.0
matplotlib==3.5.0
pydicom==2.3.0
pybind11==2.13.4
regex==2024.7.24

# Data handling and serialization
pyarrow==14.0.2  # Match the version in the successful environment
fsspec  # Match the successful environment's version ==2024.10.0

# Additional dependencies
albumentations==1.4.14  # As per your original list
mltable==1.6.1
tqdm==4.66.5
urllib3==2.2.2
cryptography==43.0.0
aiohttp==3.10.1
py-spy==0.3.12
debugpy==1.6.7.post1
ipykernel==6.29.5
tensorboard==2.17.1
psutil==5.8.0
Pillow==10.4.0
plotly==5.23.0
dcmstack==0.9.0
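
To confirm that the kernel really has the versions pinned above, the pins can be compared against what is installed. This is a rough sketch using only the standard library; `parse_pins` and `find_mismatches` are hypothetical helper names, only `==` pins are checked, and versions are compared as plain strings without normalization.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for lines pinned with '=='; other lines are skipped."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing and full-line comments
        if "==" not in line:
            continue  # unpinned or range-constrained requirement
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Map each pinned package to (pinned, installed) where the two differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this kernel
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

For example, running `find_mismatches(parse_pins(open("requirements.txt").read()))` inside the custom kernel returns an empty dict when every `==` pin matches the installed version.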

nataliameira commented 2 months ago

You can find the job that you ran in your workspace and select it. Then go to its logs.
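
The same logs can probably also be pulled from the notebook. Here is a minimal sketch, assuming an authenticated `MLClient` from `azure-ai-ml`; `fetch_job_logs` is a hypothetical helper name, and the default download folder is only an example.

```python
def fetch_job_logs(ml_client, job_name, download_path="./job_logs"):
    """Print the job's status and download its logs and outputs locally."""
    job = ml_client.jobs.get(job_name)
    print(f"{job.name}: {job.status}")
    # all=True asks the SDK to download logs together with all named outputs
    ml_client.jobs.download(name=job_name, download_path=download_path, all=True)
    return job
```

Calling `fetch_job_logs(ml_client, results.name)` after a submission should leave the log files under `./job_logs`; for a job that is still running, `ml_client.jobs.stream(results.name)` streams the log output instead.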