Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

Unable to log more than 200 parameters in a single job #35685

Open 0xfabioo opened 1 month ago

0xfabioo commented 1 month ago

Describe the bug

Attempting to log more than 200 parameters in a single job fails with the following error message:

INVALID_PARAMETER_VALUE: Response: {'Error': {'Code': 'ValidationError', 'Severity': None, 'Message': 'A field of the entity is over the size limit. FieldName=Parameters, Limit=200, Size=210. See https://aka.ms/azure-machine-learning-limits for service limits documentation.' .... 

This limitation is not documented at the provided link (https://aka.ms/azure-machine-learning-limits), nor could I find any reference to it elsewhere.

This limitation appears to originate from Azure Machine Learning (AML) itself, as the same code works correctly when using a local MLflow instance.

To Reproduce

Steps to reproduce the behavior:

import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# subscription_id, resource_group_name, and workspace_name are placeholders
# for your workspace details.
mlclient = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=subscription_id,
    resource_group_name=resource_group_name,
    workspace_name=workspace_name,
)

# Point MLflow at the workspace tracking URI
tracking_uri = mlclient.workspaces.get(workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment("Default")  # Use the Default experiment

# Logging 201 parameters in a single run fails with the ValidationError above
with mlflow.start_run() as run:
    for idx in range(0, 201):
        mlflow.log_param(f"param_{idx}", f"value_param_{idx}")

# The same error occurs when pausing between calls
with mlflow.start_run() as run:
    for idx in range(0, 201):
        # sleep(0.1)  # a short delay between calls makes no difference
        mlflow.log_param(f"param_{idx}", f"value_param_{idx}")

Expected behavior

It should be possible to log more than 200 parameters in a single job. If this limit is intentional, it should be clearly documented. Additionally, a limit of 200 parameters is quite low and is reached quickly with current state-of-the-art network architectures.
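
A possible stopgap (not part of the original report) is to keep the run under the service limit by logging only the first 200 entries as MLflow parameters and attaching the remainder to the run as a JSON artifact via mlflow.log_dict; the limit constant and helper below are hypothetical and only illustrate the idea:

import mlflow

PARAM_LIMIT = 200  # assumed per-run limit, taken from the error message above

def log_params_with_overflow(params: dict) -> None:
    # Log the first PARAM_LIMIT entries as regular parameters
    items = list(params.items())
    for name, value in items[:PARAM_LIMIT]:
        mlflow.log_param(name, value)
    # Attach any remaining entries to the run as a JSON artifact
    overflow = dict(items[PARAM_LIMIT:])
    if overflow:
        mlflow.log_dict(overflow, "overflow_params.json")

with mlflow.start_run():
    log_params_with_overflow({f"param_{i}": f"value_param_{i}" for i in range(0, 201)})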

github-actions[bot] commented 1 month ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/azure-ml-sdk @azureml-github.

dsamarov-prudentiasciences commented 3 weeks ago

Following. Are there any updates here?

hatboyzero commented 1 week ago

Curious about the status of this as well. I'm currently running into this issue when attempting to run a Phi-3 training job.