apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Amazon provider converts values to int when the tuning operator expects them as strings #43552

Open francesco-camussoni-ueno opened 1 week ago

francesco-camussoni-ueno commented 1 week ago

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

8.2.0

Apache Airflow version

2.6.3

Operating System

mw1.small

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened

I have this task related to a tuning job on a dag:

tuning_dict = {"task_id": "tuning", "config": {"HyperParameterTuningJobConfig": {"ParameterRanges": {"CategoricalParameterRanges": [{"Name": "max_features", "Values": ["sqrt", "log2"]}, {"Name": "criterion", "Values": ["gini", "entropy", "log_loss"]}], "ContinuousParameterRanges": [{"Name": "ccp_alpha", "MinValue": "0.0", "MaxValue": "0.02"}], "IntegerParameterRanges": [{"Name": "min_samples_leaf", "MinValue": "2", "MaxValue": "15"}, {"Name": "n_estimators", "MinValue": "50", "MaxValue": "500"}]}, "HyperParameterTuningJobObjective": {"Name": "validation:accuracy", "Type": "Maximize"}, "Strategy": "Bayesian", "RandomSeed": 123}, ...

The key ContinuousParameterRanges contains some hyperparameters for my tuning job whose values are passed as strings. This is required, based on the TuningOperator example: https://github.com/apache/airflow/blob/providers-amazon/3.4.0/airflow/providers/amazon/aws/example_dags/example_sagemaker.py (line 202).

But I'm seeing that they are converted to float in the case of ContinuousParameterRanges, or to int in the case of IntegerParameterRanges, because of this code: https://github.com/apache/airflow/blob/providers-amazon/8.20.0/airflow/providers/amazon/aws/operators/sagemaker.py (line 99, or the functions parse_config_integers/parse_integers)

So when I execute the dag I get errors like these:

Invalid type for parameter HyperParameterTuningJobConfig.ParameterRanges.ContinuousParameterRanges[0].MinValue, value: 0.0, type: <class 'float'>, valid types: <class 'str'>
Invalid type for parameter HyperParameterTuningJobConfig.ParameterRanges.ContinuousParameterRanges[0].MaxValue, value: 0.02, type: <class 'float'>, valid types: <class 'str'>
Invalid type for parameter HyperParameterTuningJobConfig.ParameterRanges.IntegerParameterRanges[0].MinValue, value: 2, type: <class 'int'>, valid types: <class 'str'>
Invalid type for parameter HyperParameterTuningJobConfig.ParameterRanges.IntegerParameterRanges[0].MaxValue, value: 15, type: <class 'int'>, valid types: <class 'str'>
Invalid type for parameter HyperParameterTuningJobConfig.ParameterRanges.IntegerParameterRanges[1].MinValue, value: 50, type: <class 'int'>, valid types: <class 'str'>
Invalid type for parameter HyperParameterTuningJobConfig.ParameterRanges.IntegerParameterRanges[1].MaxValue, value: 500, type: <class 'int'>, valid types: <class 'str'>
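The errors above say that every MinValue/MaxValue under ContinuousParameterRanges and IntegerParameterRanges must be a str. As a minimal sketch of a possible workaround, the bounds could be cast back to strings before the config reaches the API call. The helper name `stringify_range_bounds` is hypothetical and not part of the Amazon provider, and where it would need to run depends on which step performs the conversion, which this thread has not yet established.

```python
from copy import deepcopy

# Range groups whose bounds the error messages say must be strings.
RANGE_KEYS = ("ContinuousParameterRanges", "IntegerParameterRanges")


def stringify_range_bounds(config):
    """Return a copy of config with MinValue/MaxValue cast to str.

    Hypothetical helper for illustration; not part of the provider.
    """
    fixed = deepcopy(config)
    ranges = fixed.get("HyperParameterTuningJobConfig", {}).get("ParameterRanges", {})
    for key in RANGE_KEYS:
        for entry in ranges.get(key, []):
            for bound in ("MinValue", "MaxValue"):
                if bound in entry:
                    entry[bound] = str(entry[bound])
    return fixed


# A config shaped like the one in the error messages (already converted).
converted = {
    "HyperParameterTuningJobConfig": {
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                {"Name": "ccp_alpha", "MinValue": 0.0, "MaxValue": 0.02}
            ],
            "IntegerParameterRanges": [
                {"Name": "min_samples_leaf", "MinValue": 2, "MaxValue": 15}
            ],
        }
    }
}

fixed = stringify_range_bounds(converted)
print(fixed["HyperParameterTuningJobConfig"]["ParameterRanges"])
```

The function works on a deep copy, so the input dict is left untouched and the two can be compared afterwards.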

Any help?

What you think should happen instead

I think that those parameters should not be converted to float or int; they should stay as strings.

How to reproduce

Generate a dag with this task:

tuning_dict = {"task_id": "tuning", "config": {"HyperParameterTuningJobConfig": {"ParameterRanges": {"CategoricalParameterRanges": [{"Name": "max_features", "Values": ["sqrt", "log2"]}, {"Name": "criterion", "Values": ["gini", "entropy", "log_loss"]}], "ContinuousParameterRanges": [{"Name": "ccp_alpha", "MinValue": "0.0", "MaxValue": "0.02"}], "IntegerParameterRanges": [{"Name": "min_samples_leaf", "MinValue": "2", "MaxValue": "15"}, {"Name": "n_estimators", "MinValue": "50", "MaxValue": "500"}]}, "HyperParameterTuningJobObjective": {"Name": "validation:accuracy", "Type": "Maximize"}, "Strategy": "Bayesian", "RandomSeed": 123}, "ResourceLimits": {"MaxNumberOfTrainingJobs": 10, "MaxParallelTrainingJobs": 4, "MaxRuntimeInSeconds": 7200}, "Tags": [{"Key": "USER", "Value": "santiago.sarratea@itti.digital"}, {"Key": "TRIBU", "Value": "Central Data"}, {"Key": "SQUAD", "Value": "Personalization and Relevance"}, {"Key": "ONLINE_OR_BATCH", "Value": "batch"}, {"Key": "PREDICTION_TYPE", "Value": "clasificacion binaria"}, {"Key": "VERSION_DESCRIPTION", "Value": "Version inicial"}, {"Key": "DESCRIPTION", "Value": "Desarrollo de deployment de pipeline de entrenamiento"}], "HyperParameterTuningJobName": "mlpipeline-training-tuning", "TrainingJobDefinition": {"AlgorithmSpecification": {"TrainingImage": "", "TrainingInputMode": "File", "MetricDefinitions": [{"Name": "validation:accuracy", "Regex": "validation-accuracy=(.*?);"}, {"Name": "validation:recall", "Regex": "validation-recall=(.*?);"}, {"Name": "validation:precision", "Regex": "validation-precision=(.*?);"}]}, "InputDataConfig": [{"ChannelName": "ingestion", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "", "S3DataDistributionType": "FullyReplicated"}}}], "OutputDataConfig": {"S3OutputPath": "s3://pr-ueno-prod-sagemaker/ml-projects/mlpipeline/training_pipeline/tuning/output"}, "ResourceConfig": {"InstanceType": "ml.m5.large", "InstanceCount": 1, "VolumeSizeInGB": 10}, "StoppingCondition": {"MaxRuntimeInSeconds": 7200}, "RoleArn": "", "StaticHyperParameters": {}}}}

Anything else

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 week ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

eladkal commented 1 week ago

cc @ferruzzi @vincbeck

ferruzzi commented 1 week ago

Interesting. Are you using the SageMakerTuningOperator for this? I'm not sure the issue is quite what you think it is. If you look in the tuning operator where the integer fields are defined (on L875), neither of the ones you point out are being flagged for converting to ints. And if you are using the create_tuning_job hook directly, it doesn't appear to be processing the config at all.

[EDIT: In fact, none of the existing official operators or hooks seem to have "ContinuousParameterRanges" listed as a field which needs to be converted to an int...]
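To illustrate the point about explicitly listed integer fields, here is a rough sketch of how a path-based conversion behaves. It is not the provider's actual code: the function `cast_path_to_int` and the `INTEGER_FIELDS` list are hypothetical, and the paths follow the layout of the config posted in this issue. The idea it shows is that only values at listed key paths are cast, so string bounds under ContinuousParameterRanges would pass through such a step unchanged.

```python
def cast_path_to_int(config, path):
    """Cast the value at a nested key path to int, if the path exists.

    Hypothetical stand-in for a path-based conversion step.
    """
    node = config
    for key in path[:-1]:
        if not isinstance(node, dict) or key not in node:
            return  # Path not present: leave the config untouched.
        node = node[key]
    last = path[-1]
    if isinstance(node, dict) and last in node:
        node[last] = int(node[last])


# Illustrative list of fields to convert (an assumption, not the provider's list).
INTEGER_FIELDS = [
    ["ResourceLimits", "MaxNumberOfTrainingJobs"],
    ["ResourceLimits", "MaxParallelTrainingJobs"],
]

config = {
    "ResourceLimits": {"MaxNumberOfTrainingJobs": "10", "MaxParallelTrainingJobs": "4"},
    "HyperParameterTuningJobConfig": {
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                {"Name": "ccp_alpha", "MinValue": "0.0", "MaxValue": "0.02"}
            ]
        }
    },
}

for path in INTEGER_FIELDS:
    cast_path_to_int(config, path)

print(config)
```

If the real conversion works along these lines, the float and int values seen in the log would have to come from somewhere else in the pipeline.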

francesco-camussoni-ueno commented 1 week ago

I know it is executing this file: operators/sagemaker.py

Because the log contains this INFO message, "After preprocessing the config is:":

[2024-10-31, 15:57:06 UTC] {{sagemaker.py:96}} INFO - Preprocessing the config and doing required s3_operations
[2024-10-31, 15:57:06 UTC] {{sagemaker.py:100}} INFO - After preprocessing the config is:
 {
    "HyperParameterTuningJobConfig": {
        "HyperParameterTuningJobObjective": {
            "Name": "validation:accuracy",
            "Type": "Maximize"
        },
        "ParameterRanges": {
            "CategoricalParameterRanges": [
                {
                    "Name": "max_features",
                    "Values": [
                        "sqrt",
                        "log2"
                    ]
                },
                {
                    "Name": "criterion",
                    "Values": [
                        "gini",
                        "entropy",
                        "log_loss"
                    ]
                }
            ],
            "ContinuousParameterRanges": [
                {
                    "MaxValue": 0.02,
                    "MinValue": 0.0,
                    "Name": "ccp_alpha"
                }
            ],
            "IntegerParameterRanges": [
                {
                    "MaxValue": 15,
                    "MinValue": 2,
                    "Name": "min_samples_leaf"
                },
                {
                    "MaxValue": 500,
                    "MinValue": 50,
                    "Name": "n_estimators"
                }
            ]
        },
        "RandomSeed": 123,
        "Strategy": "Bayesian"
    },
    "HyperParameterTuningJobName": "mlpipeline-training-tuning",
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 10,
        "MaxParallelTrainingJobs": 4,
        "MaxRuntimeInSeconds": 7200
    },

Basically, the problem is that after the transformation I'm seeing those values as int or float instead of str
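One way to narrow down where the types change is to compare the config as written in the DAG against the config printed in the log. The following is a minimal sketch under that assumption; `type_changes` is a hypothetical helper, and the two small dicts are cut-down stand-ins for the real configs.

```python
def type_changes(before, after, path=""):
    """Yield (path, type_before, type_after) wherever a value's type differs."""
    if isinstance(before, dict) and isinstance(after, dict):
        # Only compare keys present on both sides, in a stable order.
        for key in sorted(before.keys() & after.keys()):
            child = f"{path}.{key}" if path else key
            yield from type_changes(before[key], after[key], child)
    elif isinstance(before, list) and isinstance(after, list):
        for i, (b, a) in enumerate(zip(before, after)):
            yield from type_changes(b, a, f"{path}[{i}]")
    elif type(before) is not type(after):
        yield path, type(before).__name__, type(after).__name__


# Config as written in the DAG (bounds are strings).
before = {
    "ParameterRanges": {
        "IntegerParameterRanges": [
            {"Name": "n_estimators", "MinValue": "50", "MaxValue": "500"}
        ]
    }
}
# Config as it appears in the log (bounds are ints).
after = {
    "ParameterRanges": {
        "IntegerParameterRanges": [
            {"Name": "n_estimators", "MinValue": 50, "MaxValue": 500}
        ]
    }
}

changes = list(type_changes(before, after))
for change in changes:
    print(change)
```

Running the same comparison at different points (for example, right after the DAG is parsed and again inside the operator) would show which step introduces the int and float values.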