apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Vertex AI - model versioning doesn't work with CreateAutoMLTextTrainingJobOperator #37400

Closed devinmnorris closed 6 months ago

devinmnorris commented 7 months ago

Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google==10.12.0

Apache Airflow version

2.6.3

Operating System

Ubuntu 22.04.3 LTS

Deployment

Docker-Compose

Deployment details

No response

What happened

When creating AutoML Text Training jobs with CreateAutoMLTextTrainingJobOperator and passing the resource name or model ID of an existing model to the parent_model parameter, an entirely new model at Version 1 shows up in Vertex AI Model Registry instead of a new version of the existing model.

What you think should happen instead

Since we provided an argument to parent_model, the model uploaded by the job should be a version of the existing parent model.

How to reproduce

If your model registry already has an existing model to use as the parent model, skip to step 3. Otherwise:

  1. Train the initial model
  2. Get the initial model's resource name
  3. Train a new model, specifying parent_model=initial_model_resource_name
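
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator
from airflow.providers.google.cloud.operators.vertex_ai.auto_ml import (
    CreateAutoMLTextTrainingJobOperator,
)

# PROJECT_ID, REGION, and DATASET_ID are placeholders for your own values.


def get_parent_model(project_id: str):
    # Return the resource name of the most recently updated model in the registry.
    from google.cloud import aiplatform

    aiplatform.init(project=project_id)
    models = [m for m in aiplatform.Model.list()]
    models.sort(key=lambda m: m.version_update_time, reverse=True)

    return models[0].resource_name


# dag_id and start_date are placeholder values.
with DAG(dag_id="vertex_ai_model_versioning", start_date=datetime(2024, 1, 1), schedule=None) as dag: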
    initial_model = CreateAutoMLTextTrainingJobOperator(
        task_id="create_auto_ml_training_job-1",
        project_id=PROJECT_ID,
        region=REGION,
        display_name="automl-training-job-1",
        training_fraction_split=0.8,
        test_fraction_split=0.2,
        dataset_id=DATASET_ID,
        prediction_type="classification",
    )

    initial_model_resource_name = PythonVirtualenvOperator(
        task_id="initial_model_resource_name",
        python_callable=get_parent_model,
        requirements=["google-cloud-aiplatform"],
        op_kwargs={
            "project_id": PROJECT_ID,
        },
    )

    model_version_2 = CreateAutoMLTextTrainingJobOperator(
        task_id="create_auto_ml_training_job-2",
        project_id=PROJECT_ID,
        region=REGION,
        display_name="automl-training-job-2",
        parent_model=initial_model_resource_name.output,
        training_fraction_split=0.8,
        test_fraction_split=0.2,
        dataset_id=DATASET_ID,
        prediction_type="classification",
    )

    initial_model >> initial_model_resource_name >> model_version_2
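
One way to check the outcome of the steps above is to compare the parent's resource name with that of the model produced by the second job. The sketch below is only an illustration and needs no Google Cloud access. The helper names `base_model_name` and `is_version_of_parent` are hypothetical, the resource names are made up, and it assumes model resource names look like `projects/{project}/locations/{location}/models/{model_id}` with an optional `@{version}` suffix.

```python
def base_model_name(resource_name: str) -> str:
    """Drop an optional '@version' or '@alias' suffix from a model resource name."""
    return resource_name.split("@", 1)[0]


def is_version_of_parent(parent_resource_name: str, trained_resource_name: str) -> bool:
    """Return True when both names point at the same Model Registry entry."""
    return base_model_name(parent_resource_name) == base_model_name(trained_resource_name)


# Hypothetical resource names, for illustration only.
parent = "projects/123/locations/us-central1/models/456"
expected = "projects/123/locations/us-central1/models/456@2"  # a new version of the parent
observed = "projects/123/locations/us-central1/models/789@1"  # a separate model, as reported

print(is_version_of_parent(parent, expected))  # True
print(is_version_of_parent(parent, observed))  # False
```

With the behavior described in this issue, the second job's model would fall into the `observed` case: a different model ID at version 1.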

Anything else

This problem only occurs when using the CreateAutoMLTextTrainingJobOperator, and not with the Vertex AI SDK for Python. For example, we were able to implement model versioning successfully using something like:

google-cloud-aiplatform==1.41.0

from google.cloud import aiplatform

aiplatform.init(project=PROJECT, location=LOCATION)

text_dataset = aiplatform.TextDataset(DATASET_ID)

job = aiplatform.AutoMLTextTrainingJob(
    display_name=display_name,
    prediction_type="classification",
    multi_label=False,
)

model = job.run(
    dataset=text_dataset,
    model_display_name=model_display_name,
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    parent_model=PARENT_MODEL_ID,
    is_default_version=is_default_version,
)

model.wait()

boring-cyborg[bot] commented 7 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

eladkal commented 7 months ago

cc @MaksYermak, can you take a look? I think you completed the system tests for Vertex AI. If the tests pass, then maybe we are missing some coverage for this bug?

VladaZakharova commented 7 months ago

Hi @devinmnorris! Regarding your example in the "Anything else" section, can you please provide the value of the PARENT_MODEL_ID parameter? As far as I can see from our implementation, the operator takes only the model_id as an input parameter, not the resource_name.

devinmnorris commented 7 months ago

Hi @VladaZakharova :)

We tried the SDK and the Operator approach using both:

It seems that either works when using the SDK, and neither works when using the Operator.
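
To test each form in isolation, it can help to convert between a bare model ID and a full resource name before passing the value to parent_model. The following is a minimal sketch, not the provider's implementation: `split_model_reference` and `to_resource_name` are hypothetical helpers, the project and model values are made up, and the layout `projects/{project}/locations/{location}/models/{model_id}` (optionally followed by `@{version}`) is an assumption.

```python
import re
from typing import Optional, Tuple

# Assumed layout of a full Vertex AI model resource name, with an optional version.
_RESOURCE_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/models/(?P<model_id>[^/@]+)(?:@(?P<version>[^/]+))?$"
)


def split_model_reference(value: str) -> Tuple[str, Optional[str]]:
    """Return (model_id, version) from a full resource name or a bare model ID."""
    match = _RESOURCE_NAME.match(value)
    if match:
        return match.group("model_id"), match.group("version")
    if "/" in value:
        # Contains path separators but does not follow the assumed layout.
        raise ValueError(f"Unrecognized model reference: {value!r}")
    model_id, _, version = value.partition("@")
    return model_id, version or None


def to_resource_name(value: str, project: str, location: str) -> str:
    """Build a full resource name (without version) from either accepted form."""
    model_id, _ = split_model_reference(value)
    return f"projects/{project}/locations/{location}/models/{model_id}"


# Hypothetical values, for illustration only.
print(split_model_reference("projects/my-project/locations/us-central1/models/456"))
print(split_model_reference("456"))
print(to_resource_name("456", "my-project", "us-central1"))
```

Running the same DAG once with the bare ID and once with the full resource name would show whether the operator treats the two forms differently.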