apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

BeamRunJavaPipelineOperator fails without job_name set #40515

Closed: turb closed this issue 2 months ago

turb commented 3 months ago

Apache Airflow Provider(s)

apache-beam

Versions of Apache Airflow Providers

I can't find the exact provider versions in Google Cloud Composer.

Some information is in the GCP documentation.

Apache Airflow version

2.7.3

Operating System

Google Cloud Composer

Deployment

Google Cloud Composer

Deployment details

Google Cloud Composer 2.8.2

What happened

BeamRunJavaPipelineOperator was running perfectly on Airflow 2.5.

After upgrading Google Cloud Composer to Airflow 2.6:

Failed to execute job xxx for task yyyy (Invalid job_name ({{task.task-id}}); the name must consist of only the characters [-a-z0-9], starting with a letter and ending with a letter or number ; 497143)

It seems {{task.task-id}} is not being resolved to the actual task id.

Upgrading then to Airflow 2.7 gives the same result.

Workaround: hardcode "job_name": "ZZZ" in the dataflow_config property.
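
For reference, a minimal sketch of that workaround in the DAG file (the jar path and the job name are placeholders, not values from this environment):

```python
from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator

# Passing an explicit job_name sidesteps the unresolved "{{task.task_id}}"
# default. Dataflow job names must use only lowercase letters, digits and
# hyphens, per the error message above; "zzz" is just a placeholder.
run_pipeline = BeamRunJavaPipelineOperator(
    task_id="run_pipeline",
    jar="gs://my-bucket/pipeline.jar",  # placeholder
    runner="DataflowRunner",
    dataflow_config={"job_name": "zzz"},
)
```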

What you think should happen instead

job_name should be automatically resolved from task_id as it was in 2.5.

How to reproduce

Run any BeamRunJavaPipelineOperator on 2.6+ without setting job_name.
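
A minimal reproduction sketch, assuming a Dataflow deployment like the one above (bucket, jar, class, project, and region are placeholders); job_name is deliberately omitted:

```python
from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator

# With no job_name anywhere, the affected provider versions fall back to a
# templated default that is never rendered, so Dataflow rejects the literal
# "{{task.task_id}}" string as an invalid job name.
run_pipeline = BeamRunJavaPipelineOperator(
    task_id="run_pipeline",
    jar="gs://my-bucket/pipeline.jar",          # placeholder
    job_class="com.example.MyPipeline",         # placeholder
    runner="DataflowRunner",
    dataflow_config={"project_id": "my-project", "location": "europe-west1"},
)
```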

Anything else

I am not at all familiar with Airflow internals, so providing a PR would take a lot of time.

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 3 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

RNHTTR commented 3 months ago

Is there a difference in the version of apache-airflow-providers-google before and after the upgrade?

turb commented 3 months ago

@RNHTTR From the documentation I understand it was upgraded from 10.12.0 to 10.17.0; however, I am not sure, since I did not keep track of the precise version before upgrading (and GCP does not keep a log of it either).

e-galan commented 2 months ago

The parsing of the Dataflow job_name from the task_id parameter stopped working after #37934, where the parsing of the dataflow_config parameter was moved from the operator's __init__() method into execute(). Apparently this was done to resolve unit test failures caused by the use of Airflow's templating syntax in __init__().
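
To illustrate the general mechanism (this is a generic sketch of Airflow's templating lifecycle with a made-up operator, not the provider's actual code): Jinja templates are rendered only for attributes listed in template_fields, and rendering happens after __init__() but before execute(), so a template string first introduced inside execute() is never rendered.

```python
from airflow.models.baseoperator import BaseOperator


class TemplatingDemoOperator(BaseOperator):
    # Only attributes named here are rendered by Jinja, and rendering runs
    # after __init__() but before execute().
    template_fields = ("job_name",)

    def __init__(self, *, job_name: str | None = None, **kwargs):
        super().__init__(**kwargs)
        # A templated default assigned here still gets rendered, because
        # "job_name" is a template field and rendering happens later.
        self.job_name = job_name or "{{ task.task_id }}"

    def execute(self, context):
        # A template string built only at this point would stay literal,
        # which matches the unresolved "{{task.task_id}}" seen in the error.
        self.log.info("Rendered job_name: %s", self.job_name)
```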

I should also note that this feature, where the job name is copied from the task ID, is somewhat unique to the Apache Beam operator; normally the task id and the job name are kept separate.

e-galan commented 2 months ago

Submitted #40645. It should resolve the issue, @turb.