apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

cluster config generation using cluster config YAML file. #17089

Open iostreamdoth opened 3 years ago

iostreamdoth commented 3 years ago

Description

Add cluster config generation from a YAML file produced by the gcloud dataproc clusters export command.

Use case / motivation

We create quite a lot of clusters using a cluster config that keeps changing from project to project. I wrote a custom operator that reads a YAML config, converts it to a dict, and then uses that config to generate the cluster. I am wondering whether it would be useful for the cluster config generator operator to accept a YAML config, or whether a new operator should be created.

Are you willing to submit a PR?

I do have working code for converting YAML stored on GCS to a dict.

Related Issues

There is no related issue as such; this is a feature that would allow users to specify the cluster config as a YAML file.

boring-cyborg[bot] commented 3 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk commented 3 years ago

Feel free to make a PR!

mik-laj commented 3 years ago

config generator

The DataprocCreateClusterOperatorClusterGenerator class is deprecated, and we are not adding new features to it.

I think we can extend DataprocCreateClusterOperator to accept a file path in the cluster_config parameter, as the Cloud Build operator does: https://github.com/apache/airflow/blob/d268016a7a6ff4b65079f1dea080ead02aea99bb/airflow/providers/google/cloud/operators/cloud_build.py#L172-L175
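To make the suggestion above concrete, here is a minimal sketch of what "accept a dict or a file path" could look like. It is not the code from the linked Cloud Build operator and not an existing Airflow API: the helper name load_cluster_config is hypothetical, the choice of supported extensions is an assumption, and YAML parsing assumes the third-party PyYAML package is installed.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def load_cluster_config(cluster_config: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: return a config dict, loading it from a file if a path is given."""
    # A dict is assumed to already be a usable config and is returned unchanged.
    if isinstance(cluster_config, dict):
        return cluster_config

    path = Path(cluster_config)
    suffix = path.suffix.lower()
    # Reject unknown file types before touching the filesystem.
    if suffix not in (".json", ".yaml", ".yml"):
        raise ValueError(f"Unsupported cluster config file type: {path.name}")

    text = path.read_text(encoding="utf-8")
    if suffix == ".json":
        loaded = json.loads(text)
    else:
        # PyYAML is imported lazily so JSON-only callers do not need it.
        import yaml

        loaded = yaml.safe_load(text)

    # A cluster config must be a mapping, not a list or a scalar.
    if not isinstance(loaded, dict):
        raise ValueError(f"Expected a mapping at the top level of {path.name}")
    return loaded
```

Under these assumptions, the returned dict would be what gets passed on as cluster_config, so existing callers that already pass a dict would be unaffected.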

iostreamdoth commented 3 years ago

Do you mean airflow.providers.google.cloud.operators.dataproc.DataprocCreateClusterOperator or airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator? I thought this was going to be in the providers package. @mik-laj

mik-laj commented 3 years ago

@iostreamdoth ClusterGenerator from providers is a legacy/deprecated method of generating Dataproc cluster configuration. You should use the cluster_config parameter. For details, see: https://github.com/apache/airflow/blob/026ffe65d4738674512f691a56b922e82d0a2309/airflow/providers/google/cloud/operators/dataproc.py#L553-L559

iostreamdoth commented 3 years ago

Got it, I got confused by the operator name earlier.
Thanks @mik-laj

eladkal commented 2 years ago

@iostreamdoth do you plan to finish the stale PR?