apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
36.66k stars 14.19k forks source link

[Helm Chart] Support multiple DagProcessors #26494

Open jtvmatos opened 2 years ago

jtvmatos commented 2 years ago

Description

Airflow 2.4.0 has introduced support for Support multiple DagProcessors parsing files from different locations.

However, Helm chart only supports Provision Standalone Dag Processor (or at least will support in Airflow Helm Chart 1.7.0).

Use case/motivation

For Airflow Helm Chart users take advantage of these new AIP-43 features, it is necessary to add support for provision multiple DagProcessors in Helm chart.

Related issues

Helmchart: Provision Standalone Dag Processor Support multiple DagProcessors parsing files from different locations

Are you willing to submit a PR?

Code of Conduct

potiuk commented 2 years ago

Marked it as a good first issue. I think it would require some smart way of defining how to do it in the configuration. Even if you are not willling to implement it (or maybe?), a proposal from the user perspective how it could look like when it comes to configuration would be most welcome @jtvmatos . Any ideas and suggestions?

jtvmatos commented 2 years ago

Hi @potiuk thanks for the feedback.

Unfortunately in the coming weeks I won't be available to create this PR (I will be unavailable for more than a month... if no one solves it by then, I can open the PR....)

But yes, this PR can be tricky and can create a breaking change in the current processor configurations. For example, creating a nested level for multiple DagProcessors could be an option:

# Airflow Dag Processor Config
dagProcessor:
  enabled: false
  # Airflow dag processors in the deployment
  processor:
    # Processor 1
    - name: processor-1 # Alias for metadata.name
      replicas: 1
      revisionHistoryLimit: ~
      command: ~
      args: ["bash", "-c", "exec airflow dag-processor"]
      strategy:
     (...)
    # Processor 2
    # - name: processor-2
    #   args: ["bash", "-c", "exec airflow dag-processor --subdir  ..."]

With this values.yaml, will be possible to set the processors configurations as currently used for env or secrets, i.e., dagProcessor.processor[0]..., dagProcessor.processor[1]... for N processors (>=2.4.0).

Alternatively, we can just add a extraDagProcessor list:

# Airflow Dag Processor Config
dagProcessor:
  enabled: false

   (...)

  env: []

  extraDagProcessor: []

This list is not breaking for the current settings. However, it almost imply a "master" processor and "other" processors.

potiuk commented 2 years ago

No worries - we do not have to add it super-quickly I think (but if somoene will want to add it this is still an option :)).

I'd say the best approach would be to make the top-level dag processor entry optional (and fail the configuration if both "arrays" and "top level" are specified (then we could name the array "multipleDagProcessors" simply?). WDYT @jedcunningham @mhenc ?

jtvmatos commented 2 years ago

I create a fork, with the approach that I mention before (only for my local tests): -> https://github.com/apache/airflow/commit/0793c55be0f63d8da53d16c1097bc274bbcc1a69

For now this is all I can do. But it can be useful for the discussion :)

Thank you all!

jedcunningham commented 2 years ago

We have a similar problem for celery workers too. I haven't given it a ton of thought, but I feel something like the breaking change as proposed above will be a better solution for us. That does mean we need to wait for version 2 of the chart though.

Either way, removing it from the existing milestone for now.

jtvmatos commented 2 years ago

Hi @jedcunningham,

Should we then follow a approach like @potiuk suggested? Something like:

# Airflow Dag Processor Config
dagProcessor:
  enabled: false
  # Number of airflow dag processors in the deployment
  replicas: 1

  (...)

# Airflow Multiple Dag Processor Config
multipleDagProcessor:
  enabled: false
  processor:
    # Processor 1
    - name: processor-1
      replicas: 1
     (...)
    - name: processor-2
      args: ["bash", "-c", "exec airflow dag-processor --subdir  ..."]
     (...)

With a validation that only one (dagProcessor or multipleDagProcessor) can be activated, and a Note that in Helm Chart v2 both configs will be merge. WDYT?

eladkal commented 1 year ago

@jtvmatos I think it's better to open a draft PR and discuss code changes over the PR

apinchuk1 commented 5 months ago

What is the current status of this? Looking forward to the feature of having multiple dag folders within the helm chart.

potiuk commented 5 months ago

What is the current status of this? Looking forward to the feature of having multiple dag folders within the helm chart.

If you would like to contribute it @apinchuk1 , this this is the most certain way of making a "status change" on that. Other than - someone will have to contribute it, so there is no "status change".