apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

DatasetManager.register_dataset_change(): Fall back to datasets' "extra" if arg extra is empty #37006

Closed: Blizzke closed this issue 8 months ago

Blizzke commented 9 months ago

Description

I would like to know if it is possible to change the behavior of the register_dataset_change function so that, when the actual extra argument is empty, it copies the extra from the specified dataset?

Use case/motivation

We have a lot of datasets that contain multiple versions of the data. Because we want to keep the URI the same for all of those versions (for lineage etc.; it is the same dataset after all), we use the extra argument to pass along the version number. Unfortunately, after a change is registered, we lose that information.

It seems logical to us that when no extra is specified (which is currently impossible as far as we can see, since the existing call sites always pass one), the manager should fall back to the extra of the dataset. Our override is as simple as

from airflow.datasets import Dataset
from airflow.datasets.manager import DatasetManager as OriginalDatasetManager
from airflow.models.taskinstance import TaskInstance
from sqlalchemy.orm.session import Session


class DatasetManager(OriginalDatasetManager):
    def register_dataset_change(
        self,
        *,
        task_instance: TaskInstance,
        dataset: Dataset,
        extra=None,
        session: Session,
        **kwargs,
    ) -> None:
        # Fall back to the dataset's own extra when none was passed explicitly.
        if extra is None and dataset.extra:
            extra = dataset.extra

        super().register_dataset_change(
            task_instance=task_instance,
            dataset=dataset,
            extra=extra,
            session=session,
            **kwargs,
        )
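To make the intended behavior concrete, here is a minimal, self-contained sketch of the same fallback using stand-in classes (the names and structure here are hypothetical simplifications, not Airflow's real `Dataset` or `DatasetManager`): a change registered without an explicit `extra` inherits the dataset's own `extra`, while an explicit `extra` still wins.

```python
from dataclasses import dataclass, field

# Stand-in for Airflow's Dataset: a URI plus an "extra" dict (e.g. a data version).
@dataclass
class Dataset:
    uri: str
    extra: dict = field(default_factory=dict)

# Stand-in manager that records changes, modeling the proposed fallback.
class DatasetManager:
    def __init__(self):
        self.changes = []  # (uri, extra) tuples, standing in for DB rows

    def register_dataset_change(self, *, dataset: Dataset, extra=None, **kwargs):
        # Proposed behavior: fall back to the dataset's own extra when none is given.
        if extra is None and dataset.extra:
            extra = dataset.extra
        self.changes.append((dataset.uri, extra))

manager = DatasetManager()
ds = Dataset("s3://warehouse/sales", extra={"version": "2024-01"})

manager.register_dataset_change(dataset=ds)  # no explicit extra: inherited
manager.register_dataset_change(dataset=ds, extra={"version": "2024-02"})

print(manager.changes[0][1])  # {'version': '2024-01'}
print(manager.changes[1][1])  # {'version': '2024-02'}
```

With this behavior, the version information carried in the dataset's extra survives the change registration instead of being dropped.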

Related issues

No response

Are you willing to submit a PR?

Code of Conduct

uranusjr commented 8 months ago

See https://github.com/apache/airflow/issues/37810 and issues and PRs linked in it (including in comments). Doing this fallback would go against the original design vision.

Blizzke commented 8 months ago

I’m sorry, but if I understand the original discussion correctly, it is about something completely different from what we are trying to accomplish.

I view the extra as a way to qualify the URI further. The URI identifies the dataset; the extra contains specific information on how/what was used in/from that data when the task was triggered. In our case this is usually the data version.

I obviously can’t force you to implement it like this, but for us the URI identifies which dataset is involved and triggers any scheduling, while the extra identifies exactly which version of the data was used and is unique to every task that uses the dataset as an in/outlet. So I would argue that this data is part of the dataset, not of XCom.

potiuk commented 8 months ago

I don't think it's a matter of forcing anyone or convincing a single person.

Looks like you just want to propose a different way of treating extras, or maybe a completely new dataset feature. The right way of doing that, @Blizzke, is as usual: open a devlist discussion (not a GitHub issue) where you explain your rationale and present your proposal, convince people on the devlist of your idea, and, in case of a bigger change, write an Airflow Improvement Proposal describing your (apparently) design change or addition to the Dataset feature. Then you either reach consensus that your idea is good, or, when you can't reach consensus, you call for a vote.

This is how it works when you propose a design change to an important Airflow feature that impacts everyone using it; there is no other way. You will find all the links at https://airflow.apache.org/community/

Whether you get others to agree is entirely up to the arguments you present, how convincing you are, and how complete your proposal is. It's entirely in your hands; you just need to put in the effort to convince those from the community who take part in the discussion.