AbsaOSS / atum

A dynamic data completeness and accuracy library at enterprise scale for Apache Spark
Apache License 2.0
29 stars 9 forks

Atum redesign #28

Open yruslan opened 4 years ago

yruslan commented 4 years ago

Background

Currently, Atum relies on the global state of a Spark application. This complicates using Atum in jobs that are more complex than a single-dataframe pipeline: if there are several dataframes and several reads/writes, and not every read and write is associated with control measurements, Atum will still try to process all dataframes as if they all required measurements.

The current workaround for such use cases is the disableControlMeasuresTracking() method, which is invoked before writing a dataframe that does not require control measurements.
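The global-state coupling described above can be illustrated with a minimal sketch. This is not Atum's real API: the only method name taken from this thread is `disableControlMeasuresTracking()`; everything else (`GlobalTracking`, `onWrite`) is a hypothetical stand-in for a single application-wide tracking flag.

```scala
// Sketch of the problem: one global flag decides whether *any* write is
// measured. GlobalTracking and onWrite are illustrative names only;
// disableControlMeasuresTracking() is the workaround named in the thread.
object GlobalTracking {
  private var enabled = true
  def disableControlMeasuresTracking(): Unit = { enabled = false }
  def enableControlMeasuresTracking(): Unit = { enabled = true }
  def onWrite(dfName: String): String =
    if (enabled) s"$dfName: measured" else s"$dfName: not measured"
}

// With several dataframes, the caller must toggle the global flag
// around every write that should not be measured:
val measured = GlobalTracking.onWrite("conformedDf")
GlobalTracking.disableControlMeasuresTracking()
val unmeasured = GlobalTracking.onWrite("debugDf")
GlobalTracking.enableControlMeasuresTracking()
```

The toggling is error-prone: forgetting to re-enable tracking silently drops measurements for every subsequent write in the application.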

Feature

Additional context

After the new design is confirmed, this issue can be converted to an epic and all subitems to tasks.

lokm01 commented 4 years ago

Makes sense.

AdrianOlosutean commented 4 years ago

I would also propose redesigning some parts so that they are immutable and in a functional style. What do you think?
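One way such an immutable, functional-style design could look is a per-dataframe control context where each operation returns a new value instead of mutating shared state. This is purely a hypothetical sketch of the proposal; none of these names (`ControlContext`, `Measurement`, `checkpoint`) come from Atum's actual API.

```scala
// Hypothetical sketch: an immutable control context attached to a single
// dataframe's lineage. Every checkpoint yields a NEW context value, so
// no global mutable state is needed.
final case class Measurement(controlName: String, value: BigDecimal)

final case class ControlContext(checkpoints: List[(String, List[Measurement])]) {
  def checkpoint(name: String, measurements: List[Measurement]): ControlContext =
    copy(checkpoints = checkpoints :+ (name -> measurements))
}

val ctx0 = ControlContext(Nil)
val ctx1 = ctx0.checkpoint("Raw", List(Measurement("recordCount", BigDecimal(100))))
val ctx2 = ctx1.checkpoint("Conformed", List(Measurement("recordCount", BigDecimal(100))))
// ctx0 and ctx1 are unchanged; dataframes that never touch a context
// are simply never measured, so no disable/enable toggling is required.
```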

lokm01 commented 4 years ago

Absolutely.

benedeki commented 4 years ago

Not sure about the last one as it's described, particularly in regard to the changes above. If Atum were "attached" to a dataset, it would make sense to send a "last message" on that dataframe. But I am not sure there would be anything to hook such an event to reliably. 🤔

yruslan commented 4 years ago

Yeah, it would probably be hard to implement an event that is sent last per dataset. But an event that is sent last during the lifetime of the application could be useful.
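An application-lifetime "last event" could piggyback on a JVM shutdown hook (or, in a Spark job, a `SparkListener`'s `onApplicationEnd` callback). The sketch below uses the plain-Scala `sys.addShutdownHook` variant so it is self-contained; the event name and `sendFinalEvent` are illustrative, not part of Atum.

```scala
// Sketch: send one final control event when the application ends.
// sys.addShutdownHook is real Scala stdlib; the rest is hypothetical.
import scala.collection.mutable.ListBuffer

val sentEvents = ListBuffer.empty[String]
def sendFinalEvent(): Unit = sentEvents += "control-measurements-final"

val hook = sys.addShutdownHook(sendFinalEvent())
// For illustration we run the hook body directly instead of exiting:
hook.run()
hook.remove() // avoid firing a second time at real JVM shutdown
```

A per-dataset "last" event is harder because nothing marks a dataframe's final write, which is the reliability concern raised above.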

AdrianOlosutean commented 4 years ago

Fields such as Country should be made optional; only the functional fields should be mandatory.
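This split could be modelled with `Option` fields and defaults, so callers supply only the functional fields. A hypothetical sketch, not Atum's actual metadata model; every field name here is an assumption except `country`, which is mentioned above:

```scala
// Hypothetical sketch: functional fields required, descriptive fields
// optional with None defaults. Field names are illustrative.
final case class ControlMeasureMetadata(
  sourceApplication: String,          // functional: required
  batchDate: String,                  // functional: required
  country: Option[String] = None,     // descriptive: optional
  historyType: Option[String] = None  // descriptive: optional
)

val minimal = ControlMeasureMetadata("MyJob", "2020-01-01")
val full    = ControlMeasureMetadata("MyJob", "2020-01-01", country = Some("ZA"))
```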