AbsaOSS / atum

A dynamic data completeness and accuracy library at enterprise scale for Apache Spark
Apache License 2.0
29 stars 9 forks

Atum redesign #28

Open yruslan opened 4 years ago

yruslan commented 4 years ago

Background

Currently, Atum relies on the global state of a Spark application. This complicates using Atum in jobs that are more complex than a single-dataframe pipeline: if there are several dataframes and several reads/writes, and not every read and write is associated with control measurements, Atum will still try to process all dataframes as if they all required measurements.

The current workaround for such use cases is the disableControlMeasuresTracking() method, which is invoked before writing a dataframe that does not require control measurements.
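The global-state coupling described above can be illustrated with a minimal sketch. This is not Atum's real API: the only method name taken from this thread is `disableControlMeasuresTracking()`; everything else (`GlobalTracking`, `onWrite`) is a hypothetical stand-in for a single application-wide tracking flag.

```scala
// Sketch of the problem: one global flag decides whether *any* write is
// measured. GlobalTracking and onWrite are illustrative names only;
// disableControlMeasuresTracking() is the workaround named in the thread.
object GlobalTracking {
  private var enabled = true
  def disableControlMeasuresTracking(): Unit = { enabled = false }
  def enableControlMeasuresTracking(): Unit = { enabled = true }
  def onWrite(dfName: String): String =
    if (enabled) s"$dfName: measured" else s"$dfName: not measured"
}

// With several dataframes, the caller must toggle the global flag
// around every write that should not be measured:
val measured = GlobalTracking.onWrite("conformedDf")
GlobalTracking.disableControlMeasuresTracking()
val unmeasured = GlobalTracking.onWrite("debugDf")
GlobalTracking.enableControlMeasuresTracking()
```

The toggling is error-prone: forgetting to re-enable tracking silently drops measurements for every subsequent write in the application.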

Feature

Additional context

After the new design is confirmed, this issue can be converted to an epic and all subitems to tasks.

lokm01 commented 4 years ago

Makes sense.

AdrianOlosutean commented 4 years ago

I would also propose redesigning some parts so that they are immutable and in a functional style. What do you think?
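One way such an immutable, functional-style design could look is a per-dataframe control context where each operation returns a new value instead of mutating shared state. This is purely a hypothetical sketch of the proposal; none of these names (`ControlContext`, `Measurement`, `checkpoint`) come from Atum's actual API.

```scala
// Hypothetical sketch: an immutable control context attached to a single
// dataframe's lineage. Every checkpoint yields a NEW context value, so
// no global mutable state is needed.
final case class Measurement(controlName: String, value: BigDecimal)

final case class ControlContext(checkpoints: List[(String, List[Measurement])]) {
  def checkpoint(name: String, measurements: List[Measurement]): ControlContext =
    copy(checkpoints = checkpoints :+ (name -> measurements))
}

val ctx0 = ControlContext(Nil)
val ctx1 = ctx0.checkpoint("Raw", List(Measurement("recordCount", BigDecimal(100))))
val ctx2 = ctx1.checkpoint("Conformed", List(Measurement("recordCount", BigDecimal(100))))
// ctx0 and ctx1 are unchanged; dataframes that never touch a context
// are simply never measured, so no disable/enable toggling is required.
```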

lokm01 commented 4 years ago

Absolutely.

benedeki commented 4 years ago

Not sure about the last one as it's described, particularly in regard to the changes above. If Atum were "attached" to a dataset, it would make sense to send a "last message" on that dataframe. But I am not sure there would be anything to hook such an event to reliably. 🤔

yruslan commented 4 years ago

Yeah, it would probably be hard to implement an event that is sent last per dataset. But an event that is sent last during the lifetime of the application could be useful.
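An application-lifetime "last event" could piggyback on a JVM shutdown hook (or, in a Spark job, a `SparkListener`'s `onApplicationEnd` callback). The sketch below uses the plain-Scala `sys.addShutdownHook` variant so it is self-contained; the event name and `sendFinalEvent` are illustrative, not part of Atum.

```scala
// Sketch: send one final control event when the application ends.
// sys.addShutdownHook is real Scala stdlib; the rest is hypothetical.
import scala.collection.mutable.ListBuffer

val sentEvents = ListBuffer.empty[String]
def sendFinalEvent(): Unit = sentEvents += "control-measurements-final"

val hook = sys.addShutdownHook(sendFinalEvent())
// For illustration we run the hook body directly instead of exiting:
hook.run()
hook.remove() // avoid firing a second time at real JVM shutdown
```

A per-dataset "last" event is harder because nothing marks a dataframe's final write, which is the reliability concern raised above.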

AdrianOlosutean commented 4 years ago

Fields such as Country should be made optional; only the functional fields should be mandatory.
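This split could be modelled with `Option` fields and defaults, so callers supply only the functional fields. A hypothetical sketch, not Atum's actual metadata model; every field name here is an assumption except `country`, which is mentioned above:

```scala
// Hypothetical sketch: functional fields required, descriptive fields
// optional with None defaults. Field names are illustrative.
final case class ControlMeasureMetadata(
  sourceApplication: String,          // functional: required
  batchDate: String,                  // functional: required
  country: Option[String] = None,     // descriptive: optional
  historyType: Option[String] = None  // descriptive: optional
)

val minimal = ControlMeasureMetadata("MyJob", "2020-01-01")
val full    = ControlMeasureMetadata("MyJob", "2020-01-01", country = Some("ZA"))
```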