AbsaOSS / hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark
Apache License 2.0
44 stars 13 forks source link

Pass an instance of HyperdriveContext instead of accessing singleton object #122

Closed kevinwallimann closed 4 years ago

kevinwallimann commented 4 years ago

In #114 , HyperdriveContext was introduced to maintain state across components. To avoid a breaking change it was introduced as a singleton object. However, this requires the state to be stored in a var or a mutable map.

Now, HyperdriveContext should be a class that is passed to the components (in the read, transform, write method etc.). Then, the state map can be an immutable val.

felipemmelo commented 4 years ago

Quick comment on this since it was mentioned in CQC.

It seems the introduction of the context is to get rid of ConfigurationKeys required everywhere, among other things. So if it is passed as an argument, that purpose is defeated.

Also, usually, the idea of a context is to have something accessible by everyone inside the same JVM context (e.g. Tomcat, Spring, etc). So having it as a Singleton is probably the best choice since it will reflect the state at a given time for the whole system.

Finally, having a var is definitely not wrong, since the variable is not escaping the context of its wrapping object. In this case it seems design is at odds with a loose convention of the language, and I would argue that the former is way more important than the latter.

But again, only mentioning it here since it was brought around in the CQC session.

kevinwallimann commented 4 years ago

The original necessity for the context is from #114 . There, the writer needs to know which columns of the dataframe are the key columns. This information comes from the reader. Having a mutable state is not necessary, but easier to implement and doesn't require a breaking change. But in Hyperdrive, the information flow is linear. From reader, to decoder etc. to the writer, so you could pass and return the context from each method (read, decode, transform etc.).

If you write a component, you need to understand the whole Hyperdrive to understand that HyperdriveContext is indeed not accidentally modified by another thread in your method. If it's passed as a variable and return again, then the scope of the context is really just your method, which is easier to reason about imo, at the cost of some verbosity