CoxAutomotiveDataSolutions / waimak

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Apache License 2.0

Only cache labels if they are actually used by more than one action further down the flow #54

Closed vavison closed 5 years ago

vavison commented 5 years ago

Currently, calling cacheAsParquet caches a label so that its DAG is not executed more than once when more than one action uses it. It would be good to check that the label really is used by more than one downstream action and to skip the caching if it is not, since caching a label that is read only once is no longer an optimisation. We could also add a flag that forces all requested caching to take place regardless of this check.
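To illustrate the proposed check, here is a minimal sketch in plain Scala. It does not use Waimak's real types: `Action`, `CacheCheck`, `usageCounts`, `labelsToCache` and the `forceCaching` parameter are all hypothetical names, and a flow is assumed to be just a list of actions with input and output labels.

```scala
// Hypothetical, simplified model of a flow action; not Waimak's real type.
final case class Action(name: String, inputLabels: List[String], outputLabels: List[String])

object CacheCheck {
  // Number of actions that read each label (a label read twice by the
  // same action is counted once for that action).
  def usageCounts(actions: Seq[Action]): Map[String, Int] =
    actions
      .flatMap(_.inputLabels.distinct)
      .groupBy(identity)
      .map { case (label, uses) => label -> uses.size }

  // Keep only the cache requests whose label feeds more than one action,
  // unless forceCaching (the flag suggested above) is set.
  def labelsToCache(requested: Set[String],
                    actions: Seq[Action],
                    forceCaching: Boolean = false): Set[String] =
    if (forceCaching) requested
    else {
      val counts = usageCounts(actions)
      requested.filter(label => counts.getOrElse(label, 0) > 1)
    }
}
```

With a flow in which two actions read a label "raw" and only one reads "clean", a request to cache both would keep just "raw" unless `forceCaching` is set.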

vavison commented 5 years ago

To make this possible, we will need to rewrite caching to work like the commit mechanism: calling cacheAsParquet will only mark a label as "to be cached", and the actual caching actions will be added to the flow in prepareForExecution. At that point we have access to the full DAG and can work out how often each label is used.
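A rough sketch of that deferred approach, using a toy immutable flow rather than Waimak's actual classes: `ToyFlow`, `Step`, `addStep` and the `forceCaching` parameter are assumptions made for illustration, and only the method names cacheAsParquet and prepareForExecution come from the discussion above. A real implementation would also need to rewire downstream actions to read from the cached copy, which this sketch leaves out.

```scala
// Toy stand-ins for a flow and its actions; none of these are Waimak's own types.
final case class Step(name: String, inputs: List[String], outputs: List[String])

final case class ToyFlow(steps: Vector[Step] = Vector.empty,
                         toCache: Set[String] = Set.empty) {

  def addStep(step: Step): ToyFlow = copy(steps = steps :+ step)

  // Only records the intent to cache; no caching step is added yet.
  def cacheAsParquet(label: String): ToyFlow = copy(toCache = toCache + label)

  // Called once the whole DAG is known: turn the marks into caching steps,
  // skipping labels that are read by fewer than two steps.
  def prepareForExecution(forceCaching: Boolean = false): ToyFlow = {
    val worthCaching = toCache.filter { label =>
      forceCaching || steps.count(_.inputs.contains(label)) > 1
    }
    val cacheSteps = worthCaching.toVector.sorted
      .map(label => Step(s"cache:$label", List(label), List(label)))
    copy(steps = steps ++ cacheSteps, toCache = Set.empty)
  }
}
```

Because the mark is stored separately from the steps, cacheAsParquet can be called before the downstream actions exist; the decision is made only in prepareForExecution, when every consumer of the label is visible.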