vavison closed this issue 5 years ago
In order to make this possible, we will need to rewrite the way caching works to be similar to the commit mechanism, whereby calling `cacheAsParquet` will mark a label as "to be cached" but the actual caching actions are added to the flow in `prepareForExecution`. That way we will have access to the full DAG in order to discover how often labels are used.
Currently we can call `cacheAsParquet` to cache a label, avoiding its DAG being executed more than once when it is used by more than one action. It would be good to add a check that the label is indeed used by more than one downstream action, and remove the caching if it is not (as it ceases to be an optimisation at this point). We could also add a flag that will force all caching to take place, regardless of this check.
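As a rough illustration of the proposal, the sketch below models a flow as a plain list of actions and defers the caching decision until the whole DAG is known. It is a toy model and not the project's code: `Flow`, `Action`, `cache_as_parquet`, `prepare_for_execution` and the `force_caching` flag are hypothetical Python stand-ins, chosen only to mirror the names used above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """Hypothetical flow node: reads some labels and produces others."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass
class Flow:
    """Hypothetical flow: an ordered list of actions plus pending cache requests."""
    actions: list = field(default_factory=list)
    cache_requests: set = field(default_factory=set)

    def add_action(self, action):
        self.actions.append(action)
        return self

    def cache_as_parquet(self, label):
        # Only mark the label as "to be cached"; nothing is added to the DAG yet.
        self.cache_requests.add(label)
        return self

    def prepare_for_execution(self, force_caching=False):
        # The full DAG is available here, so count how many actions read each label.
        # set(...) ensures an action that lists a label twice is counted once.
        usage = Counter(
            label for action in self.actions for label in set(action.inputs)
        )
        # Keep a cache request only if more than one action consumes the label,
        # unless the caller forces all requested caching to take place.
        return sorted(
            label
            for label in self.cache_requests
            if force_caching or usage[label] > 1
        )


flow = (
    Flow()
    .add_action(Action("read", (), ("raw",)))
    .add_action(Action("clean", ("raw",), ("clean",)))
    .add_action(Action("report_a", ("clean",), ("a",)))
    .add_action(Action("report_b", ("clean",), ("b",)))
    .cache_as_parquet("raw")
    .cache_as_parquet("clean")
)

print(flow.prepare_for_execution())                    # ['clean']
print(flow.prepare_for_execution(force_caching=True))  # ['clean', 'raw']
```

Here `raw` is read by a single action, so its cache request is dropped by default, while `clean` feeds two actions and keeps its cache. A real implementation would presumably insert the actual caching actions into the flow at this point rather than return a list of labels.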