Open kagharpure opened 5 years ago
Usually `kartothek` expects proper store factories. We only make an exception for the `eager` implementation. We're using the `utils._make_callable` function: https://github.com/JDASoftwareGroup/kartothek/blob/3f7344fc29dc154923dae81d5352fa34698f7059/kartothek/io_components/utils.py#L345
This is currently done for a few functions, e.g. https://github.com/JDASoftwareGroup/kartothek/blob/3f7344fc29dc154923dae81d5352fa34698f7059/kartothek/io/eager.py#L125
Thanks @fjetter. Out of curiosity and for the wider user-base, what's the reasoning behind the choice of store factories?
@fjetter Why is an exception made for `eager`? A user would expect this to be consistent across back-ends.
`eager`'s `garbage_collect_dataset` also assumes a callable. Admittedly, I implemented both these functions.
Thanks @fjetter. Out of curiosity and for the wider user-base, what's the reasoning behind the choice of store factories?
Store objects encapsulate connections to a storage service. In the methods that have a distributed computing backend, we pass the function arguments via `pickle` to the other workers. While `pickle` can preserve the state of an object's attributes, the connections the object holds are no longer valid afterwards and cannot be transferred between processes. Thus we pass callables, so that each worker can instantiate a new connection.
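To make the pickling argument concrete, here is a minimal sketch using only the standard library. It does not use kartothek or any real store class: a bare `socket.socket` stands in for the connection a store object would hold, and `connection_factory` is a hypothetical name for the callable that would be passed instead.

```python
import pickle
import socket
from functools import partial

# A bare socket stands in for the connection a store object would hold
# (assumption for illustration; no real storage service is involved).
connection = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

try:
    pickle.dumps(connection)
    connection_pickles = True
except TypeError:
    # Live OS-level handles cannot be serialized.
    connection_pickles = False

# A factory is only a recipe: it pickles because nothing is open yet.
connection_factory = partial(socket.socket, socket.AF_INET, socket.SOCK_STREAM)
payload = pickle.dumps(connection_factory)

# "On the worker": rebuild the factory and create a fresh connection.
worker_factory = pickle.loads(payload)
worker_connection = worker_factory()

connection.close()
worker_connection.close()
```

The same pattern applies to a store: the factory travels to the worker, and the connection is only created once the factory is called there.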
@lr4d We make an exception for `eager` for convenience since there is no network traffic involved and passing the store directly is safe. We're using the `_make_callable` function to ensure that nothing is pickled.
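For readers who have not followed the link, a helper of this kind could look roughly like the sketch below. This is not kartothek's actual `_make_callable`; `ensure_factory` and `_identity` are hypothetical names, and a plain `dict` stands in for a store object.

```python
from functools import partial


def _identity(obj):
    """Return the object unchanged (hypothetical helper)."""
    return obj


def ensure_factory(store_or_factory):
    """Return a zero-argument callable that yields a store.

    Callables are passed through untouched; anything else is wrapped so
    that downstream code can always call the result to obtain a store.
    """
    if callable(store_or_factory):
        return store_or_factory
    return partial(_identity, store_or_factory)


# A plain dict stands in for a store object (assumption for illustration).
store = {}
factory = ensure_factory(store)
same_store = factory()

# An existing factory is returned as-is.
passthrough = ensure_factory(dict)
```

With a wrapper like this, the eager code path can accept either form while the rest of the code only ever deals with callables.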
@xhochy, @fjetter - in light of your responses do the following follow-up actions sound good?
1) document the reasoning behind the choice somewhere appropriate (I'm thinking the getting started guide) 2) close this issue
@kagharpure AFAIK we're passing a store factory in the getting started guide, so I would leave that as is to not confuse new users too much
@lr4d - Agreed. After posting that comment, I realized that there's already an issue (#44) that's about documenting store factories; so maybe adding a Gotchas document a bit further down the line will be a good idea, which can have a section on store factories and the reasoning behind them (as well as pitfalls, best practices, etc).
The `eager` write functions appear to expect the argument supplied to `store` to be a `store` object directly, whereas the update function appears to expect a factory (Python callable). Can this be standardized one way or the other, please?