JDASoftwareGroup / kartothek

A consistent table management library in python
https://kartothek.readthedocs.io/en/stable
MIT License
161 stars 53 forks source link

Inconsistent type of kwarg 'store' across write and update #61

Open kagharpure opened 5 years ago

kagharpure commented 5 years ago

The eager write functions appear to expect the argument supplied to store to directly be a store object, whereas the update function appears to expect a factory (python callable) - can it be standardized one way or another please?


   import numpy as np
   import pandas as pd
   from functools import partial
   from storefact import get_store_from_url
   from tempfile import TemporaryDirectory
   from kartothek.io.eager import store_dataframes_as_dataset
   from kartothek.io.eager import update_dataset_from_dataframes

   df = pd.DataFrame(
       {
           "A": 1.,
           "B": pd.Timestamp("20130102"),
           "C": pd.Series(1, index=list(range(4)), dtype="float32"),
           "D": np.array([3] * 4, dtype="int32"),
           "E": pd.Categorical(["test", "train", "test", "train"]),
           "F": "foo",
       }
   )

   dataset_dir = TemporaryDirectory()
   store = get_store_from_url(f"hfs://{dataset_dir.name}") 

   dm = store_dataframes_as_dataset(
      store, #store object works fine here  
      "a_unique_dataset_identifier", 
      df, 
      metadata_version=4
   )

   another_df = pd.DataFrame(
       {
           "A": 2.,
           "B": pd.Timestamp("20190604"),
           "C": pd.Series(2, index=list(range(4)), dtype="float32"),
           "D": np.array([6] * 4, dtype="int32"),
           "E": pd.Categorical(["test", "train", "test", "train"]),
           "F": "bar",
       }
   )

   store_factory = partial(get_store_from_url, f"hfs://{dataset_dir.name}")

   dm = update_dataset_from_dataframes(
       [another_df],
       store=store_factory, #but this needs to be a callable
       dataset_uuid="a_unique_dataset_identifier"
       )
   dm
fjetter commented 5 years ago

Usually kartothek expects proper store factories. We only make an exception for the eager implementation. We're using the utils._make_callable function https://github.com/JDASoftwareGroup/kartothek/blob/3f7344fc29dc154923dae81d5352fa34698f7059/kartothek/io_components/utils.py#L345 This is currently done for a few functions, e.g. https://github.com/JDASoftwareGroup/kartothek/blob/3f7344fc29dc154923dae81d5352fa34698f7059/kartothek/io/eager.py#L125

kagharpure commented 5 years ago

Thanks @fjetter. Out of curiosity and for the wider user-base, what's the reasoning behind the choice of store factories?

lr4d commented 5 years ago

@fjetter Why is an exception made for eager? A user would expect this to be consistent across back-ends

lr4d commented 5 years ago

eager's garbage_collect_dataset also assumes a callable. Admittedly, I implemented both these functions

xhochy commented 5 years ago

Thanks @fjetter. Out of curiosity and for the wider user-base, what's the reasoning behind the choice of store factories?

Store objects encapsulate connections to a storage service. In the methods that have a distributed computing backend, we pass the function arguments via pickle to the other workers. While pickle can preserve the state of the attributes of an object, the connections it holds are no longer valid / cannot be transferred between processes. Thus we pass callables so that on each worker a new connection can be instantiated.

fjetter commented 5 years ago

@lr4d We make an exception for eager for convenience since there is no network traffic involved and passing the store directly is safe. We're using the _make_callable function to ensure that nothing is pickled.

kagharpure commented 5 years ago

@xhochy, @fjetter - in light of your responses do the following follow-up actions sound good?

1) document the reasoning behind the choice somewhere appropriate (I'm thinking the getting started guide) 2) close this issue

lr4d commented 5 years ago

@kagharpure AFAIK we're passing a store factory in the getting started guide, so I would leave that as is to not confuse new users too much

kagharpure commented 5 years ago

@lr4d - Agreed. After posting that comment, I realized that there's already an issue (#44) that's about documenting store factories; so maybe adding a Gotchas document a bit further down the line will be a good idea, which can have a section on store factories and the reasoning behind them (as well as pitfalls, best practices, etc).