MichalChromcak opened this issue 5 years ago
Index information isn't supported by kartothek at the moment, so this behavior (i.e. index information being "lost" after read/write) should be expected.
I am not sure why the Parquet files do contain index information, maybe someone else can comment on that.
@lr4d Thank you for the quick reply. I'll treat losing that information as expected behavior. Any ideas what it would take to reconstruct it from the Parquet files (assuming they have the index stored, as in the case above)?
The easiest thing would be to just convert the index to a column. Otherwise, you might want to take a look at https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/serialization/_parquet.py#L86, and possibly add a flag to keep index information, but I'm not sure if that would work with the rest of the code.
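The "convert the index to a column" suggestion can be sketched in plain pandas; the column and variable names below are illustrative assumptions, not part of the kartothek API:

```python
import pandas as pd

# Hypothetical sensor data with a datetime index.
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.date_range("2020-01-01", periods=3, freq="h"),
)
df.index.name = "timestamp"

# Move the index into a regular column so kartothek persists it
# like any other column.
df_to_store = df.reset_index()
assert "timestamp" in df_to_store.columns

# After reading the dataset back, restore the index from the column.
df_restored = df_to_store.set_index("timestamp")
assert str(df_restored.index.dtype) == "datetime64[ns]"
```

Since kartothek only preserves regular columns, this round-trips the timestamp information at the cost of one extra column in the stored files.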
Let me add some details to the discussion:

kartothek currently returns a plain `RangeIndex` everywhere. If that's properly documented, I think this would confuse users less than the current behavior. The Parquet files store the index as a column (named `__index...`) plus some JSON metadata for the pandas-specific type information. This is actually somewhat unrelated to the fact that we're not recovering the Parquet indices, since this issue is about (re-)constructing a Dask index, which is a bit more elaborate, but I actually see the possibility of including this in kartothek.
What happens under the hood when you call `set_index` on a Dask DataFrame is that Dask calculates all unique values for the column (similar to our ktk indices) and rearranges the data accordingly. There are some implicit guarantees on the partitions and divisions (not part of the public API, but unlikely to change). The data is actually sorted by the index key s.t. the dataframe divisions (the data interval boundaries of the partitions) are sorted and unique; see e.g. https://stackoverflow.com/questions/49905306/why-do-dask-divisions-need-to-be-unique.
Using this attribute it is, for example, possible to set/construct a Dask index based on kartothek index information (code existed but was never merged since we didn't need it yet). This would be a fairly easy first step, but not exactly what is requested in this issue. I would similarly expect that we can also reconstruct a previously existing index based on min/max statistics and local (partition-wise) index setting. This is, however, a bit more complicated.
As a (not well-performing) substitute, I recommend resetting the index, storing the Dask DataFrame, and setting the index again after reading. This way, the partitioning information of the index stays intact, and I'd expect the second index setting to be a bit faster than usual since the data is already arranged as it's supposed to be. I haven't tried this, though.
Problem description
When reading a kartothek dataset with `read_dataset_as_ddf`, I am losing the original datetime index that was stored with `update_dataset_from_ddf`, even though the child Parquet files in the kartothek dataset's directory still keep the datetime index. Can you please take a look at this? The original data comes from machine sensors and is mocked here:

Example code (ideally copy-pastable)
Check index type
Output being correctly datetime64[ns]
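The original snippet is not reproduced in this thread; a minimal hedged reconstruction of the mocked sensor data and the index check might look like this (column and variable names are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical mock of the machine-sensor data described above.
df = pd.DataFrame(
    {"sensor_value": np.random.rand(24)},
    index=pd.date_range("2021-01-01", periods=24, freq="h", name="timestamp"),
)

# Check index type
print(df.index.dtype)  # datetime64[ns]
```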
Creating dataset and reading it back
Checking index
Output being incorrectly int64
When reading one of the child Parquet files of the kartothek directory directly with `dd.read_parquet`, the datetime index is kept
Output being correctly datetime64[ns]
Used versions