read_dataset_as_ddf does not return stored datetime index #129

Open MichalChromcak opened 5 years ago

MichalChromcak commented 5 years ago

Problem description

When I read a kartothek dataset with read_dataset_as_ddf, I lose the original datetime index that was stored with update_dataset_from_ddf, even though the child Parquet files in the kartothek dataset's directory still keep the index as datetime. Can you please take a look at that? The original data comes from machine sensors and is mocked here:

Example code (ideally copy-pastable)

import pandas as pd
from functools import partial
from storefact import get_store_from_url
from kartothek.io.dask.dataframe import update_dataset_from_ddf
from kartothek.io.dask.dataframe import read_dataset_as_ddf
import dask.dataframe as dd

# Create dask dataframe
ddf = (dd.from_pandas(
        .assign(date=lambda x:x['dateTime'].dt.date)

Check index type


The output is correctly datetime64[ns]

Dask Index Structure:
2019-08-15 23:59:58    datetime64[ns]
2019-08-15 23:59:59               ...
Name: dateTime, dtype: datetime64[ns]
Dask Name: sort_index, 7 tasks

Creating dataset and reading it back

store = './karto/'
store_factory = partial(get_store_from_url, "hfs://" + store)

                        table = 'table',

ds = read_dataset_as_ddf(dataset_uuid='test_data',

Checking index


The output is incorrectly int64

Dask Index Structure:
dtype: int64
Dask Name: from-delayed, 4 tasks

Reading one of the child Parquet files in the kartothek directory with dd.read_parquet, however, keeps the datetime index


The output is correctly datetime64[ns]

Dask Index Structure:
Name: dateTime, dtype: datetime64[ns]
Dask Name: read-parquet, 2 tasks

Used versions

lr4d commented 5 years ago

Index information isn't supported by kartothek at the moment, so this behavior (i.e. index information being "lost" after read/write) should be expected.

I am not sure why the Parquet files do contain index information; maybe someone else can comment on that.

MichalChromcak commented 5 years ago

@lr4d Thank you for the quick reply. I'll treat losing that information as expected behavior. Any ideas on what it would take to reconstruct it from the Parquet files (assuming they have the index stored, as in the case above)?

lr4d commented 5 years ago

The easiest thing would be to just convert the index to a column. Otherwise, you might want to take a look at https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/serialization/_parquet.py#L86, and possibly add a flag to keep index information, but I'm not sure if that would work with the rest of the code.
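A minimal sketch of the "convert the index to a column" suggestion, using plain pandas. The data is a hypothetical stand-in for the sensor readings, and the storage round trip is only simulated with a copy, so this shows the idea rather than kartothek's actual behavior:

```python
import pandas as pd

# Hypothetical stand-in for the sensor data from the issue.
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.date_range("2019-08-15", periods=3, freq="min", name="dateTime"),
)

# Move the index into a regular column before storing.
to_store = df.reset_index()

# ... write `to_store` and read it back here ...
# The round trip is simulated with a copy (an assumption of this sketch).
loaded = to_store.copy()

# Rebuild the index from the column after reading.
restored = loaded.set_index("dateTime")
print(restored.index.dtype)
```

Because the timestamps travel as an ordinary column, they do not depend on index metadata being preserved by the storage layer.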

crepererum commented 5 years ago

Let me add some details to the discussion:

fjetter commented 5 years ago

This is actually somewhat unrelated to the fact that we're not recovering the Parquet indices: this issue is about (re-)constructing a dask index, which is a bit more elaborate. Still, I see the possibility of including this in kartothek.

What happens under the hood when you call set_index on a Dask DataFrame is that dask calculates all unique values (similar to our ktk indices) for the column and rearranges the data accordingly. There are some implicit guarantees on the partitions and divisions (not part of the public API but unlikely to change). The data is actually sorted by the index key such that the dataframe divisions (data interval boundaries of the partitions) are sorted and unique, e.g. see https://stackoverflow.com/questions/49905306/why-do-dask-divisions-need-to-be-unique.

Using the divisions, it is, for example, possible to set/construct a dask index based on kartothek index information (code existed but was never merged since we didn't need it yet). This would be a not-so-difficult first step, but it is not exactly what is requested in this issue. I would similarly expect that we can also reconstruct a previously existing index based on min/max stats and local (partition-wise) index setting. This is, however, a bit more complicated.

As a (not well performing) substitute, I recommend resetting the index, storing the Dask DataFrame, and setting the index again after reading. This way, the partitioning information of the index stays intact, and I'd expect the second index setting to be a bit faster than usual since the data is already arranged as it's supposed to be. I haven't tried this, though.