The Kartothek>=4 behavior of this code is to corrupt the dataset:
Traceback (most recent call last):
  File "test_kartothek_create_update.py", line 46, in <module>
    validate()
  File "test_kartothek_create_update.py", line 35, in validate
    ddf = read_dataset_as_ddf(dataset_uuid, store, "predictions")
  File "<decorator-gen-7>", line 2, in read_dataset_as_ddf
  File "/Users/lgtf/git/kartothek-fork/kartothek/io_components/utils.py", line 277, in normalize_args
    return _wrapper(*args, **kwargs)
  File "/Users/lgtf/git/kartothek-fork/kartothek/io_components/utils.py", line 275, in _wrapper
    return function(*args, **kwargs)
  File "/Users/lgtf/git/kartothek-fork/kartothek/io/dask/dataframe.py", line 113, in read_dataset_as_ddf
    delayed_partitions = read_dataset_as_delayed(
  File "/Users/lgtf/git/kartothek-fork/kartothek/io/dask/delayed.py", line 239, in read_dataset_as_delayed
    mps = read_dataset_as_delayed_metapartitions(
  File "<decorator-gen-5>", line 2, in read_dataset_as_delayed_metapartitions
  File "/Users/lgtf/git/kartothek-fork/kartothek/io_components/utils.py", line 277, in normalize_args
    return _wrapper(*args, **kwargs)
  File "/Users/lgtf/git/kartothek-fork/kartothek/io_components/utils.py", line 275, in _wrapper
    return function(*args, **kwargs)
  File "/Users/lgtf/git/kartothek-fork/kartothek/io/dask/delayed.py", line 217, in read_dataset_as_delayed_metapartitions
    return list(mps)
  File "/Users/lgtf/git/kartothek-fork/kartothek/io_components/read.py", line 102, in dispatch_metapartitions_from_factory
    yield MetaPartition.from_partition(
  File "/Users/lgtf/git/kartothek-fork/kartothek/io_components/metapartition.py", line 426, in from_partition
    file=partition.files[table_name],
KeyError: 'predictions'
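The KeyError in the last frame is consistent with a dataset whose partitions were not all written under the same table name. The following is a minimal sketch in plain Python, not Kartothek code: the `partitions` dict, the partition labels, the file keys, and the helper names are hypothetical stand-ins, and the assumption that the creation step used the name `table` is mine. Only the `files[table_name]` lookup mirrors what the traceback shows.

```python
# Hypothetical stand-in for dataset metadata: each partition maps
# table name -> file key. These are NOT Kartothek's real data structures.
partitions = {
    # Assumed: written by the creation step under the name "table".
    "part_created": {"table": "part_created/table.parquet"},
    # Assumed: written by the update step under the name "predictions".
    "part_updated": {"predictions": "part_updated/predictions.parquet"},
}


def files_for_table(partitions, table_name):
    """Look up the file for `table_name` in every partition."""
    # Mirrors `partition.files[table_name]` from the traceback: raises
    # KeyError as soon as one partition lacks the requested table.
    return [files[table_name] for files in partitions.values()]


def tables_in_dataset(partitions):
    """Collect every table name that appears in any partition."""
    return {name for files in partitions.values() for name in files}


try:
    files_for_table(partitions, "predictions")
except KeyError as exc:
    # The first partition has no "predictions" entry, so the read fails.
    missing_table = exc.args[0]
```

In this toy model, `tables_in_dataset` returning more than one name is the sign that the creation and update steps diverged.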
Expected Behavior:
Based on these changelog entries for Kartothek 4.0, I would expect Kartothek to infer the correct table name if it is left out:
All read pipelines will now automatically infer the table to read such that it is no longer necessary to provide table or table_name as an input argument
All writing pipelines which previously supported a complex user input type now expose an argument table_name which can be used to continue usage of legacy datasets (i.e. datasets with an intrinsic, non-trivial table name). This usage is discouraged and we recommend users to migrate to a default table name (i.e. leave it None / table)
However, replicating the Kartothek<4 behavior would also be acceptable to me.
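Both acceptable outcomes can be sketched as one small check run before an update writes anything. This is an illustration only, not Kartothek's implementation: `resolve_table_name`, its parameters, and the choice of `ValueError` are hypothetical, and the default name `table` is assumed from the changelog wording.

```python
DEFAULT_TABLE_NAME = "table"  # assumed default, per the changelog wording


def resolve_table_name(existing_tables, table_name=None):
    """Pick the table name an update should use, or refuse to proceed.

    `existing_tables` is the set of table names already stored in the
    dataset (hypothetical input; empty for a new dataset).
    """
    existing = set(existing_tables)
    if len(existing) > 1:
        # The dataset is already inconsistent; do not make it worse.
        raise ValueError(f"Dataset has multiple table names: {sorted(existing)}")
    if table_name is None:
        # Outcome 1: infer the name from the dataset when it is left out.
        return next(iter(existing)) if existing else DEFAULT_TABLE_NAME
    if existing and table_name not in existing:
        # Outcome 2: fail loudly instead of writing a diverging table name.
        raise ValueError(
            f"Table name {table_name!r} does not match existing {sorted(existing)}"
        )
    return table_name
```

For example, `resolve_table_name({"predictions"})` infers `predictions`, while `resolve_table_name({"table"}, "predictions")` raises before any partition is written.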
This is a follow-up issue to #445 and #451. Currently, datasets can still be corrupted if the table name diverges between the creation and update steps.
Snippet:
The Kartothek<4 behavior of this code is to raise a TypeError.