JDASoftwareGroup / kartothek

A consistent table management library in python
https://kartothek.readthedocs.io/en/stable
MIT License
161 stars 53 forks source link

Fix dataset corruption after updates (#445) #451

Closed stephan-hesselmann-by closed 3 years ago

stephan-hesselmann-by commented 3 years ago

When updating a dataset with a table name other than 'table', an additional table named 'table' is erroneously created. This corrupts the dataset. The issue was introduced after deprecating the table name feature in the 4.0.0 release. The root cause is not passing the table name as an argument within partition_on and add_metapartition, which leads to the default table name "table" being used.

Description:

Briefly describe the change of behavior

fjetter commented 3 years ago

Can we put an end-to-end test in, as well? create a dataset w/ custom table name, update it, ensure it is not corrupt? from this unit test it is not immediately clear that this fixes the problem

stephan-hesselmann-by commented 3 years ago

I believe the issue described in #445 should be fixed with these changes (scenario where the table name is non-default but equal for create and update). However, I foresee problems when the table name diverges between creation and update, i.e.

    # Create with table name "predictions"
    delayed = update_dataset_from_ddf(ddf, store, dataset_uuid, table='predictions', partition_on=['date'])

    # Update with default table name
    delayed = update_dataset_from_ddf(ddf, store, dataset_uuid, partition_on=['date'])

What whould the expected behavior be in such a case? Should the table name be inferred as "predictions"? I believe that will require more modifications to the code.

stehessel commented 3 years ago

What whould the expected behavior be in such a case? Should the table name be inferred as "predictions"? I believe that will require more modifications to the code.

I will create a follow up issue for that. Merging this for now as the build failure is already on master and unrelated to these changes.