delta-incubator / delta-kernel-rs

A native Delta implementation for integration with any query engine
Apache License 2.0

Can't always find partition constants in map #232

Open samansmink opened 3 months ago

samansmink commented 3 months ago

When using duckdb_delta, looking up partition constants does not always work, and I'm not sure why. Some tests pass and some fail:

(Passing) Data generated using delta-rs

This test passes (links to the test and to the data generation are in the original issue). DuckDB can correctly find the partition constant for each file using ffi::get_from_map in the ffi::visit_scan_data callback.

(Failing) Test with delta_kernel/kernel/tests/data/basic_partitioned

When DuckDB calls ffi::get_from_map in the callback from ffi::visit_scan_data, the letter column is not found (the test is linked in the original issue).

(Failing) Test with delta_kernel/acceptance/tests/dat/out/reader_tests/generated/basic_partitioned

Same as the previous test: the lookup with get_from_map returns NULL even though the column should be there.

@nicklan let me know if you need anything more here

nicklan commented 3 months ago

This is basically caused by letter being NULL in one of the partitions of that table (the __HIVE_DEFAULT_PARTITION__ one in particular).

The value not being in the map is exactly what indicates that it should be NULL in the output.
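As a minimal sketch of that convention, assuming the partition values arrive as a plain string-to-string map: a missing key is read as NULL rather than treated as an error. The helper names `partition_value` and `render` are hypothetical and are not part of the kernel or its FFI.

```rust
use std::collections::HashMap;

/// Hypothetical helper: look up the partition constant for `column`.
/// `None` means the key is absent, which by the convention described
/// above means the partition value is NULL for this file.
fn partition_value<'a>(
    partition_values: &'a HashMap<String, String>,
    column: &str,
) -> Option<&'a str> {
    partition_values.get(column).map(String::as_str)
}

/// Hypothetical helper: show what an engine would fill the column with.
fn render(value: Option<&str>) -> String {
    match value {
        Some(v) => format!("'{}'", v),
        None => "NULL".to_string(),
    }
}
```

With a map containing `letter -> a`, `partition_value(&map, "letter")` yields `Some("a")`. For a file under `letter=__HIVE_DEFAULT_PARTITION__` the map would have no `letter` entry, so the lookup yields `None` and the column is filled with NULL.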

It seems that when you propagate this to the multi-file reader, the constant isn't specified, so the reader tries to read that column from the file, and that's what causes this odd error:

IO Error: Failed to read file "../delta-kernel-rs/kernel/tests/data/basic_partitioned/letter=__HIVE_DEFAULT_PARTITION__/part-00000-8eb7f29a-e6a1-436e-a638-bbf0a7953f09.c000.snappy.parquet": schema mismatch in glob: column "letter" was read from the original file "../delta-kernel-rs/kernel/tests/data/basic_partitioned/letter=__HIVE_DEFAULT_PARTITION__/part-00000-8eb7f29a-e6a1-436e-a638-bbf0a7953f09.c000.snappy.parquet", but could not be found in file "../delta-kernel-rs/kernel/tests/data/basic_partitioned/letter=__HIVE_DEFAULT_PARTITION__/part-00000-8eb7f29a-e6a1-436e-a638-bbf0a7953f09.c000.snappy.parquet".

There is indeed no letter column in the file; it should be filled in with NULL.

I see that the constant_map in the visit_callback is a <string> map, so I'm not sure we can indicate in there that a column should be NULL. We might need to extend it to a <Value> map, or find some other way to tell the reader that it should fill the column with NULL.
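One way to picture the "<Value> map" idea is a map whose values can say "NULL" explicitly, so the reader can tell a NULL partition constant apart from a column that is simply not a partition column. This is only a sketch of the idea under that assumption: `PartitionConstant`, `ColumnSource`, and `column_source` are hypothetical names, not types from delta-kernel-rs or duckdb_delta.

```rust
use std::collections::HashMap;

/// Hypothetical value type for the constant map: either an explicit NULL
/// or a concrete (string-encoded) partition value.
#[derive(Debug, Clone, PartialEq)]
enum PartitionConstant {
    Null,
    Value(String),
}

/// Hypothetical decision the reader makes for each requested column.
#[derive(Debug, PartialEq)]
enum ColumnSource {
    /// Not a partition column: read it from the parquet file.
    ReadFromFile,
    /// Partition column whose value is NULL: fill with NULL.
    FillNull,
    /// Partition column with a concrete value: fill with this constant.
    FillConstant(String),
}

fn column_source(
    constants: &HashMap<String, PartitionConstant>,
    column: &str,
) -> ColumnSource {
    match constants.get(column) {
        None => ColumnSource::ReadFromFile,
        Some(PartitionConstant::Null) => ColumnSource::FillNull,
        Some(PartitionConstant::Value(v)) => ColumnSource::FillConstant(v.clone()),
    }
}
```

With an explicit `Null` entry for `letter`, the reader would fill the column instead of trying to read `letter` from a file that does not contain it, which is what triggers the schema mismatch error above.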