delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Support string and binary view types. #2613

Open ritchie46 opened 1 week ago

ritchie46 commented 1 week ago

Arrow has adopted the (IMO) much better binary and string view types. Supporting these would mean we could move data zero-copy to delta-rs. Currently it fails:

import polars as pl

df = pl.DataFrame({"a": ["foo", "bar"]})
df.write_delta("/tmp/bar.bar", mode="overwrite")
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In[5], line 2
      1 df = pl.DataFrame({"a": ["foo", "bar"]})
----> 2 df.write_delta("/tmp/bar.bar", mode="overwrite")

File ~/code/polars/py-polars/polars/dataframe/frame.py:3968, in DataFrame.write_delta(self, target, mode, overwrite_schema, storage_options, delta_write_options, delta_merge_options)
   3965     delta_write_options["schema_mode"] = "overwrite"
   3967 schema = delta_write_options.pop("schema", None)
-> 3968 write_deltalake(
   3969     table_or_uri=target,
   3970     data=data,
   3971     schema=schema,
   3972     mode=mode,
   3973     storage_options=storage_options,
   3974     large_dtypes=True,
   3975     **delta_write_options,
   3976 )
   3977 return None

File ~/miniconda3/lib/python3.10/site-packages/deltalake/writer.py:519, in write_deltalake(table_or_uri, data, schema, partition_by, mode, file_options, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, name, description, configuration, overwrite_schema, schema_mode, storage_options, partition_filters, predicate, large_dtypes, engine, writer_properties, custom_metadata)
    514 else:
    515     file_options = ds.ParquetFileFormat().make_write_options(
    516         use_compliant_nested_type=False
    517     )
--> 519 ds.write_dataset(
    520     data,
    521     base_dir="/",
    522     basename_template=f"{current_version + 1}-{uuid.uuid4()}-{{i}}.parquet",
    523     format="parquet",
    524     partitioning=partitioning,
    525     # It will not accept a schema if using a RBR
    526     schema=schema if not isinstance(data, RecordBatchReader) else None,
    527     file_visitor=visitor,
    528     existing_data_behavior="overwrite_or_ignore",
    529     file_options=file_options,
    530     max_open_files=max_open_files,
    531     max_rows_per_file=max_rows_per_file,
    532     min_rows_per_group=min_rows_per_group,
    533     max_rows_per_group=max_rows_per_group,
    534     filesystem=filesystem,
    535     max_partitions=max_partitions,
    536 )
    538 if table is None:
    539     write_deltalake_pyarrow(
    540         table_uri,
    541         schema,
   (...)
    549         custom_metadata,
    550     )

File ~/miniconda3/lib/python3.10/site-packages/pyarrow/dataset.py:1030, in write_dataset(data, base_dir, basename_template, format, partitioning, partitioning_flavor, schema, filesystem, file_options, use_threads, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, file_visitor, existing_data_behavior, create_dir)
   1027         raise ValueError("Cannot specify a schema when writing a Scanner")
   1028     scanner = data
-> 1030 _filesystemdataset_write(
   1031     scanner, base_dir, basename_template, filesystem, partitioning,
   1032     file_options, max_partitions, file_visitor, existing_data_behavior,
   1033     max_open_files, max_rows_per_file,
   1034     min_rows_per_group, max_rows_per_group, create_dir
   1035 )

File ~/miniconda3/lib/python3.10/site-packages/pyarrow/_dataset.pyx:4010, in pyarrow._dataset._filesystemdataset_write()

File ~/miniconda3/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: string_view

ion-elgreco commented 1 week ago

@ritchie46 That's the pyarrow writer engine; we don't really have control there to change this. Our Rust writer relies on the arrow_cast crate, which supports utf8_view: https://docs.rs/arrow-cast/52.0.0/src/arrow_cast/cast/mod.rs.html#709.

Can you check with delta_write_options = {"engine": "rust"}? This should at least run, but it will probably cast utf8_view to utf8 in the Rust writer, since our Delta-schema-to-Arrow-schema conversion naively translates the primitive string type to Arrow utf8.
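
For example, a minimal sketch (path is just a placeholder); delta_write_options is forwarded as keyword arguments to write_deltalake, whose signature above includes an engine parameter:

import polars as pl

df = pl.DataFrame({"a": ["foo", "bar"]})
df.write_delta(
    "/tmp/bar.bar",
    mode="overwrite",
    delta_write_options={"engine": "rust"},
)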

Also which pyarrow version did you use here?

ritchie46 commented 1 week ago

> Also which pyarrow version did you use here?

Pyarrow 16

> Can you check with delta_write_options = {"engine": "rust"}? This should at least run, but it will probably cast utf8_view to utf8 in the Rust writer, since our Delta-schema-to-Arrow-schema conversion naively translates the primitive string type to Arrow utf8.

Hmm.. the whole cast is the thing I want to circumvent. :/ Are there still impediments on the Rust side?

ion-elgreco commented 1 week ago

> Hmm.. the whole cast is the thing I want to circumvent. :/ Are there still impediments on the Rust side?

I understand :) On the Rust side we should allow utf8_view to be passed through, but right now we always cast incoming record batches to the Delta schema after it has been converted to an Arrow schema.

In that conversion there is no way yet to let the Delta string type map to either Arrow utf8, large_utf8, or utf8_view; it always becomes Arrow utf8. That is also a current problem if your source has a large_utf8 array that is too big to downcast to utf8..
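
As a rough illustration of the effect (paths are placeholders, and this assumes the write through the Rust engine succeeds): the column that Polars exports as Arrow string_view comes back as plain utf8, because the Delta string type is always translated to Arrow utf8 today.

import polars as pl
from deltalake import DeltaTable

df = pl.DataFrame({"a": ["foo", "bar"]})  # exported to Arrow as string_view by recent Polars

# Route the write through the Rust engine so the pyarrow Parquet writer
# (which rejects string_view, per the traceback above) is bypassed.
df.write_delta(
    "/tmp/view_roundtrip",
    mode="overwrite",
    delta_write_options={"engine": "rust"},
)

# The Delta "string" type currently always maps back to Arrow utf8 when the
# Delta schema is converted to an Arrow schema, so the view type is lost.
print(DeltaTable("/tmp/view_roundtrip").schema().to_pyarrow())
# expected: a: string   (plain utf8, not string_view)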