delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

write_deltalake fails on Databricks volume #2540

Closed Bernolt closed 1 month ago

Bernolt commented 1 month ago

Environment

Delta-rs version: 0.17.4

Binding: python (pyarrow engine)

Environment:


Bug

What happened: From a Python application running on a Databricks cluster, I want to write to an append-only Delta table. The function is called as follows:

write_deltalake(
    data=arrow_table,
    table_or_uri="/Volumes/catalog/schema/volume_path/table_path",
    mode="append",
    overwrite_schema=False,
)

However, I am getting the below error:

OSError: Generic LocalFileSystem error: Unable to copy file from /Volumes/catalog/schema/volume_path/table_path/_delta_log/_commit_e964ab56-f56c-403a-b06d-fe2b6bcabf9d.json.tmp to /Volumes/catalog/schema/volume_path/table_path/_delta_log/00000000000000000000.json: Function not implemented (os error 38)

What you expected to happen: Since Databricks supports copy/rename/delete operations on volumes, I would expect this to work. As far as I know, Databricks uses a Local File System API, which emulates a filesystem on top of cloud storage.
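To narrow it down, here is a minimal probe (the paths are placeholders, matching the anonymized ones above). My assumption, which I have not verified against the delta-rs source, is that the copy-if-not-exists commit step is implemented with a hard link on local filesystems, and hard links are commonly unimplemented on FUSE mounts, which would explain the errno 38 (ENOSYS):

import errno
import os

# Hypothetical probe: does this mount implement hard links?
# Paths below are placeholders for an actual Unity Catalog volume.
src = "/Volumes/catalog/schema/volume_path/probe_src.tmp"
dst = "/Volumes/catalog/schema/volume_path/probe_dst.tmp"

with open(src, "w") as f:
    f.write("probe")

try:
    os.link(src, dst)  # hard link, not a rename
    os.remove(dst)
    print("hard links supported")
except OSError as e:
    if e.errno == errno.ENOSYS:  # os error 38: Function not implemented
        print("hard links not implemented on this mount")
    else:
        raise
finally:
    os.remove(src)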

How to reproduce it: I made the below notebook to reproduce the error. It needs to be run on a Databricks Runtime.

# Databricks notebook source
# MAGIC %sh
# MAGIC touch /Volumes/catalog/schema/volume/table_path/to_rename.tmp

# COMMAND ----------

# MAGIC %sh
# MAGIC mv /Volumes/catalog/schema/volume/table_path/to_rename.tmp /Volumes/catalog/schema/volume/table_path/renamed.todelete

# COMMAND ----------

# MAGIC %sh 
# MAGIC rm /Volumes/catalog/schema/volume/table_path/renamed.todelete

# COMMAND ----------

from deltalake import write_deltalake
import pyarrow as pa

# COMMAND ----------

arrow_table = pa.table([
    pa.array([2, 4, 5, 100]),
    pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
    ], names=['n_legs', 'animals'])

# COMMAND ----------

write_deltalake(
    table_or_uri="/Volumes/catalog/schema/volume/table_path/reproduce_deltars_error_table_01",
    data=arrow_table,
    mode="append",
    overwrite_schema=False,
)


Bernolt commented 1 month ago

It might not be a bug from a delta-rs perspective; however, it would be helpful to have some insight into the underlying file system operations being performed.

ion-elgreco commented 1 month ago

Afaik, Databricks volumes are FUSE-mounted, so this is not a bug. If you want to write to mounted storage that doesn't support CopyIfNotExists, you can pass this to the writer:

storage_options = {"allow_unsafe_rename": "true"}
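For completeness, a minimal sketch of the original call with that option applied (paths as in the report above):

write_deltalake(
    data=arrow_table,
    table_or_uri="/Volumes/catalog/schema/volume_path/table_path",
    mode="append",
    overwrite_schema=False,
    # Fall back to a plain rename instead of copy-if-not-exists
    storage_options={"allow_unsafe_rename": "true"},
)

As I understand it, allow_unsafe_rename gives up the atomic copy-if-not-exists guarantee on the commit file, so it is only safe when a single writer owns the table.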

Bernolt commented 1 month ago

Thanks, that solved my issue.