apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

Add support for Fabric OneLake, which is already supported by Delta_rs #38726

Open djouallah opened 7 months ago

djouallah commented 7 months ago

Describe the enhancement requested

Delta_rs recently added support for Fabric OneLake. It would be nice to add the same support to pyarrow dataset, so it can read Parquet, CSV, etc. from Fabric OneLake.

https://github.com/delta-io/delta-rs/pull/1642

Currently I am using this code to read from a dataset and save it as a Delta table, but the read side works only with a local path:

import pyarrow.dataset as ds
from deltalake.writer import write_deltalake

# mssparkutils is assumed to be provided by the notebook runtime
aadToken = mssparkutils.credentials.getToken('storage')
storage_options = {"bearer_token": aadToken, "use_fabric_endpoint": "true"}

sf = 100
rowgroup = 2000000
nbr_rowgroup_File = 8 * rowgroup

for tbl in ['lineitem', 'nation', 'region', 'customer', 'supplier', 'orders', 'part', 'partsupp']:
    print(tbl)
    # Reading works only from the local mount path
    dataset = ds.dataset(f'/lakehouse/default/Files/{sf}/{tbl}', format="parquet")
    # Writing to OneLake goes through delta-rs storage_options
    write_deltalake(
        f"abfss://Rust@onelake.dfs.fabric.microsoft.com/test.Lakehouse/Tables/{tbl}",
        dataset,
        mode='overwrite',
        overwrite_schema=True,
        max_rows_per_file=nbr_rowgroup_File,
        min_rows_per_group=rowgroup,
        max_rows_per_group=rowgroup,
        storage_options=storage_options,
    )

Component(s)

Format, Integration, Python

AlenkaF commented 7 months ago

Support for Azure Blob Storage is in progress in Apache Arrow C++ (https://github.com/search?q=repo%3Aapache%2Farrow+%5BC%2B%2B%5D%5BFS%5D%5BAzure%5D&type=issues) and will be available in Python as a follow-up. Would that be something we can use to read from Fabric OneLake?