Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
1.87k stars 118 forks

Issue with read_hudi on Windows due to backslashes in S3 URIs #2295

Closed kirillklimenko closed 1 month ago

kirillklimenko commented 1 month ago

Describe the bug

When using daft.read_hudi on Windows with an S3 URI, a FileNotFoundError is thrown. This is likely because os.path.join is used to build paths under the S3 URI, and on Windows it joins components with backslashes ("\").

FileNotFoundError: File: s3://bucket/test\year=2024/month=05/day=17/24ebd153-6cf9-425f-88e9-91ca243bf973-0_2-41-1195_20240523052231633.parquet not found

FileNotFoundError: File: s3://bucket/test\.hoodie\hoodie.properties not found
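
The mangled path above can be reproduced on any OS with ntpath, the module that os.path resolves to on Windows. This is a minimal sketch, assuming the Hudi reader builds its paths with os.path.join (the report suspects this but does not confirm it):

```python
import ntpath  # the implementation behind os.path on Windows

table_uri = "s3://bucket/test"

# On Windows, os.path.join is ntpath.join, which puts "\" between components.
props = ntpath.join(table_uri, ".hoodie", "hoodie.properties")
print(props)  # s3://bucket/test\.hoodie\hoodie.properties
```

S3 treats the backslash as an ordinary character in the object key, so the lookup targets a key that does not exist.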

To Reproduce

Steps to reproduce the behavior:

  1. Run the following code on a Windows machine:

     import daft

     config = daft.io.IOConfig(s3=daft.io.S3Config(region_name="us-east-2"))
     df = daft.read_hudi("s3://bucket/test", io_config=config)
     df.show()

Expected behavior

The daft.read_hudi function should read from the specified S3 URI without throwing a FileNotFoundError.

Environment

OS: Windows 11
Daft version: 0.2.24
Python version: 3.10

Additional context

This issue does not occur on Linux or macOS, as these systems use forward slashes ("/") in file paths. A potential fix for this issue could be to always use forward slashes when constructing S3 URIs, regardless of the operating system.

Recommended library: pathlib.
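
One way to always use forward slashes is to join with posixpath instead of os.path. The sketch below is illustrative only and is not the fix that was merged into Daft; join_uri is a hypothetical helper name. It also shows a caveat with pathlib: PurePosixPath collapses the "//" after the URI scheme, so its output needs care before being used as an S3 URI.

```python
import posixpath
from pathlib import PurePosixPath


def join_uri(base: str, *parts: str) -> str:
    """Hypothetical helper: join URI components with "/" on every OS.

    Note: like os.path.join, a part starting with "/" discards what precedes it.
    """
    return posixpath.join(base, *parts)


props = join_uri("s3://bucket/test", ".hoodie", "hoodie.properties")
print(props)  # s3://bucket/test/.hoodie/hoodie.properties

# Caveat: pathlib normalizes the double slash after "s3:" away.
via_pathlib = str(PurePosixPath("s3://bucket/test") / ".hoodie")
print(via_pathlib)  # s3:/bucket/test/.hoodie
```

Because posixpath.join behaves the same on every platform, the resulting URI does not depend on the OS running the code.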

jaychia commented 1 month ago

Oh yikes -- good catch! We might have to fix some of the pyhudi code (cc @xushiyan)

colin-ho commented 1 month ago

Hey @kirillklimenko, I just merged in a fix for this, should be good to go in next release!