Unable to connect to S3 bucket Hudi Table

aditya-raval-genea commented 3 months ago

Is there an existing issue for this?

[X] I have searched the existing issues

Description of the bug

I am trying to connect to S3 bucket, but I am not able to connect with it. Refer this as .env file AWS_ACCESS_KEY_ID = "XXX" AWS_SECRET_ACCESS_KEY = "XXX" AWS_DEFAULT_REGION = "us-east-1"

Refer this as python snippet

debug = True
from dotenv import load_dotenv
load_dotenv()  # take environment variables from .env.
import os

from hudi import HudiTable
import pyarrow as pa
import pyarrow.compute as pc
import duckdb

print(os.environ)  # print env

hudi_table_cloud = HudiTable("s3://sample-bucket/data/")
# hudi_table_cloud = HudiTable("file:///Users/user1/data")

records_cloud = hudi_table_cloud.read_snapshot()
arrow_table = pa.Table.from_batches(records_cloud)

con = duckdb.connect()

duck_results = con.execute(
    """
    SELECT *
    from
    arrow_table
    """
).arrow()

print(duck_results)

Steps To Reproduce

I am simply running the python file, I am able to see env file is printing appropriate data

Expected behavior

It should be able to print results on console.

P.S. It is working fine with local hudi data inside file system

Screenshots / Logs

No response

Software information

Operating system: MacOS
Rust version: rustc 1.71.1
Project version: 0.1.0

Additional context

No response

aditya-raval-genea commented 3 months ago

Hi @abyssnlp I hope you are doing well. Are you guys are going to plan releasing fix for this issue?

abyssnlp commented 3 months ago

Hi @aditya-raval-genea,

I'm not a maintainer but I did go through the source and it looks like the Python bindings rely on HudiTable so datafusion is not the issue here. It should return a List[pyarrow.RecordBatch] when querying via python.

I tried it out and was able to query S3.

.env with the relevant env vars:

AWS_ACCESS_KEY_ID="abc"
AWS_SECRET_ACCESS_KEY="123abc"
AWS_DEFAULT_REGION="eu-central-1"

from hudi import HudiTable
import pyarrow as pa
import duckdb

from dotenv import load_dotenv
load_dotenv()

table = HudiTable("s3://test-hudi-rs/v6_nonpartitioned/")

records = table.read_snapshot()

arrow_table = pa.Table.from_batches(records)

conn = duckdb.connect()

result = conn.execute("select * from arrow_table").arrow()

print(result)

It works for me and gives back the results I expect. Could you share the logs/exceptions that you encounter?

aditya-raval-genea commented 3 months ago

Hi @abyssnlp thanks for the prompt response, I am not getting any error messages nor any logs or data, my console hangs and even using CTRL + C I am not able to exit, I forcefully need to kill the terminal instance.

abyssnlp commented 3 months ago

Hi @aditya-raval-genea . Could you please share the following details:

Python version
pyproject.toml or requirements.txt
Versions of pyarrow, duckdb
the redacted version of your .env
Size of the dataset you're trying to read from s3

aditya-raval-genea commented 3 months ago

Sure @abyssnlp here it is

Python version is

3.12.3

Requirements.txt

python-dotenv==1.0.1
opensearch_py==2.6.0
pyspark==3.5.1
ddtrace==2.9.3
hudi==0.1.0
duckdb==1.0.0
getdaft[all]==0.2.32
pandas==2.2.2
Flask==3.0.3
Flask-API==3.1

The redacted version of . env

AWS_ACCESS_KEY_ID="XXXX"
AWS_SECRET_ACCESS_KEY="XXXX"
AWS_DEFAULT_REGION="us-east-1"

Size of the dataset you're trying to read from s3 - I am trying to query 42000 records partitioned by customer_uuid=ABCD/year=2023/month=4/day=17

abyssnlp commented 3 months ago

Hi @aditya-raval-genea,

Thanks for sharing the details.

I created an isolated environment with the configuration you'd shared (Python 3.12 and all the relevant packages with pinned versions), and I'm able to successfully read from the Hudi table. I simulated a dataset with around 45k records on s3 using Hudi version 0.15.0.

Here are my 2 cents:

It might be that the partitioning scheme on customer_uuid is not appropriate as partitioning should be done on low cardinality columns. If the cardinality is too high, you would have too many small files and there would be a huge file IO overhead. Try partitioning with the year/month/day scheme.
I noticed you have pyspark as one of your dependencies. In case you don't manage to get hudi-rs to work for your use-case, it might also be feasible to use pyspark to read the Hudi table as it has excellent integration with it. You can read more about it here.

Let me know if either of these work out for you.

aditya-raval-genea commented 3 months ago

@abyssnlp Thank you so much for your help and support. The same piece of code works well, when I use local hudi table, it only gives long wait when I try to connect with S3 hudi table. And I want to expose the API out of this so I would rather go with hudi-rs than pyspark.

abyssnlp commented 3 months ago

Hi @aditya-raval-genea

Understood. So did you get back the expected results even though it took a long time? Network IO is slower than disk IO especially if for ex. the bucket is present in a different region further from you.

xushiyan commented 3 months ago

@abyssnlp thanks for helping look into this. looks like there isn't an issue loading the creds. closing this now.

apache / hudi-rs