Hi @abyssnlp, I hope you are doing well. Are you planning to release a fix for this issue?
Hi @aditya-raval-genea,
I'm not a maintainer, but I did go through the source, and it looks like the Python bindings rely on HudiTable, so datafusion is not the issue here. It should return a List[pyarrow.RecordBatch] when querying via Python.
I tried it out and was able to query S3.
.env with the relevant env vars:
AWS_ACCESS_KEY_ID="abc"
AWS_SECRET_ACCESS_KEY="123abc"
AWS_DEFAULT_REGION="eu-central-1"
from hudi import HudiTable
import pyarrow as pa
import duckdb
from dotenv import load_dotenv

# Load the AWS credentials from .env
load_dotenv()

# Read a snapshot of the S3 Hudi table as a list of pyarrow RecordBatches
table = HudiTable("s3://test-hudi-rs/v6_nonpartitioned/")
records = table.read_snapshot()
arrow_table = pa.Table.from_batches(records)

# duckdb can query the in-scope arrow table by name via its replacement scan
conn = duckdb.connect()
result = conn.execute("select * from arrow_table").arrow()
print(result)
It works for me and gives back the results I expect. Could you share the logs/exceptions that you encounter?
Hi @abyssnlp, thanks for the prompt response. I am not getting any error messages, logs, or data: my console hangs, and I cannot exit even with CTRL + C; I have to forcefully kill the terminal instance.
Hi @aditya-raval-genea. Could you please share the following details:
- Your Python version
- Your pyproject.toml or requirements.txt, in particular the pinned versions of hudi, pyarrow, and duckdb
- A redacted version of your .env
- The size of the dataset you're trying to read from S3
Sure @abyssnlp, here it is.
Python version is 3.12.3.
requirements.txt:
python-dotenv==1.0.1
opensearch_py==2.6.0
pyspark==3.5.1
ddtrace==2.9.3
hudi==0.1.0
duckdb==1.0.0
getdaft[all]==0.2.32
pandas==2.2.2
Flask==3.0.3
Flask-API==3.1
The redacted version of .env:
AWS_ACCESS_KEY_ID="XXXX"
AWS_SECRET_ACCESS_KEY="XXXX"
AWS_DEFAULT_REGION="us-east-1"
Size of the dataset: I am trying to query 42,000 records partitioned by customer_uuid=ABCD/year=2023/month=4/day=17.
Hi @aditya-raval-genea,
Thanks for sharing the details.
I created an isolated environment with the configuration you'd shared (Python 3.12 and all the relevant packages with pinned versions), and I'm able to successfully read from the Hudi table. I simulated a dataset with around 45k records on S3 using Hudi version 0.15.0.
Here are my 2 cents:
1. Partitioning on customer_uuid is not appropriate, as partitioning should be done on low-cardinality columns. If the cardinality is too high, you end up with too many small files and a huge file IO overhead. Try partitioning with the year/month/day scheme.
2. If hudi-rs doesn't work out for your use case, it might also be feasible to use pyspark to read the Hudi table, as Hudi has excellent integration with it; see the sketch after this list. You can read more about it here. Let me know if either of these works out for you.
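Not from the original thread: a minimal sketch of the pyspark alternative mentioned in point 2, assuming Spark 3.5 and Hudi 0.15.0. The bundle versions, bucket path, and filter columns are placeholders for illustration; match them to your own setup.

# Reading a Hudi table on S3 with pyspark (sketch; versions and paths are placeholders).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-s3-read")
    # Hudi Spark bundle plus hadoop-aws for s3a:// access; pick versions matching your cluster.
    .config(
        "spark.jars.packages",
        "org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0,"
        "org.apache.hadoop:hadoop-aws:3.3.4",
    )
    # Serializer recommended by the Hudi docs
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# s3a picks up AWS credentials from the standard env vars / default provider chain.
# Hudi tables are exposed to Spark through the "hudi" data source format.
df = spark.read.format("hudi").load("s3a://your-bucket/your_hudi_table/")

# With a year/month/day partition scheme (point 1 above), filters like this
# let Spark prune partitions instead of scanning the whole table.
df.filter("year = 2023 AND month = 4 AND day = 17").show()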
@abyssnlp Thank you so much for your help and support. The same piece of code works well when I use a local Hudi table; it only hangs for a long time when I try to connect to the S3 Hudi table. And since I want to expose an API on top of this, I would rather go with hudi-rs than pyspark.
Hi @aditya-raval-genea
Understood. So did you get back the expected results even though it took a long time? Network IO is slower than disk IO, especially if, for example, the bucket is in a different region far from you.
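Not from the original thread: a small debugging sketch, reusing the table URI and .env loading from the snippet above, that times the snapshot read on its own (before the duckdb step) to tell a slow cross-region read apart from a genuine hang. Adjust the path to your own bucket.

# Time the raw snapshot read to distinguish "slow" from "stuck" (debugging sketch).
import time

from dotenv import load_dotenv
from hudi import HudiTable

load_dotenv()

start = time.perf_counter()
table = HudiTable("s3://test-hudi-rs/v6_nonpartitioned/")
records = table.read_snapshot()
elapsed = time.perf_counter() - start

# If this eventually prints, the read is just slow (network IO, cross-region bucket,
# many small files); if it never returns, something is actually hung.
print(f"read {sum(batch.num_rows for batch in records)} rows in {elapsed:.1f}s")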
@abyssnlp thanks for helping look into this. Looks like there isn't an issue loading the creds; closing this now.
Description of the bug
I am trying to connect to an S3 bucket, but I am not able to connect to it. Refer to this as the .env file:
AWS_ACCESS_KEY_ID="XXX"
AWS_SECRET_ACCESS_KEY="XXX"
AWS_DEFAULT_REGION="us-east-1"
Refer to this as the Python snippet:
Steps To Reproduce
I am simply running the Python file. I can see that the .env file is printing the appropriate data.
Expected behavior
It should print the results to the console.
P.S. It works fine with local Hudi data on the local file system.
Screenshots / Logs
No response
Software information
Additional context
No response