apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

Support reading Avro files in zstd codec #349

Open siumingdev opened 1 year ago

siumingdev commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. I would like to read Avro files compressed with the zstd codec, like this:

from datafusion import SessionContext

ctx = SessionContext()
ctx.read_avro("/path/to/my/avro/in/zstd/codec")

But currently it gives the following error:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[4], line 4
      1 from datafusion import SessionContext
      3 ctx = SessionContext()
----> 4 ctx.read_avro("/path/to/my/avro/in/zstd/codec")

Exception: DataFusion error: AvroError(CodecNotSupported("zstandard"))

I am running the code in the official Python 3.9 Docker image (python:3.9-slim), with DataFusion installed via pip install datafusion.

Describe the solution you'd like No idea. Is it perhaps not supported in the underlying Rust implementation either?

Describe alternatives you've considered Read the file using another Avro library and convert the result into a DataFusion DataFrame.
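A rough sketch of that workaround, assuming the third-party fastavro package (with the zstandard package installed so it can decode the codec) and pyarrow are available. The function names records_to_columns and read_avro_with_fastavro are made up for illustration, and the Avro schema is not carried over: column types are simply inferred by pyarrow from the decoded values.

```python
def records_to_columns(records):
    """Pivot a list of row dicts into a dict of column lists.

    Keys missing from a row are filled with None so all columns
    end up with the same length.
    """
    names = []
    for row in records:
        for key in row:
            if key not in names:
                names.append(key)
    return {name: [row.get(name) for row in records] for name in names}


def read_avro_with_fastavro(path):
    """Hypothetical helper: decode Avro in Python, then hand it to DataFusion."""
    # Third-party imports are local so the pure helper above has no dependencies.
    import fastavro  # assumption: zstandard is installed for the zstd codec
    import pyarrow as pa
    from datafusion import SessionContext

    with open(path, "rb") as fo:
        # Loads every record into memory; fine for small files only.
        records = list(fastavro.reader(fo))

    table = pa.table(records_to_columns(records))
    ctx = SessionContext()
    # create_dataframe takes a list of partitions, each a list of record batches.
    return ctx.create_dataframe([table.to_batches()])
```

Calling read_avro_with_fastavro("/path/to/my/avro/in/zstd/codec") would then take the place of ctx.read_avro(...) in the snippet above, at the cost of decoding the whole file in Python first.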

Additional context

mesejo commented 1 year ago

As you mentioned, this is not supported by the Rust implementation (see the reading options). It would be better to move the ticket to the appropriate repo. What are your thoughts, @alamb @andygrove?

alamb commented 1 year ago

I agree -- this is a problem lower down in the stack (either in DataFusion or avro_rs)

I took a brief look at avro_rs and didn't see any mention of zstd 🤔 https://docs.rs/avro-rs/latest/avro_rs/#using-codecs-to-compress-data