apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

Add parallel gz decompression, like pigz #624

Open vchemla opened 3 months ago

vchemla commented 3 months ago

Hi,

In our case, we would like to read a large gzip-compressed CSV file (.csv.gz).

We would like to use the read_csv function like this:

ctx.read_csv('myfile.csv.gz', file_extension='.csv.gz', delimiter=';', has_header=True, schema_infer_max_records=0, file_compression_type='gzip')

However, read_csv does not decompress the file in parallel the way pigz does: decompressing with pigz takes 54 seconds, compared to 800 seconds when using the read_csv function.
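A possible stopgap, sketched here under assumptions, is to decompress the file to a plain CSV first and then pass the result to read_csv. The helper names (decompress_gz, read_csv_via_decompress) are hypothetical and not part of DataFusion, the fallback to Python's gzip module is only there for machines without pigz on the PATH, and ctx is assumed to be a datafusion SessionContext. This costs extra disk space for the decompressed copy.

```python
import gzip
import shutil
import subprocess


def decompress_gz(src: str, dst: str) -> str:
    """Decompress the gzip file `src` into `dst`, preferring pigz if installed.

    Hypothetical helper for illustration; not a DataFusion API.
    """
    pigz = shutil.which("pigz")
    with open(dst, "wb") as out:
        if pigz is not None:
            # -d: decompress, -c: write to stdout so `src` is left untouched.
            subprocess.run([pigz, "-d", "-c", src], stdout=out, check=True)
        else:
            # Fallback (assumption): single-threaded stdlib decompression.
            with gzip.open(src, "rb") as fin:
                shutil.copyfileobj(fin, out)
    return dst


def read_csv_via_decompress(ctx, src: str, dst: str):
    """Decompress first, then read the plain CSV.

    `ctx` is assumed to be a datafusion.SessionContext; no compression
    arguments are needed because `dst` is an uncompressed .csv file.
    """
    plain = decompress_gz(src, dst)
    return ctx.read_csv(plain, delimiter=";", has_header=True)
```

For example, read_csv_via_decompress(ctx, 'myfile.csv.gz', 'myfile.csv') would read the decompressed copy with the same delimiter and header settings as the call above.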

If you could take a look...