
[Python] Enable random access reading for Python file objects (if supported) #26920

Open asfimport opened 3 years ago

asfimport commented 3 years ago

arrow::py::PyReadableFile::ReadAt is documented as thread-safe (it takes a lock on the underlying Python file object) and should therefore allow random access from parallel code (for example, reading a subset, such as a single column, of a Parquet file).

However, based on experimentation, this does not seem to work (e.g. when using an s3fs filesystem to read a specific Parquet column).
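
For illustration, here is a minimal sketch (not from the original report) of the kind of parallel random access this refers to, wrapping a plain Python file object in pyarrow.PythonFile; the file name, offsets, and chunk size are placeholders:

import concurrent.futures
import pyarrow as pa

# Wrap a regular Python file object; reads go through PyReadableFile on the C++ side
source = pa.PythonFile(open("test.parquet", "rb"), mode="r")

def read_chunk(offset):
    # read_at seeks and reads under a lock, so concurrent calls should be safe
    return source.read_at(4096, offset)

with concurrent.futures.ThreadPoolExecutor() as pool:
    chunks = list(pool.map(read_chunk, [0, 4096, 8192]))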

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-11000. Please see the migration documentation for further details.

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: It might also be an issue specific to PyFileSystem handlers, because when adding some print statements to PyReadableFile::ReadAt, it is clearly called for a plain Python file object:


In [3]: with open("test.parquet", "rb") as f:
   ...:     pq.read_table(f)
   ...: 
Calling PyReadableFile::ReadAt
Called seek successfully
....
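
For comparison, a sketch of the PyFileSystem handler route mentioned above, wrapping an fsspec filesystem with pyarrow.fs.FSSpecHandler (a local filesystem and the same test file are used here purely for illustration):

import fsspec
import pyarrow.fs as pa_fs
import pyarrow.parquet as pq

# Wrap an fsspec filesystem so file access is dispatched through the Python handler
local = fsspec.filesystem("file")
py_fs = pa_fs.PyFileSystem(pa_fs.FSSpecHandler(local))

# Files opened through py_fs take the PyFileSystem handler code path
table = pq.read_table("test.parquet", filesystem=py_fs)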
asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: The reason I opened this issue is that I saw large download sizes when using s3fs to read a single column of a large Parquet file (compared to using our own S3FileSystem, which downloaded only a small amount of data).

So the example is basically the following. But with the mentioned print statements, it seems PyReadableFile::ReadAt also gets called in this case, so it might rather be an issue on the s3fs side?


In [7]: import s3fs

In [8]: fs2 = s3fs.S3FileSystem(anon=True)

In [9]: pq.read_table('ursa-labs-taxi-data/2016/01/data.parquet', filesystem=fs2, columns=["passenger_count"])
Calling PyReadableFile::ReadAt
Called seek successfully
....
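
For reference, a sketch of the equivalent read through Arrow's own S3FileSystem mentioned above; the region value is an assumption and may need to match the bucket's actual region:

import pyarrow.fs as pa_fs
import pyarrow.parquet as pq

# Arrow's native S3 filesystem issues its own ranged GET requests
s3 = pa_fs.S3FileSystem(anonymous=True, region="us-east-2")  # region assumed
table = pq.read_table(
    "ursa-labs-taxi-data/2016/01/data.parquet",
    filesystem=s3,
    columns=["passenger_count"],
)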
asfimport commented 3 years ago

Maarten Breddels / @maartenbreddels: Did you check passing default_fill_cache=False and default_block_size=1 to the s3fs constructor? I think it does too many smart things by default.
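
A rough sketch of that suggestion, using the constructor arguments named above (whether a block size of 1 byte is advisable in practice is left aside):

import s3fs
import pyarrow.parquet as pq

# Disable read-ahead caching so each ReadAt maps to a small ranged request
fs = s3fs.S3FileSystem(
    anon=True,
    default_fill_cache=False,
    default_block_size=1,
)
table = pq.read_table(
    "ursa-labs-taxi-data/2016/01/data.parquet",
    filesystem=fs,
    columns=["passenger_count"],
)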