asfimport opened 3 years ago
Joris Van den Bossche / @jorisvandenbossche:
It might also be an issue specific to PyFileSystem handlers, because when adding some print statements in `PyReadableFile::ReadAt`, it is clearly called for a plain Python file object:
```
In [3]: with open("test.parquet", "rb") as f:
   ...:     pq.read_table(f)
   ...:
Calling PyReadableFile::ReadAt
Called seek successfully
....
```
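For reference, here is a minimal sketch of observing the same access pattern purely from the Python side, without recompiling Arrow: wrap the plain file object so every `seek`/`read` that pyarrow forwards to it gets logged. The `LoggingFile` wrapper is a hypothetical illustration, not an Arrow API.

```python
import pyarrow.parquet as pq

class LoggingFile:
    """Hypothetical wrapper that logs the seek/read calls pyarrow issues."""

    def __init__(self, f):
        self._f = f

    def read(self, nbytes=-1):
        print(f"read({nbytes}) at offset {self._f.tell()}")
        return self._f.read(nbytes)

    def seek(self, offset, whence=0):
        print(f"seek({offset}, {whence})")
        return self._f.seek(offset, whence)

    def __getattr__(self, name):
        # Delegate everything else (tell, seekable, closed, ...) to the real file.
        return getattr(self._f, name)

with open("test.parquet", "rb") as f:
    pq.read_table(LoggingFile(f))
```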
Joris Van den Bossche / @jorisvandenbossche:
The reason I opened this issue is that I saw large download sizes when using s3fs to read a single column of a large parquet file (compared to using our own S3FileSystem, which downloaded only a little data). So the example is basically the following. But, with the mentioned print statements, it seems `PyReadableFile::ReadAt` also gets called in this case, so it might rather be an issue on the s3fs side?
```
In [7]: import s3fs

In [8]: fs2 = s3fs.S3FileSystem(anon=True)

In [9]: pq.read_table('ursa-labs-taxi-data/2016/01/data.parquet', filesystem=fs2, columns=["passenger_count"])
Calling PyReadableFile::ReadAt
Called seek successfully
....
```
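For comparison, a sketch of the same read through pyarrow's native S3FileSystem, which was observed to download far less data; the `region` value is an assumption about where the public bucket lives, not something stated above.

```python
import pyarrow.fs
import pyarrow.parquet as pq

# Anonymous access to the public bucket; the region is assumed, adjust if needed.
s3 = pyarrow.fs.S3FileSystem(anonymous=True, region="us-east-2")
pq.read_table(
    "ursa-labs-taxi-data/2016/01/data.parquet",
    filesystem=s3,
    columns=["passenger_count"],
)
```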
Maarten Breddels / @maartenbreddels:
Did you check with passing `default_fill_cache=False` and `default_block_size=1` to the s3fs constructor? I think it does too many smart things by default.
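Spelled out, that suggestion would look roughly like the following; `default_fill_cache` and `default_block_size` are real s3fs constructor arguments, and the 1-byte block size merely disables read-ahead for this experiment rather than being a recommended setting.

```python
import s3fs
import pyarrow.parquet as pq

fs2 = s3fs.S3FileSystem(
    anon=True,
    default_fill_cache=False,  # don't fill a block cache around each read
    default_block_size=1,      # effectively disable read-ahead
)
pq.read_table(
    "ursa-labs-taxi-data/2016/01/data.parquet",
    filesystem=fs2,
    columns=["passenger_count"],
)
```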
`arrow::py::PyReadableFile::ReadAt` is documented as thread-safe (it takes a lock on the underlying Python file) and should thus allow random access in parallel code (for example, reading a subset, e.g. a column, of a parquet file). However, based on experimentation, it seems this doesn't work (e.g. with an s3fs filesystem to read a specific parquet column).
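One way to exercise the PyFileSystem handler path explicitly (a sketch, not from the original report) is to wrap the fsspec-based s3fs filesystem in pyarrow's `PyFileSystem` via `FSSpecHandler`, so every read is routed through `PyReadableFile::ReadAt`:

```python
import s3fs
import pyarrow.parquet as pq
from pyarrow.fs import PyFileSystem, FSSpecHandler

# Route all I/O through the PyFileSystem handler (and thus PyReadableFile).
wrapped = PyFileSystem(FSSpecHandler(s3fs.S3FileSystem(anon=True)))
pq.read_table(
    "ursa-labs-taxi-data/2016/01/data.parquet",
    filesystem=wrapped,
    columns=["passenger_count"],
)
```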
Reporter: Joris Van den Bossche / @jorisvandenbossche
Note: This issue was originally created as ARROW-11000. Please see the migration documentation for further details.