Closed asfimport closed 2 years ago
David Li / @lidavidm: Off the top of my head this is possibly because s3fs adds some readahead by default, which helps CSV a lot, and PyArrow's filesystem does not do this. PyArrow's CSV reader doesn't really need this since it's multithreaded (which effectively gives readahead) but Pandas's CSV reader may not do this.
Antoine Pitrou / @pitrou:
Hmm, thanks for the report. For now, this can be worked around by wrapping the file in an io.BufferedReader.
But we should take a look at the underlying issue and find a way to fix it. It seems that, despite nrows=100, the S3 filesystem is reading 2 GB from the file...
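To illustrate why the buffering workaround can help, here is a minimal sketch that does not touch S3 at all. CountingRaw and count_reads_for_lines are hypothetical names made up for this example; the raw stream simply counts how often the backend is asked for data, standing in for a remote file where every read is a network request. It only shows the general effect of io.BufferedReader, not what PyArrow's S3 filesystem does internally.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for a remote file: counts each backend read."""

    def __init__(self, data: bytes):
        super().__init__()
        self._inner = io.BytesIO(data)
        self.read_calls = 0

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Each call here represents one request to the slow backend.
        self.read_calls += 1
        return self._inner.readinto(buffer)


def count_reads_for_lines(data: bytes, n_lines: int, buffer_size: int) -> int:
    """Read n_lines through a BufferedReader and report backend read calls."""
    raw = CountingRaw(data)
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    for _ in range(n_lines):
        reader.readline()
    return raw.read_calls


# Synthetic CSV-like payload; the contents are arbitrary.
data = b"".join(b"row-%d,value\n" % i for i in range(10_000))

small = count_reads_for_lines(data, 100, buffer_size=16)
large = count_reads_for_lines(data, 100, buffer_size=64 * 1024)
print("backend reads with 16-byte buffer:", small)
print("backend reads with 64 KiB buffer:", large)

# With pyarrow, the suggestion from this thread would look roughly like
#   io.BufferedReader(fs.open_input_file(path))
# where fs and path are placeholders for your own filesystem and object key.
```

A larger buffer means fewer, bigger requests to the backend, which is usually the better trade-off when each request has high latency.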
Sahil Gupta / @sahil1105: Thanks @pitrou !
Sahil Gupta / @sahil1105:
> It seems that, despite nrows=100, the S3 filesystem is reading 2 GB from the file...
Yes, that's what we observed as well.
Antoine Pitrou / @pitrou: The use case is fixed with https://github.com/apache/arrow/pull/13264 :
Running...
Time to create fs: 2.0029425621032715
Time to create fhandler: 0.4456977844238281
read time: 0.5826966762542725
Summons Number Plate ID Registration State Plate Type Issue Date Violation Code ... Community Board Community Council Census Tract BIN BBL NTA
0 1363745270 GGY6450 99 PAS 07/09/2015 46 ... NaN NaN NaN NaN NaN NaN
1 1363745293 KXD355 SC PAS 07/09/2015 21 ... NaN NaN NaN NaN NaN NaN
2 1363745438 JCK7576 PA PAS 07/09/2015 21 ... NaN NaN NaN NaN NaN NaN
3 1363745475 GYK7658 NY OMS 07/09/2015 21 ... NaN NaN NaN NaN NaN NaN
4 1363745487 GMT8141 NY PAS 07/09/2015 21 ... NaN NaN NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
95 1363748464 GFV8489 NY PAS 07/09/2015 21 ... NaN NaN NaN NaN NaN NaN
96 1363748476 X15EGU NJ PAS 07/09/2015 20 ... NaN NaN NaN NaN NaN NaN
97 1363748490 GDM1774 NY PAS 07/09/2015 38 ... NaN NaN NaN NaN NaN NaN
98 1363748531 G45DSY NJ PAS 07/09/2015 37 ... NaN NaN NaN NaN NaN NaN
99 1363748579 RR76Y0 PA PAS 07/09/2015 20 ... NaN NaN NaN NaN NaN NaN
[100 rows x 51 columns]
total time: 3.0595762729644775
Antoine Pitrou / @pitrou: Issue resolved by pull request 13264 https://github.com/apache/arrow/pull/13264
pyarrow.fs.S3FileSystem.open_input_file and pyarrow.fs.S3FileSystem.open_input_stream perform very poorly when used with Pandas' read_csv.
Output:
This is with pandas==1.4.2.
Getting similar performance with fs.open_input_stream as well (commented out in the code).
When running it with just pandas (which uses s3fs under the hood), it's much faster:
Output:
Surprisingly, when we use fsspec's ArrowFSWrapper, it matches s3fs performance:
Output:
Packages:
I tested it with 4.0.1, 5.0.0 as well and saw similar results.
Environment: MacOS 12.1, MacBook Pro, Intel x86
Reporter: Sahil Gupta / @sahil1105
Assignee: Antoine Pitrou / @pitrou
PRs and other links:
Note: This issue was originally created as ARROW-16272. Please see the migration documentation for further details.