apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv` #31666

Closed asfimport closed 2 years ago

asfimport commented 2 years ago

pyarrow.fs.S3FileSystem.open_input_file and pyarrow.fs.S3FileSystem.open_input_stream perform very poorly when used with pandas' read_csv.


import pandas as pd
import time
from pyarrow.fs import S3FileSystem

def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = S3FileSystem(
        anonymous=True,
        region="us-east-2",
        endpoint_override=None,
        proxy_options=None,
    )

    print("Time to create fs: ", time.time() - t0)
    t0 = time.time()
    # fhandler = fs.open_input_stream(
    #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    # )
    fhandler = fs.open_input_file(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)
    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df

t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)

Output:


Running...
Time to create fs:  0.0003612041473388672
Time to create fhandler:  0.22461509704589844
read time: 105.76488208770752
total time: 105.99135684967041

This is with pandas==1.4.2.

We get similar performance with fs.open_input_stream as well (commented out in the code above):


Running...
Time to create fs:  0.0002570152282714844
Time to create fhandler:  0.18540692329406738
read time: 186.8419930934906
total time: 187.03169012069702

When running it with just pandas (which uses s3fs under the hood), it's much faster:


import pandas as pd
import time

def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    year_2016_df = pd.read_csv(
        "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df

t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)

Output:


Running...
read time: 1.1012001037597656
total time: 1.101264238357544

Surprisingly, when we use fsspec's ArrowFSWrapper, it matches s3fs performance:


import pandas as pd
import time
from pyarrow.fs import S3FileSystem
from fsspec.implementations.arrow import ArrowFSWrapper

def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = ArrowFSWrapper(
        S3FileSystem(
            anonymous=True,
            region="us-east-2",
            endpoint_override=None,
            proxy_options=None,
        )
    )

    print("Time to create fs: ", time.time() - t0)
    t0 = time.time()
    fhandler = fs._open(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)
    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df

t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)

Output:


Running...
Time to create fs:  0.0002467632293701172
Time to create fhandler:  0.1858382225036621
read time: 0.13701486587524414
total time: 0.3232450485229492

Packages:


pyarrow: 7.0.0
pandas: 1.4.2
numpy: 1.20.3

I tested it with pyarrow 4.0.1 and 5.0.0 as well and saw similar results.

Environment: macOS 12.1, MacBook Pro (Intel x86)
Reporter: Sahil Gupta / @sahil1105
Assignee: Antoine Pitrou / @pitrou

PRs and other links:

Note: This issue was originally created as ARROW-16272. Please see the migration documentation for further details.

asfimport commented 2 years ago

David Li / @lidavidm: Off the top of my head, this is possibly because s3fs adds some readahead by default, which helps CSV a lot, while PyArrow's filesystem does not. PyArrow's own CSV reader doesn't really need this since it is multithreaded (which effectively provides readahead), but pandas' CSV reader may not do any readahead of its own.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Hmm, thanks for the report. For now, this can be worked around by wrapping the file in an io.BufferedReader.

But we should take a look at the underlying issue and find a way to fix it. It seems that, despite nrows=100, the S3 filesystem is reading 2 GB from the file...
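
The benefit of that workaround can be sketched with the standard library alone. In the sketch below, a raw file object that counts its reads stands in for a high-latency remote file (the class name and sizes are illustrative, not from this thread); wrapping it in io.BufferedReader coalesces many small reads into a handful of large buffer_size-sized ones, which is exactly what saves round trips against S3:


import io

class CountingRaw(io.RawIOBase):
    """Stand-in for a high-latency file: counts raw read calls."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        self.calls += 1
        return self._buf.readinto(b)

data = b"x" * 1_000_000

# Unbuffered: every small read hits the raw (i.e. remote) layer.
raw = CountingRaw(data)
for _ in range(1000):
    raw.read(100)
unbuffered_calls = raw.calls

# Buffered: reads are served from a 64 KiB readahead buffer,
# so only a few large raw reads are issued.
raw2 = CountingRaw(data)
buffered = io.BufferedReader(raw2, buffer_size=64 * 1024)
for _ in range(1000):
    buffered.read(100)
buffered_calls = raw2.calls

print(unbuffered_calls, buffered_calls)


Applied to the reproducer above, the idea would be wrapping the handle before passing it to pandas, e.g. io.BufferedReader(fs.open_input_file(path)), assuming the pyarrow file object exposes the readinto interface BufferedReader expects.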

asfimport commented 2 years ago

Sahil Gupta / @sahil1105: Thanks @pitrou  !

asfimport commented 2 years ago

Sahil Gupta / @sahil1105:

It seems that, despite nrows=100, the S3 filesystem is reading 2 GB from the file...

Yes, that's what we observed as well.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: The use case is fixed with https://github.com/apache/arrow/pull/13264 :


Running...
Time to create fs:  2.0029425621032715
Time to create fhandler:  0.4456977844238281
read time: 0.5826966762542725
    Summons Number Plate ID Registration State Plate Type  Issue Date  Violation Code  ... Community Board Community Council  Census Tract  BIN  BBL  NTA
0       1363745270  GGY6450                 99        PAS  07/09/2015              46  ...             NaN                NaN          NaN  NaN  NaN  NaN
1       1363745293   KXD355                 SC        PAS  07/09/2015              21  ...             NaN                NaN          NaN  NaN  NaN  NaN
2       1363745438  JCK7576                 PA        PAS  07/09/2015              21  ...             NaN                NaN          NaN  NaN  NaN  NaN
3       1363745475  GYK7658                 NY        OMS  07/09/2015              21  ...             NaN                NaN          NaN  NaN  NaN  NaN
4       1363745487  GMT8141                 NY        PAS  07/09/2015              21  ...             NaN                NaN          NaN  NaN  NaN  NaN
..             ...      ...                ...        ...         ...             ...  ...             ...                ...          ...  ...  ...  ...
95      1363748464  GFV8489                 NY        PAS  07/09/2015              21  ...             NaN                NaN          NaN  NaN  NaN  NaN
96      1363748476   X15EGU                 NJ        PAS  07/09/2015              20  ...             NaN                NaN          NaN  NaN  NaN  NaN
97      1363748490  GDM1774                 NY        PAS  07/09/2015              38  ...             NaN                NaN          NaN  NaN  NaN  NaN
98      1363748531   G45DSY                 NJ        PAS  07/09/2015              37  ...             NaN                NaN          NaN  NaN  NaN  NaN
99      1363748579   RR76Y0                 PA        PAS  07/09/2015              20  ...             NaN                NaN          NaN  NaN  NaN  NaN

[100 rows x 51 columns]
total time: 3.0595762729644775
asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Issue resolved by pull request 13264 https://github.com/apache/arrow/pull/13264