apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path #20176

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Reading a partitioned parquet dataset from adlfs with pyarrow through pandas will throw unnecessary exceptions when the forward slashes in the file paths listed by adlfs do not match those of the base directory, i.e.:


import pandas as pd

pd.read_parquet("adl://resource/path/to/parquet/files")

results in an exception of the form:


pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'path/to/parquet/files/part-0001.parquet', which is outside base dir '/path/to/parquet/files/'

 

Testing with a modified adlfs method that prepends a slash to every returned file, we still end up with an error on paths that would otherwise be handled correctly whenever the requested path contains a double slash where there should be a single one, i.e.:

 


import pandas as pd

pd.read_parquet("adl://resource/path/to//parquet/files") 

would throw


pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path '/path/to/parquet/files/part-0001.parquet', which is outside base dir '/path/to//parquet/files/' 

In both cases the ls from adlfs has returned correctly (it has discovered the file part-0001.parquet), but the pyarrow exception stops what could otherwise be successful processing.
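
A possible client-side workaround for the double-slash case (a sketch only, not something from the report; the URL below is hypothetical) is to collapse repeated slashes in the object path before handing the URL to pandas, so the base directory pyarrow derives matches the paths adlfs lists back:


import re

import pandas as pd

# Hypothetical URL with an accidental double slash, as in the report above.
url = "adl://resource/path/to//parquet/files"

# Collapse repeated slashes in the object path only, leaving the scheme intact.
scheme, _, path = url.partition("://")
normalized = scheme + "://" + re.sub(r"/{2,}", "/", path)

# Credentials are still picked up from the environment, as in the report.
df = pd.read_parquet(normalized)

Note this only sidesteps the double-slash variant of the failure, not the leading-slash mismatch in the first traceback.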

Reporter: Jon Rosenberg

Note: This issue was originally created as ARROW-16077. Please see the migration documentation for further details.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: [~jon-rosenberg-env] thanks for the report. That looks like an annoying issue!

I am not very familiar with ADL myself, nor do I have access to it for testing (do they have public datasets that can be used without an account, like public S3 buckets?), so I can't directly help with diagnosing this issue. But a few questions:

Can you try passing an adlfs filesystem object manually? Something like:


import adlfs
import pyarrow.parquet as pq

adl = adlfs.AzureDatalakeFileSystem(...)
pq.read_table("...", filesystem=adl)

We have had previous reports related to Azure Data Lake, so while there have been issues before, that also indicates it was at least possible to read from it to a certain extent. cc @ldacey did you ever run into this specific issue?
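
For context (an illustration, not part of the comment above): when pyarrow is handed an fsspec-compatible filesystem it wraps it in a PyFileSystem via FSSpecHandler, and the GetFileInfo() paths in the error come from that wrapper. Doing the wrapping by hand can make it easier to see exactly which paths are being compared; a sketch, with a hypothetical account name and path:


import adlfs
from pyarrow.fs import FileSelector, FSSpecHandler, PyFileSystem

# adlfs filesystem built from credentials in the environment (account name is hypothetical).
fs = adlfs.AzureBlobFileSystem(account_name="myaccount")

# This wrapping is essentially what pyarrow does internally for fsspec filesystems.
wrapped = PyFileSystem(FSSpecHandler(fs))

# Print the paths GetFileInfo() yields for the directory; these are the paths
# that must fall under the base dir named in the ArrowInvalid message above.
for info in wrapped.get_file_info(FileSelector("container/path/to/parquet/files", recursive=True)):
    print(info.path)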

asfimport commented 2 years ago

Lance Dacey / @ldacey: I am not sure about any public datasets. Locally, I use azurite for testing which can be installed or run as a Docker container. Note that I only use Azure Blob and not Azure Data Lake, so there might be some differences I am not aware of.

I use pyarrow ds.dataset() or pq.read_table() with a filesystem to read parquet data from Azure. I did a couple of tests with double slashes in the path. Perhaps I misunderstood what the original issue was, but it looks like I can read the data with pq.read_table and with pandas using fs.open() and storage_options. I pasted my quick tests below.


import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from adlfs import AzureBlobFileSystem
from pandas.testing import assert_frame_equal

URL = "http://127.0.0.1:10000"
ACCOUNT_NAME = "devstoreaccount1"
KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
CONN_STR = f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};"

@pytest.fixture
def example_data():
    return {
        "date_id": [20210114, 20210811],
        "id": [1, 2],
        "created_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "updated_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "category": ["cow", "sheep"],
        "value": [0, 99],
    }

def test_double_slashes(example_data):
    fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR)
    fs.mkdir("resource")
    path = "resource/path/to//parquet/files/part-001.parquet"
    table = pa.table(example_data)
    pq.write_table(table, where=path, filesystem=fs)

    # use pq.read_table() with filesystem
    new = pq.read_table(source=path, filesystem=fs)
    assert new == table

    # use adlfs filesystem.open()
    df = pd.read_parquet(fs.open(path, mode="rb"))
    dataframe_table = pa.Table.from_pandas(df)
    assert table == dataframe_table

    # use abfs path with storage options
    df2 = pd.read_parquet(f"abfs://{path}", storage_options={"connection_string": CONN_STR})
    assert_frame_equal(df, df2)
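
The test above reads a single file path directly; a variant that exercises directory discovery (listing the directory contents via adlfs, which is where the mismatch in the report occurs) might look like the sketch below, reusing the example_data fixture and azurite connection string defined above:


def test_double_slashes_directory(example_data):
    fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR)
    # Assumes the "resource" container already exists (e.g. created by the test above).
    directory = "resource/path/to//parquet/files"
    table = pa.table(example_data)
    pq.write_table(table, where=f"{directory}/part-001.parquet", filesystem=fs)

    # Reading the directory forces pyarrow to list its contents through adlfs,
    # which is the code path the original report exercises.
    new = pq.read_table(source=directory, filesystem=fs)
    assert new == table
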
asfimport commented 2 years ago

Jon Rosenberg: I'm not sure about public paths; I'll see if I can get something more specific running inside the azurite image later today. But I do see that the test code here differs from my usage in two ways:

  1. I'm specifying the full datalake path and not passing a filesystem or storage options to my pandas read; with my Azure credentials in environment variables, and using the scheme of the passed URL, pandas has no issue connecting to the lake. My hunch is this usage detail shouldn't affect the issue, but I'll verify when testing later.

  2. I'm passing in the path to the partitioned files, not any individual file. That is, instead of

abfs://resource/path/to//parquet/files/part-001.parquet

I would be passing

abfs://resource/path/to//parquet/files 

which requires an ls from adlfs to retrieve the parquet files to concatenate. The ls is performed successfully and returns the list of files, EXCEPT that the double slash is not included in the paths adlfs returns, giving:


resource/path/to/parquet/files/part-001.parquet 

NOT


resource/path/to//parquet/files/part-001.parquet 

and thus pyarrow was throwing an exception for me because those paths fall outside the base dir


resource/path/to//parquet/files 

even though the read could otherwise proceed if not for this check.
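
To make the mismatch concrete (an illustrative sketch, not code from this thread): the base directory pyarrow derives keeps the double slash exactly as typed, while the paths adlfs returns from the listing have it collapsed, so a prefix check like the one behind the ArrowInvalid error fails even though the file really is inside the directory:


import posixpath

base_dir = "resource/path/to//parquet/files"                  # as passed by the caller
listed = "resource/path/to/parquet/files/part-001.parquet"    # as returned by the adlfs ls

# A prefix check on the raw strings fails...
print(listed.startswith(base_dir.rstrip("/") + "/"))          # False

# ...but succeeds once the base dir is normalized the same way adlfs reports paths.
print(listed.startswith(posixpath.normpath(base_dir) + "/"))  # True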

asfimport commented 2 years ago

Jon Rosenberg: OK, I had to do some separate testing since azurite is for blob storage and not adl, but it does seem there is a difference between how the two behave.

It appears that in blob storage


resource/path/to//parquet/files  

is a valid and distinct path from


resource/path/to/parquet/files  

Changing the write in your test to a path with only one slash, while keeping the double slash in the read tests, caused a failure for me, but it appeared to be due to reading an empty location.

In the data lake, however, any double-slash path is interpreted the same as a single-slash path, which is what my error arises from. I unfortunately still don't have a public datalake path, but I will look around for such a reproduction.
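
One way to check the difference described here (a sketch; the account and paths are hypothetical, with credentials taken from the environment) is to list both spellings of the directory and compare the results:


import adlfs

# Hypothetical account; for ADLS Gen2 the same class is used against a
# hierarchical-namespace-enabled storage account.
fs = adlfs.AzureBlobFileSystem(account_name="myaccount")

single = fs.ls("container/path/to/parquet/files")
print(single)

# On a data lake (hierarchical namespace) the double-slash spelling reportedly
# resolves to the same location; on plain blob storage it can be a distinct,
# possibly empty or missing, location.
try:
    double = fs.ls("container/path/to//parquet/files")
    print(double == single)
except FileNotFoundError:
    print("double-slash path does not exist as a separate location")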