Joris Van den Bossche / @jorisvandenbossche:
[~jon-rosenberg-env] thanks for the report. That looks like an annoying issue!
I am not very familiar with ADL myself, and I don't have access to it for testing (do they have public datasets that can be used for testing without an account, like public S3 buckets?), so I can't directly help with diagnosing this issue. But a few questions:
Can you try passing an adlfs filesystem object manually? Something like:

```python
import adlfs
import pyarrow.parquet as pq

adl = adlfs.AzureDatalakeFileSystem(...)
pq.read_table("...", filesystem=adl)
```
We have had previous reports related to Azure Data Lake, so while there have been issues before, that also indicates it was at least possible to read from it to some extent. cc @ldacey: did you ever run into this specific issue?
Lance Dacey / @ldacey: I am not sure about any public datasets. Locally, I use Azurite for testing, which can be installed or run as a Docker container. Note that I only use Azure Blob and not Azure Data Lake, so there might be some differences I am not aware of.
I use pyarrow ds.dataset() or pq.read_table() with a filesystem to read parquet data from Azure. I did a couple of tests with double slashes in the path. Perhaps I misunderstood what the original issue was, but it looks like I can read the data with pq.read_table() and with pandas using fs.open() and storage_options. I pasted my quick tests below.
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from adlfs import AzureBlobFileSystem
from pandas.testing import assert_frame_equal

URL = "http://127.0.0.1:10000"
ACCOUNT_NAME = "devstoreaccount1"
KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
CONN_STR = f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};"


@pytest.fixture
def example_data():
    return {
        "date_id": [20210114, 20210811],
        "id": [1, 2],
        "created_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "updated_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "category": ["cow", "sheep"],
        "value": [0, 99],
    }


def test_double_slashes(example_data):
    fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR)
    fs.mkdir("resource")
    path = "resource/path/to//parquet/files/part-001.parquet"
    table = pa.table(example_data)
    pq.write_table(table, where=path, filesystem=fs)

    # use pq.read_table() with filesystem
    new = pq.read_table(source=path, filesystem=fs)
    assert new == table

    # use adlfs filesystem.open()
    df = pd.read_parquet(fs.open(path, mode="rb"))
    dataframe_table = pa.Table.from_pandas(df)
    assert table == dataframe_table

    # use abfs path with storage options
    df2 = pd.read_parquet(f"abfs://{path}", storage_options={"connection_string": CONN_STR})
    assert_frame_equal(df, df2)
```
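A minimal sketch of the ds.dataset() variant mentioned above, assuming the same Azurite-backed filesystem and an illustrative directory path (this is not part of the pasted test and is not verified against the double-slash behaviour):

```python
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

# Reuses ACCOUNT_NAME and CONN_STR from the test above; the directory path is
# illustrative and mirrors the double-slash layout being discussed.
fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR)
dataset = ds.dataset("resource/path/to//parquet/files", filesystem=fs, format="parquet")
table = dataset.to_table()
```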
Jon Rosenberg: I'm not sure about public paths; I'll see if I can get something more specific running inside the Azurite image later today, but I can see that the test code here is slightly different from my usage in two ways:

1. I'm just specifying the full datalake path, without passing a filesystem or storage options to my pandas read. With my Azure credentials in environment variables and the scheme of the passed URL, pandas has no issue connecting to the lake. My hunch is this usage detail shouldn't affect my issue, but I'll verify when testing later.
2. I'm passing in the path to the partitioned files, not to any individual file. That is, instead of

```
abfs://resource/path/to//parquet/files/part-001.parquet
```

I would be passing

```
abfs://resource/path/to//parquet/files
```

which requires an ls from adlfs to retrieve the parquet files to concatenate. The ls is performed successfully and returns the list of files, EXCEPT that the double slash is not included in the returned paths, so it returns

```
resource/path/to/parquet/files/part-001.parquet
```

NOT

```
resource/path/to//parquet/files/part-001.parquet
```

and thus PyArrow was throwing an exception for me about the file being outside

```
resource/path/to//parquet/files
```

despite otherwise being able to proceed with the read if not for this check.
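To make the failure mode concrete, here is a minimal sketch of the check being described (an illustration only, not PyArrow's actual implementation):

```python
# Illustration only: the shape of the "is this file inside the base directory"
# check that the normalized listing trips up.
base_dir = "resource/path/to//parquet/files"                 # path as supplied by the user
listed = "resource/path/to/parquet/files/part-001.parquet"   # path as returned by adlfs ls

# A plain prefix comparison treats the listed file as outside the base
# directory, because adlfs collapsed the "//" that the supplied base still has.
print(listed.startswith(base_dir))  # False
```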
Jon Rosenberg: OK, I had to do some separate testing since Azurite is for Blob storage and not ADL, but it does seem there is a difference between how the two behave.
It appears that in Blob storage

```
resource/path/to//parquet/files
```

is a valid and distinct path from

```
resource/path/to/parquet/files
```

Changing the write in your test to write to a path with only one slash, while keeping the double slash in the read tests, caused a failure for me, but it appeared to be due to reading an empty location.
In the data lake, however, any double-slash path is interpreted the same as a single-slash path, which is where my error arises. Unfortunately I still don't have a public datalake path, but I will look around for such a reproduction.
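For reference, a sketch of the modification described above, based on Lance's test (it reuses `table` and `fs` from that test; paths are illustrative):

```python
# Write to the single-slash key but read back with the double-slash key.
# On Azurite / Blob storage these are distinct keys, so the read points at an
# empty location and fails, as described above.
single_slash = "resource/path/to/parquet/files/part-001.parquet"
double_slash = "resource/path/to//parquet/files/part-001.parquet"

pq.write_table(table, where=single_slash, filesystem=fs)
new = pq.read_table(source=double_slash, filesystem=fs)  # failed in this scenario
```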
Reading a partitioned parquet dataset from adlfs with pyarrow through pandas will throw unnecessary exceptions when forward slashes do not match in the listed files returned from adlfs, i.e.:
results in an exception of the form
and when testing with a modified adlfs method that prepends slashes to all returned files, we still end up with an error on file paths that would otherwise be handled correctly, where there is a double slash in a location where there should only be one, i.e.:
would throw
In both cases the ls has returned correctly from adlfs, given it has discovered the file part-0001.parquet, but the pyarrow exception stops what could otherwise be successful processing.
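For illustration only (the report's actual paths and exception text are not reproduced above), the failing call has roughly this shape; the container and path below are placeholders:

```python
import pandas as pd

# Placeholder path, not the reporter's actual dataset. Reading a partitioned
# dataset directory whose path contains a double slash: pandas delegates to
# pyarrow, adlfs lists the files with the "//" collapsed, and pyarrow raises
# because the listed files appear to fall outside the requested base directory.
df = pd.read_parquet("abfs://container/path/to//partitioned/dataset")
```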
Reporter: Jon Rosenberg
Note: This issue was originally created as ARROW-16077. Please see the migration documentation for further details.