apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0
ValueError: Unrecognized filesystem type in URI: abfss #527

Closed djouallah closed 8 months ago

djouallah commented 8 months ago

Apache Iceberg version

None

Please describe the bug 🐞

Reproducible example: https://colab.research.google.com/drive/1EjffJO75-8Rj4V0MGKUsoFHDOGgicKgK#scrollTo=8WRyLlmyXnXu

kevinjqliu commented 8 months ago

Pulling the example out of the notebook:

table = catalog.create_table(
    "default.taxi_dataset",
    schema=df.schema,
    location="abfss://onelakene.dfs.core.windows.net/aemo/iceberg",
)

error:

WARNING:pyiceberg.io:Could not initialize FileIO: pyiceberg.io.fsspec.FsspecFileIO
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-f052e36c7671> in <cell line: 1>()
----> 1 table = catalog.create_table(
      2     "default.taxi_dataset",
      3     schema=df.schema,location="abfss://onelakene.dfs.core.windows.net/aemo/iceberg"
      4 )

3 frames
/usr/local/lib/python3.10/dist-packages/pyiceberg/io/pyarrow.py in _initialize_fs(self, scheme, netloc)
    390             return PyArrowLocalFileSystem()
    391         else:
--> 392             raise ValueError(f"Unrecognized filesystem type in URI: {scheme}")
    393 
    394     def new_input(self, location: str) -> PyArrowFile:

ValueError: Unrecognized filesystem type in URI: abfss

kevinjqliu commented 8 months ago

The abfss scheme isn't currently supported by PyIceberg's pyarrow FileIO implementation: https://github.com/apache/iceberg-python/blob/7f712fdad025a2110816ec217616de54631f1e3e/pyiceberg/io/pyarrow.py#L339-L393

It is, however, supported by the fsspec implementation: https://github.com/apache/iceberg-python/blob/7f712fdad025a2110816ec217616de54631f1e3e/pyiceberg/io/fsspec.py#L181-L182
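To illustrate why the same location fails with one FileIO and not the other, here is a minimal sketch of scheme-based dispatch. The scheme tables and the helper names (`scheme_of`, `supporting_impls`) are hypothetical and only for illustration; the real lists live in the two files linked above and may differ between versions.

```python
from urllib.parse import urlparse

# Illustrative tables only (assumption): NOT the actual lists from PyIceberg.
# They just encode the point made above: abfss is known to fsspec, not pyarrow.
ILLUSTRATIVE_SCHEMES = {
    "pyarrow": {"s3", "file"},
    "fsspec": {"s3", "file", "abfs", "abfss"},
}


def scheme_of(location):
    """Return the URI scheme of a table location, e.g. 'abfss'."""
    return urlparse(location).scheme


def supporting_impls(location):
    """List the illustrative implementations whose table contains the scheme."""
    scheme = scheme_of(location)
    return [name for name, schemes in ILLUSTRATIVE_SCHEMES.items() if scheme in schemes]


if __name__ == "__main__":
    location = "abfss://onelakene.dfs.core.windows.net/aemo/iceberg"
    print(scheme_of(location), supporting_impls(location))
```

An implementation that does not find the scheme in its own table has nothing to fall back on, which is consistent with the `Unrecognized filesystem type in URI: abfss` error in the traceback.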

kevinjqliu commented 8 months ago

Looks like pyarrow can support "fsspec-compatible filesystems" like Azure Blob Storage (abfs/abfss) https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems-with-arrow

There's an issue open to make fsspec and pyarrow filesystems cross-compatible #310

kevinjqliu commented 8 months ago

In the meantime, I think you might be able to work around this by explicitly using fsspec. You'd have to set the py-io-impl property in the catalog properties:

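from google.colab import userdata  # Colab secrets helper used in the notebook
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "default",
    **{
        "uri": "sqlite:///:memory:",
        "adlfs.account-name": userdata.get("account_name"),
        "adlfs.account-key": userdata.get("AZURE_STORAGE_ACCOUNT_KEY"),
        "adlfs.tenant-id": userdata.get("azure_storage_tenant_id"),
        # Force the fsspec-based FileIO instead of the default pyarrow one
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    },
)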

https://github.com/apache/iceberg-python/blob/7f712fdad025a2110816ec217616de54631f1e3e/pyiceberg/catalog/sql.py#L185

djouallah commented 8 months ago

Getting this error now:

    319             return file_io
    320         else:
--> 321             raise ValueError(f"Could not initialize FileIO: {io_impl}")
    322 
    323     # Check the table location

ValueError: Could not initialize FileIO: pyiceberg.io.fsspec.FsspecFileIO

kevinjqliu commented 8 months ago

Oh interesting, this is caused by a missing dependency.

The actual error shows up when you try to import that class directly:

from pyiceberg.io.fsspec import FsspecFileIO

PyIceberg's fsspec module (pyiceberg/io/fsspec.py) imports botocore, and botocore is not installed by !pip install -q pyiceberg[adlfs]: https://github.com/apache/iceberg-python/blob/7f712fdad025a2110816ec217616de54631f1e3e/pyiceberg/io/fsspec.py#L33-L34
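Because "Could not initialize FileIO" hides the underlying cause, it can help to attempt the import yourself and print the real exception. The helper below is a hypothetical sketch (the name `diagnose_import` is not part of PyIceberg); it uses only the standard library and works for any dotted `module.ClassName` path.

```python
import importlib


def diagnose_import(dotted_path):
    """Try to import `module.ClassName`.

    Returns None on success, or a short "ErrorType: message" string
    describing why the import failed.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    try:
        module = importlib.import_module(module_name)
        getattr(module, class_name)
    except (ImportError, AttributeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


if __name__ == "__main__":
    # Prints None if the class imports cleanly; otherwise prints the
    # underlying error, such as a missing module.
    print(diagnose_import("pyiceberg.io.fsspec.FsspecFileIO"))
```

If the printed message names a missing module, installing that module is the likely fix.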

To resolve this, install botocore:

!pip install botocore

In the future, we'll get rid of this dependency issue. This is caught by deptry in #528.

djouallah commented 8 months ago

Thanks, it works now