aai-institute / lakefs-spec

An fsspec implementation for the lakeFS project.
http://lakefs-spec.org/
Apache License 2.0
37 stars 4 forks source link

Fix ls #260

Closed renesat closed 4 months ago

renesat commented 5 months ago

I found it when working with DuckDB. To reproduce the second bug, you can try calling this:

import duckdb
import fsspec

# Working
fsspec.filesystem("lakefs").glob("lakefs://test/main/test_dir/*.parquet")

# Not working
duckdb.register_filesystem(fsspec.filesystem("lakefs"))
duckdb.sql("FROM 'lakefs://test/main/test_dir/*.parquet'")
nicholasjng commented 5 months ago

Thanks for the PR. What exactly is the issue you are seeing?

FWIW, for me, both of these statements work (although they do different things).

EDIT: The above repro on a local lakeFS quickstart instance, slightly edited (requires a .lakectl.yaml):

import duckdb
import fsspec
import lakefs

lakefs.Repository("quickstart").create("local://quickstart", include_samples=True, exist_ok=True)

# Working
res_a = fsspec.filesystem("lakefs").glob("lakefs://quickstart/main/*.parquet")
print(res_a)

# Not working
duckdb.register_filesystem(fsspec.filesystem("lakefs"))
res_b = duckdb.sql("FROM 'lakefs://quickstart/main/*.parquet'")
print(res_b)
renesat commented 5 months ago

@nicholasjng this is more complex example with sample repo:

import duckdb
import fsspec
import lakefs

lakefs.Repository("quickstart").create("local://quickstart", include_samples=True, exist_ok=True)

fs = fsspec.filesystem("lakefs")
fs.mkdir("lakefs://quickstart/main/test")
fs.cp("lakefs://quickstart/main/lakes.parquet", "lakefs://quickstart/main/test/lakes.parquet")
fs.cp("lakefs://quickstart/main/lakes.parquet", "lakefs://quickstart/main/test/lakes2.parquet")
fs.rm("lakefs://quickstart/main/README.md")
fs.rm("lakefs://quickstart/main/data/",  recursive=True)
fs.rm("lakefs://quickstart/main/images/",  recursive=True)
fs.rm("lakefs://quickstart/main/lakes.parquet")

# Bug 1: return empty list but have "test" dir
try:
    test1 = fsspec.filesystem("lakefs").ls("lakefs://quickstart/main/")
    print(test1)
    assert test1 != []
except:
    print("test1 not working")

# Bug 2: return empty list but have "test" dir

# Not working
try:
    test2 = fsspec.filesystem("lakefs").glob("lakefs://quickstart/main/test/*.parquet")
    print(test2)
    assert test2 != []
except:
    print("test2 not working")

duckdb.register_filesystem(fsspec.filesystem("lakefs"))

# Not working
try:
    test3 = duckdb.sql("FROM read_parquet('lakefs://quickstart/main/test/*.parquet')")
    print(test3)
except:
    print("test3 not working")

# Working
test4 = duckdb.sql("FROM read_parquet(['lakefs://quickstart/main/test/lakes2.parquet', 'lakefs://quickstart/main/test/lakes2.parquet'])")
print(test4)

I create repo structure like this:

lakefs://
└── quickstart/
    └── main/
        └── test/
            ├── lakes.parquet
            └── lakes2.parquet

I think bag 2 related with bag 1.

nicholasjng commented 4 months ago

Sorry for the delay, and thanks for that new repro. I ran this locally and was able to reproduce test case 1. We should therefore add it as a regression test. Would you like to do that?

Otherwise, looks good!

nicholasjng commented 4 months ago

The test failure happens because the path is not stripped of its leading protocol in test_ls_on_commit.

Changing L488 in fs.ls() to path = self._strip_protocol(path) solves the issue.

renesat commented 4 months ago

@nicholasjng I add test and find out why I have problem with line 430. It's because of incorrect cache update for directory with only directories in it. I fix it by removing "/" from the end of dirs names.

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (1586b02) 94.24% compared to head (e76dc9f) 94.24%. Report is 2 commits behind head on main.

:exclamation: Current head e76dc9f differs from pull request most recent head 3ed1f7a. Consider uploading reports for the commit 3ed1f7a to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #260 +/- ## ======================================= Coverage 94.24% 94.24% ======================================= Files 5 5 Lines 382 382 Branches 70 70 ======================================= Hits 360 360 Misses 13 13 Partials 9 9 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

renesat commented 4 months ago

@AdrianoKF Done👌

nicholasjng commented 4 months ago

Thank you!