fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

Sanitize some common TAR path occurrences such as leading dots ./ #1568

Open mxmlnkn opened 7 months ago

mxmlnkn commented 7 months ago

Hi there,

we already wrote back and forth a bit in the issue in the smart_open repository. I wanted to give fsspec.fuse a quick try using the tar/libarchive backend. Unfortunately, my very first test failed. I created the test tar with:

echo foo > large
tar -cf ./large{.tar,}
tar tvlf large.tar  # -rwx------ user/user   4 2024-04-10 23:04 ./large

Then, I tried to mount it with:

python3.12 -c '
from fsspec.implementations.tar import TarFileSystem as tafs
fs = tafs("large.tar")
import fsspec.fuse
fsspec.fuse.run(fs, "", "mounted")'

and access it with:

ls -la mounted/
# total 4
# drwxrwxrwx 0 user user    0 Apr 10 23:07 .
# drwxrwxrwx 0 user user    0 Apr 10 23:07 .
# drwx------ 1 user user 4096 Apr 10 23:04 ..
ls -la mounted/./
# total 4
# drwxrwxrwx 0 user user    0 Apr 10 23:07 .
# drwxrwxrwx 0 user user    0 Apr 10 23:07 .
# drwx------ 1 user user 4096 Apr 10 23:04 ..

As you can see, the leading dot is interpreted as a real folder even though it isn't one. And although that folder is listed, FUSE already normalizes paths before the implementation is called, so it is not possible to access the large file.

I think the TAR and libarchive backends should normalize paths to some degree: at least leading dots, and maybe also . and .. inside the path. Funnily enough, I had the exact same issue with fuse-archive: https://github.com/google/fuse-archive/issues/2 . There are also more complex cases, e.g., try this:

tar -cf large.tar ./././large
tar tvlf large.tar
# -rwx------ user/user   4 2024-04-10 23:04 ./././large
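The kind of normalization suggested above could look roughly like the following. This is only a minimal sketch: normalize_member_name is a hypothetical helper, not an existing fsspec function, and it assumes that member names use "/" as the separator and that a ".." which would climb above the archive root should simply be dropped.

```python
import posixpath


def normalize_member_name(name: str) -> str:
    """Hypothetical helper: collapse ".", ".." and repeated slashes in a member name."""
    # Anchor the name at "/" so that ".." can never climb above the archive root.
    normalized = posixpath.normpath("/" + name)
    # Drop the artificial root again; the archive root itself becomes "".
    return normalized.lstrip("/")


if __name__ == "__main__":
    for raw in ["./large", "./././large", "bar", "a/./b/../c"]:
        print(repr(raw), "->", repr(normalize_member_name(raw)))
```

With something like this applied when the backend builds its directory listing, ./large and large would end up under the same key. Whether .. should be resolved silently or rejected is a separate design decision.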

I was not able to create a path with .. in it. GNU tar strips it, and also strips the leading dots, whenever a .. occurs. But it might be possible to create such TARs with Python's tarfile and/or with other tools.
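A possible way to build such test archives with Python's tarfile is sketched below. It relies on the assumption that tarfile stores TarInfo.name as given when a member is added via addfile; build_tar and list_names are hypothetical helper names, and the member names are made up for illustration.

```python
import io
import tarfile


def build_tar(names):
    """Return the bytes of an in-memory TAR holding one small file per name."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in names:
            payload = b"foo\n"
            info = tarfile.TarInfo(name=name)  # name is passed through unchanged
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_names(data):
    """Read the archive back and return the member names as stored."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as archive:
        return archive.getnames()


stored = list_names(build_tar(["./././large", "dir/../large2", "bar"]))
print(stored)
```

Writing the returned bytes to a file gives an archive that can be passed to TarFileSystem for testing.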

Specifying path = "./" to fsspec.fuse.run kinda works around this issue and large will be visible in the mount point, but:

echo foo > bar
echo foo > large
tar -cf large.tar bar ./large
tar tvlf large.tar 
# -rwx------ user/user   4 2024-04-10 23:20 bar
# -rwx------ user/user   4 2024-04-10 23:04 ./large
python3.12 -c '
from fsspec.implementations.tar import TarFileSystem as tafs
fs = tafs("large.tar")
import fsspec.fuse
fsspec.fuse.run(fs, "", "mounted")' &
ls -la mounted
# ls: cannot access 'mounted/bar': No such file or directory
# total 4
# drwxrwxrwx 0 user user    0 Apr 10 23:21 .
# drwxrwxrwx 0 user user    0 Apr 10 23:21 .
# drwx------ 1 user user 4096 Apr 10 23:20 ..
# ?????????? ? ?       ?          ?            ? bar
fusermount -u mounted
python3.12 -c '
from fsspec.implementations.tar import TarFileSystem as tafs
fs = tafs("large.tar")
import fsspec.fuse
fsspec.fuse.run(fs, "./", "mounted")' &
ls -la mounted
# total 4
# drwxrwxrwx 0 user user    0 Apr 10 23:23 .
# drwx------ 1 user user 4096 Apr 10 23:20 ..
# -rwx------ 1 user user    4 Apr 10 23:04 large

Note also how something goes wrong with bar resulting in the metadata not getting shown.

Btw, I was wondering about the path parameter to fsspec.fuse.run. It didn't make sense to me that such a required parameter exists, so I had to read up on it in the (pretty good) online manual. Still, I'd prefer the parameter to be something like prefix = "/", i.e., a self-explanatory name and a default that should cover 99% of the use cases.

mxmlnkn commented 7 months ago

The FUSE mount does not seem to work at all for files that do not have any prefix, i.e., files like the bar above.

echo foo > bar
echo foo > bar2
tar -cf foo.tar bar bar2

python3.12 -c '
from fsspec.implementations.tar import TarFileSystem as tafs
fs = tafs("foo.tar")
import fsspec.fuse
fsspec.fuse.run(fs, "./", "mounted")' &
ls -la mounted

I cannot get foo.tar mounted at all. No matter what I try for the second argument:

martindurant commented 7 months ago

The TAR-specific path issues are probably independent of FUSE, and maybe you could help write some test cases (or fixes??) as a PR.