fsspec / universal_pathlib

pathlib api extended to use fsspec backends
MIT License
251 stars 44 forks source link

Add support for Read-only ZipFileSystem + fix in Windows #66

Closed rdbisme closed 1 year ago

rdbisme commented 2 years ago

This should add support for a zip file zip://path_to_my_file.zip.

I'm not sure if there's an easy way to rewrite correctly the paths, but the UPath object creation is quite difficult to understand (maybe it would be easier just to expose the pathlib.Path protocol without subclassing?)

There's also a fix for Windows path of the type C:\\Users\\user\\file.zip. They were handled incorrectly.

normanrz commented 2 years ago

Hi @rdbisme, thanks for making this PR!

Could you please add a test for the Windows path handling fix that you created? The would be failing with the current main branch and successful with your branch.

I like the addition of the zip file system. Zip is a good example of where nested file systems are useful. In your PR, you are assuming a local filesystem as underlying filesystem. However, the zip files could also be stored on HTTP, S3, etc. It would be nice to come up with a solution that also works for these. We touched on this also in #60. Do you have any thoughts on that?

(maybe it would be easier just to expose the pathlib.Path protocol without subclassing?)

I think subclassing is very useful, because it makes typing and isinstance easier. I am open suggestions for how to make the object creation easier to understand, though.

Would your implementation also work with nested zip files, e.g. zip://path/to/first.zip/another.zip/actual.file?

rdbisme commented 2 years ago

@normanrz The current tests were already failing on Windows without the UPath._sanitize_path addition.

Would your implementation also work with nested zip files, e.g. zip://path/to/first.zip/another.zip/actual.file?

I'll add a test to check this

I like the addition of the zip file system. Zip is a good example of where nested file systems are useful. In your PR, you are assuming a local filesystem as underlying filesystem. However, the zip files could also be stored on HTTP, S3, etc. It would be nice to come up with a solution that also works for these. We touched on this also in https://github.com/fsspec/universal_pathlib/pull/60. Do you have any thoughts on that?

Could we have this merged and maybe fix the nested usage in another PR if it doesn't work?

normanrz commented 2 years ago

The current tests were already failing on Windows without the UPath._sanitize_path addition.

Not sure I understand. The tests run in our CI just fine, also on Windows. See https://github.com/fsspec/universal_pathlib/runs/7010368282?check_suite_focus=true

I like the addition of the zip file system. Zip is a good example of where nested file systems are useful. In your PR, you are assuming a local filesystem as underlying filesystem. However, the zip files could also be stored on HTTP, S3, etc. It would be nice to come up with a solution that also works for these. We touched on this also in #60. Do you have any thoughts on that?

Could we have this merged and maybe fix the nested usage in another PR if it doesn't work?

For the sake of stability, I'd rather not merge something that will needs to be changed with breaking changes shortly after. For example, we could choose to introduce a new syntax for s3 + zip paths (e.g. zip://actual.file::s3://bucket/path/to/file.zip as in the fsspec package). That would be quite different from the implementation that you are proposing and would require breaking changes that affect users.

rdbisme commented 2 years ago

The current tests were already failing on Windows without the UPath._sanitize_path addition.

Not sure I understand. The tests run in our CI just fine, also on Windows. See https://github.com/fsspec/universal_pathlib/runs/7010368282?check_suite_focus=true

What I mean is that the tests I've added to test the ZipFileSystem implementation were failing on my Windows machine (running under an MSYS environment, i.e. Git for Windows and similar environment). Let me check again if running them in Powershell would work instead.

For the sake of stability, I'd rather not merge something that will needs to be changed with breaking changes shortly after. For example, we could choose to introduce a new syntax for s3 + zip paths (e.g. zip://actual.file::s3://bucket/path/to/file.zip as in the fsspec package). That would be quite different from the implementation that you are proposing and would require breaking changes that affect users.

Does fsspec support nested filesystems?

normanrz commented 2 years ago

For the sake of stability, I'd rather not merge something that will needs to be changed with breaking changes shortly after. For example, we could choose to introduce a new syntax for s3 + zip paths (e.g. zip://actual.file::s3://bucket/path/to/file.zip as in the fsspec package). That would be quite different from the implementation that you are proposing and would require breaking changes that affect users.

Does fsspec support nested filesystems?

I haven't tried it, but the docs say so. I don't think it is trivial to implement that for UPath, though. For example, what would UPath("zip://folder::s3://bucket/path/to/file.zip") / "actual.file" resolve to? UPath("zip://folder/actual.file::s3://bucket/path/to/file.zip") or UPath("zip://folder::s3://bucket/path/to/file.zip/actual.file"). Also, how can nested paths be constructed in a pathlib-y way? Perhaps, UPath("s3://bucket/path/to/file.zip") / "zip://folder/actual.file"