fsspec / ipfsspec

readonly python fsspec implementation for IPFS
MIT License
21 stars 10 forks source link

Support symlinks and other types in ls #27

Open Erotemic opened 1 year ago

Erotemic commented 1 year ago

Fixes https://github.com/fsspec/ipfsspec/issues/26

Adds some documentation to reference where to learn more about the gateway rest calls, and fixes an issue where ls would fail when a directory contained symlinks or non directory non file data.

I'm not sure what raw or shards are, but ls will list them now. I exepct other parts of the library will have issues with dealing with theses as well.

Erotemic commented 1 year ago

please note that generally the async implementation of ipfsspec should be preferred, and probably the synchronous implementation should be dropped at some point (it's relatively easy to use the async implementation in a synchronous setting, but not the other way around.)

Can you give me an example of how its easy to use async in non-async code? From what I understand you have to create and manage an event loop, which is a lot more code overhead than I want.

Unless I'm missing something and it is a lot easier to use the async variant than I think it is, then I would strongly suggest not dropping the synchronous variant. It's important for immediate feedback when people are working interactively in IPython / Jupyter settings. Also I believe doctests are easier to write with sync code (you don't want to have two lines of setup loop / execute loop boilerplate in a doctest if you just want to access one file).

d70-t commented 1 year ago

Ok, maybe the formulation wasn't ideal. To make async code sync, you "just" have to run an event loop and wait for the async code to complete. This has relatively little overhead. On the other hand, if you want to make sync code async, you'll have to create a new thread for every task you start which is much less efficient, especially if you're doing I/O and want to fetch 100s of requests concurrently.

But I agree, there's a bit of boilerplate if you want to make async code look like sync code. Fortunately, fsspec's AsyncFileSystem provides a flag asynchronous which you can use to switch from async to sync API, so in this particular case, it's easy to use. In fact, you'll have the synchronous API by default.

One additional benefit you'll get from the synchronous-from-asynchronous API is, that fsspec allows to pass in multiple requests in one synchronous call (e.g. cat may be called with a list of paths). The sync-async translation is built such that this single synchronous call will spawn multiple asynchronous tasks which will all request their data in parallel, which removes a lot of round-trip time.

An example would be fsspec's HTTP implementation, which is implemented as async internally as well:

import fsspec
print(fsspec.open("http://httpbin.org/uuid").open().read())

or, of course, the AsyncIPFSFileSystem, which you should be able to use in just the same way as the synchronous one.

Erotemic commented 1 year ago

Ah, I see. Thank you for the clarification. I can just use:

import fsspec
fs_cls = fsspec.get_filesystem_class('ipfs')
fs = fs_cls(asynchronous=False)
results = fs.ls("bafybeief7tmoarwmd26b2petx7crtvdnz6ucccek5wpwxwdvfydanfukna")
print(results)

And that's still using the async implementation. Given that I'm perfectly comfortable with avoiding / dropping the synchronous version.

And the async implementation with asynchronous=True would look like this:

async def test_ipfs_async():
    import fsspec
    fs_cls = fsspec.get_filesystem_class('ipfs')
    fs = fs_cls(asynchronous=True)
    session = await fs.set_session()  # creates client
    result = await fs._ls("bafybeief7tmoarwmd26b2petx7crtvdnz6ucccek5wpwxwdvfydanfukna")
    print(result)
    await session.close()  # explicit destructor

if __name__ == '__main__':
    import asyncio
    asyncio.run(test_ipfs_async())