filecoin-project / lassie

A minimal universal retrieval client library for IPFS and Filecoin

Support pathing semantics #82

Closed hannahhoward closed 1 year ago

hannahhoward commented 1 year ago

What

Lassie should support the following semantics for paths, which match the gateway semantics:

lassie fetch CID/path/to/file

This should return a CAR containing the blocks for CID, the intermediate blocks between CID and the root CID of file, and all of the blocks that make up file.

lassie fetch CID/path/to/subdir

This should return a CAR containing the blocks for CID, the intermediate blocks between CID and the root CID of subdir, and all of the blocks that make up the directory structure of subdir itself, but NOT any files or directories contained in subdir (ultimately this means just a single block unless the directory is a HAMT).

lassie fetch CID/path/to/subdir?all

This should return a CAR containing the blocks for CID, the intermediate blocks between CID and the root CID of subdir, all of the blocks that make up the directory structure of subdir, AND any files or directories contained in subdir, AND any additional subdirectories, recursively.
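As a rough illustration of the three request forms above, here is a minimal sketch of splitting a `CID/path?all` argument into its parts. The `FetchRequest` type and `ParseFetchArg` function are hypothetical names made up for this example and are not part of lassie; the CID is treated as an opaque string and path segments are not percent-decoded.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// FetchRequest is a hypothetical type for illustration only; it is not lassie's API.
type FetchRequest struct {
	Root     string   // root CID, kept as an opaque string (not validated here)
	Segments []string // path segments below the root; empty when no path is given
	All      bool     // true when the "all" query parameter is present
}

// ParseFetchArg splits an argument of the form CID[/path/to/thing][?all].
func ParseFetchArg(arg string) (FetchRequest, error) {
	rawPath, rawQuery, _ := strings.Cut(arg, "?")
	query, err := url.ParseQuery(rawQuery)
	if err != nil {
		return FetchRequest{}, fmt.Errorf("invalid query %q: %w", rawQuery, err)
	}
	// Drop empty segments so leading, trailing, or doubled slashes are tolerated.
	var parts []string
	for _, p := range strings.Split(rawPath, "/") {
		if p != "" {
			parts = append(parts, p)
		}
	}
	if len(parts) == 0 {
		return FetchRequest{}, fmt.Errorf("missing root CID in %q", arg)
	}
	_, all := query["all"]
	return FetchRequest{Root: parts[0], Segments: parts[1:], All: all}, nil
}

func main() {
	// "bafyExampleRoot" is a placeholder, not a real CID.
	for _, arg := range []string{
		"bafyExampleRoot/path/to/file",
		"bafyExampleRoot/path/to/subdir?all",
	} {
		req, err := ParseFetchArg(arg)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", req)
	}
}
```

Whether the parsed path is kept as-is on the request or converted to a selector straight away is the open question discussed under parameter passing below.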

Selector conversion

I believe there is a relatively straightforward conversion between this structure and selectors:

For each segment of the path, I think we need an ExploreInterpretAs with KnownADL "unixfs" followed by an ExploreFields.

At the end of the path, I believe we want an ExploreInterpretAs "unixfs-preload" followed by a "Matcher" selector, except in the case of the all query parameter, where I believe we can simply use an ExploreRecursiveAll selector.
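To make the proposed nesting concrete, here is a small sketch that renders the selector shape as text for a given path. It does not use go-ipld-prime and does not build a real selector; `DescribeSelector` is a hypothetical helper, and the node names are just the ones used in the description above.

```go
package main

import "fmt"

// DescribeSelector renders, as plain text, the selector nesting proposed above.
// It is an illustration only and does not construct a go-ipld-prime selector.
func DescribeSelector(segments []string, all bool) string {
	// Innermost node: what to do once the end of the path is reached.
	inner := `ExploreInterpretAs("unixfs-preload", Matcher)`
	if all {
		inner = "ExploreRecursiveAll"
	}
	// Wrap outward from the last segment to the first, so the first
	// segment ends up as the outermost ExploreInterpretAs/ExploreFields pair.
	for i := len(segments) - 1; i >= 0; i-- {
		inner = fmt.Sprintf(`ExploreInterpretAs("unixfs", ExploreFields(%q, %s))`, segments[i], inner)
	}
	return inner
}

func main() {
	fmt.Println(DescribeSelector([]string{"path", "to", "file"}, false))
	fmt.Println(DescribeSelector([]string{"path", "to", "subdir"}, true))
}
```

With an empty path the sketch returns only the terminal node, which matches fetching the root CID directly.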

Parameter passing

We will need to add either a path & query string to the RetrievalRequest type, or a selector. It's kind of implementer's choice and I'm not sure I have an opinion myself, though early conversion to a selector locks us into selector traversal for bitswap (see https://github.com/ipfs/go-fetcher for what this ends up looking like).

willscott commented 1 year ago

The full pathing semantics requested for serving gateway traffic efficiently are described at https://www.notion.so/pl-strflt/HTTP-Gateway-Requests-for-Graphs-as-CARs-001d2a9f5a35418bb0fb7d9d182d24ec?pvs=4#096312bc3dc7471cbbea9e97645bf62f

rvagg commented 1 year ago

Mostly being done in https://github.com/filecoin-project/lassie/pull/119

Current team understanding/agreement of what we're aiming for (initially at least) for Rhea:

  1. All requests are CID + path, where path may be empty.
  2. The path is interpreted as UnixFS where possible (plain IPLD where not)
  3. All blocks from the CID to the path terminus are included in the returned data
  4. Requests may either be “full” or “shallow”
    1. A “full” request will attempt to fetch and return all blocks that make up the DAG that exists below the path terminus
    2. A “shallow” request will attempt to fetch and return all blocks that make up the DAG that represents whatever the path terminates on:
      1. In the case of a UnixFS file, all blocks that make up the file
      2. In the case of a UnixFS directory, all blocks that make up just the directory—if it’s a HAMT, then all blocks of the HAMT, but no more.
      3. In the case of non-UnixFS data, just the terminus block.
    3. Exactly how “full” or “shallow” is passed to lassie is TBD; a “depth=int” query parameter doesn’t seem to make sense. Options:
      1. ?full + none for shallow
      2. ?depthType=full + ?depthType=shallow (or none, shallow default?)
      3. ?complete=full + ?complete=shallow (or none, shallow default?)
  5. Range requests are an upstream concern—it is assumed the entire resource will be fetched and just the range requested is served back to the user.
  6. HEAD requests are just a special case of range requests that return the first 1024 bytes of the UnixFS data, and are therefore an upstream concern.
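
Point 4 above can be summarised as a small decision table. The sketch below is only an illustration of that agreement, using hypothetical `Scope` and `TerminusKind` types that are not lassie's; it returns a description of what gets fetched at and below the path terminus, in addition to the blocks from the CID to the terminus (point 3).

```go
package main

import "fmt"

// Scope and TerminusKind are hypothetical types for this sketch, not lassie's API.
type Scope int

const (
	Shallow Scope = iota
	Full
)

type TerminusKind int

const (
	UnixFSFile TerminusKind = iota
	UnixFSDirectory
	PlainIPLD
)

// BlocksAtTerminus describes which blocks a request covers at the path terminus.
func BlocksAtTerminus(scope Scope, kind TerminusKind) string {
	if scope == Full {
		// A full request covers everything below the terminus, whatever it is.
		return "entire DAG below the terminus"
	}
	switch kind {
	case UnixFSFile:
		return "all blocks of the file"
	case UnixFSDirectory:
		return "directory blocks only (all HAMT blocks if sharded)"
	default:
		return "terminus block only"
	}
}

func main() {
	for _, kind := range []TerminusKind{UnixFSFile, UnixFSDirectory, PlainIPLD} {
		fmt.Println("shallow:", BlocksAtTerminus(Shallow, kind))
	}
	fmt.Println("full:", BlocksAtTerminus(Full, UnixFSDirectory))
}
```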
willscott commented 1 year ago