filecoin-project / devgrants

IPNI Reverse Index #1781

Open bajtos opened 3 months ago

bajtos commented 3 months ago

Open Grant Proposal: IPNI Reverse Index

Project Name: IPNI Reverse Index

Proposal Category: Developer and data tooling

Individual or Entity Name:

Proposer: @bajtos

Project Repo(s)

(Optional) Filecoin ecosystem affiliations: People who will implement these changes have nucleated from Protocol Labs and are working for new companies now.

(Optional) Technical Sponsor: @willscott

Do you agree to open source all work you do on behalf of this RFP under the MIT/Apache-2 dual-license?: Yes

Project Summary

In Filecoin, the main unit for storing user data is the Piece, identified by its Piece CID (PieceCID; see the Filecoin Spec). Data retrieval, on the other hand, operates at the payload level: the client requests data using a payload CID and receives back the IPLD DAG of the payload.

To drive improvements in availability of Filecoin content retrieval, we need to measure the quality of the retrieval service provided by Storage Providers. The on-chain state, events and history provide only the PieceCID information about stored data. Retrieval probes need to map PieceCID to PayloadCIDs to check if the content can be retrieved. There is no straightforward solution for such mapping right now.

This project aims to enable retrieval probes to query IPNI to obtain a sample of Payload CIDs advertised by a given Storage Provider for a given deal (PieceCID).
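
To make the intended mapping concrete, here is a minimal in-memory sketch of a reverse index keyed by (provider, PieceCID). It is a toy model for illustration only, not how storetheindex stores data; the class name, method names, and the placeholder identifiers in the usage note are all assumptions.

```python
import random
from collections import defaultdict


class ToyReverseIndex:
    """In-memory stand-in for a (provider, PieceCID) -> payload CIDs index.

    Hypothetical illustration; names and structure are not from the proposal.
    """

    def __init__(self):
        self._entries = defaultdict(list)

    def advertise(self, provider_id, piece_cid, payload_cids):
        # Record the payload CIDs a provider announced for one piece.
        self._entries[(provider_id, piece_cid)].extend(payload_cids)

    def sample(self, provider_id, piece_cid, count=1, rng=None):
        # Return up to `count` randomly chosen entries, or [] if nothing is indexed.
        rng = rng or random.Random()
        known = self._entries.get((provider_id, piece_cid), [])
        return rng.sample(known, min(count, len(known)))
```

A retrieval probe would call something like `index.sample("provider-placeholder", "piece-cid-placeholder", count=2)` and then try to retrieve each returned payload CID from the provider.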

The project will require changes in the IPNI indexer implementation (storetheindex, cid.contact), and the index provider implemented by Curio.

See the following document for more information: https://docs.google.com/document/d/1jhvP48ccUltmCr4xmquTnbwfTSD7LbO1i1OVil04T2w

Impact

The on-chain state, events, and history only provide the PieceCID information about stored data. Retrieval probes need to map PieceCID to PayloadCIDs to probe for retrievability. There is no straightforward solution for such mapping right now.

If we get this right, we will empower developers to build alternative retrieval-probing networks, new reputation systems, and an array of diagnostic tooling.

If we don’t get this right or make no improvements, building a retrieval probe will remain a technical challenge that requires deep knowledge of Filecoin actors, on-chain state, and the IPNI advertisement protocol. It will be unlikely that alternative retrieval-probing networks emerge.

When this project is successful, Spark - the retrieval-probing network powered by Filecoin Stations - will be able to test the retrieval of deals made using the recently introduced Direct Data Onboarding (DDO). Such probing will drive further improvements in the retrieval success rate of FIL+ deals. If it is very successful, at least one other retrieval-probing network will use this reverse index, creating healthy diversity and competition in the retrieval-probing ecosystem.

UPDATE 2024-10-02

After submitting the application, I learned that Boost will soon be deprecated and replaced by Curio. Curio will not support Graphsync and StorageMarket deals; it will only support Trustless HTTP Gateway retrievals and DDO deals. This will make it virtually impossible for third parties (e.g. retrieval checkers like Spark) to find which payload CIDs are stored in FIL+ deals made with Curio.

Outcomes

  1. When Curio advertises data to IPNI, it does so in a way that enables third parties like Spark to link Filecoin deals to payload blocks.
  2. IPNI at cid.contact provides a new REST API endpoint for sampling payload blocks linked to a Filecoin deal.
  3. A specification or documentation allowing alternative provider implementations like Venus to implement the same mechanism.

Important

The desired end-to-end workflow from the user’s perspective:

  1. A Piece is added to a Curio instance running a publicly released version of Curio.
  2. Curio announces payload blocks included in that Piece to IPNI. (This happens automatically in the background.)
  3. A client queries cid.contact to obtain a sample of the payload blocks from the Piece.
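
Step 3 of the workflow above could look roughly like the following from the client's side. This is a sketch under stated assumptions: the endpoint path, the `count` query parameter, and the `samples` response field are hypothetical placeholders, because the actual API shape is to be defined in the design spec. The code only builds the request URL and validates a response body; it performs no network calls.

```python
import json
from urllib.parse import quote, urlencode

# Hypothetical endpoint shape; the proposal does not define the path or parameters.
SAMPLE_PATH = "/sample/{provider_id}/{piece_cid}"


def build_sample_url(provider_id, piece_cid, count=1, base_url="https://cid.contact"):
    # Percent-encode both identifiers so they are safe as single path segments.
    path = SAMPLE_PATH.format(
        provider_id=quote(provider_id, safe=""),
        piece_cid=quote(piece_cid, safe=""),
    )
    return f"{base_url.rstrip('/')}{path}?{urlencode({'count': count})}"


def parse_sample_response(body):
    # Assumed response shape: {"samples": ["<payload CID>", ...]}
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("unexpected response shape")
    samples = data.get("samples", [])
    if not isinstance(samples, list) or not all(isinstance(c, str) for c in samples):
        raise ValueError("unexpected response shape")
    return samples
```

A probe would fetch the URL returned by `build_sample_url(...)`, pass the response body to `parse_sample_response(...)`, and then attempt to retrieve each sampled payload CID from the Storage Provider.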

Please refer to the design doc for more details: https://docs.google.com/document/d/1jhvP48ccUltmCr4xmquTnbwfTSD7LbO1i1OVil04T2w/

How to measure the success:

Adoption, Reach, and Growth Strategies

At a high level, our target audience consists of all storage providers. We want them to adopt the latest Curio version and configure it to advertise correctly to IPNI.

From another perspective, our target audience is the builder community: developers who may want to build an alternative retrieval-probing network, new reputation systems, or new diagnostic tooling, or who may use the new reverse index for use cases we cannot imagine yet.

To streamline adoption, we are including documentation updates as part of this project.

Development Roadmap

Milestone 1: Design Spec

Deliverables:

Out of scope:

Planning:

Milestone 2: IPNI Implementation

Deliverables:

Budget:

Milestone 3: Curio Implementation

Deliverables:

Budget:

Total Budget Requested

Will send to grants@fil.org

Maintenance and Upgrade Plans

We expect the IPNI and Curio maintainers to maintain these new features as part of their existing maintenance work arrangements.

Team

Team Members

Team Member LinkedIn Profiles

Team Website

https://filspark.com/

Relevant Experience

Team code repositories

n/a

Additional Information

You can find us all in the Filecoin Slack workspace.

The best email address for discussing the next steps: spark@meridian.space