Retrieval Attestation - Githubissues

bajtos commented 1 year ago

Checklist

[X] This is not a new feature or an enhancement to the Filecoin protocol. If it is, please open an FIP issue.
[X] This is not brainstorming ideas. If you have an idea you'd like to discuss, please open a new discussion on the Boost forum and select the category as Ideas.
[X] I have a specific, actionable, and well motivated feature request to propose.

Boost component

[ ] boost daemon - storage providers
[ ] boost client
[ ] boost UI
[ ] boost data-transfer
[ ] boost index-provider
[X] Other

What is the motivation behind this feature request? Is your feature request related to a problem? Please describe.

In Filecoin Station, we are building SPARK - a module that periodically checks retrievability of content from Filecoin Storage Providers. At the moment, we are adding FIL rewards for performing these checks. In order to combat fraud, we would like Boost to provide retrieval attestations that will allow 3rd parties to verify that a client performed a retrieval request from a particular provider.

You can learn more in SPARK Content retrieval attestation and Meridian Design Doc 03: Evaluation dissected.

In short:

- We want the SPARK MERidian evaluation service to be able to verify that the SPARK module performed the retrieval check as the SPARK orchestrator defined it.
- We also want a generic solution that can be used by other projects retrieving content from Filecoin and IPFS. This is, in particular, important to prevent retrieval servers (e.g. SPs) from being able to distinguish SPARK retrieval requests from requests made by other clients.

Describe the solution you'd like

(1) The retrieval client performing a retrieval request includes a new field in the request metadata - retrieval_id containing a string value. We recommend clients send a SHA-384 hash of the actual identifier.

(2) The retrieval server returns an attestation signed with the server’s private key - the same key as used for libp2p peer identity. The attestation payload includes the following metadata:

- retrieval_id supplied by the client,
- cid being requested, and
- protocol used (bitswap, graphsync, http).
- (Possibly more in the future.)

This is a high-level proposal that intentionally excludes details. I’d like us to first agree whether this feature is feasible at the high level, before we dive deeper into details.

Describe alternatives you've considered

No response

Additional context

What kind of feedback I am looking for

Is this feasible to implement in Boost? Is the design, as detailed below, compatible with the project’s vision? Can you suggest a better approach?
What are the next steps to make this happen? Our timeline is a bit tight: we would like to get this rolled out to at least some SPs by the end of September 2023.
Who are the best people to help us clarify implementation details and lay out a plan to get this done & deployed to SPs running Boost?
We will need retrieval clients like Lassie to support this new feature, too. For example, Lassie should forward the retrieval id from the client to the SP and then forward the attestation token back from SP to the client. Do you have any suggestions on how to get this done?

How this helps SPARK

In our current design, each retrieval job is defined by a centralised orchestrator service, which assigns a unique id to each job. (In the future, we want to move to a decentralised orchestrator based on DRAND. I believe such a design will still give us a way to deterministically derive a unique retrieval id.)
The SPARK module will deterministically derive retrieval_id from this job id, perform the retrieval check and finally send the attestation returned by SP alongside other retrieval statistics.
The fraud detection service will verify that the retrieval attestation reported by the client matches the retrieval job as it was defined by the orchestrator.
Dishonest clients will be unable to create fake attestations. The client must send a new retrieval request to the designated SP for each job. (Unless the client and the SP are colluding and the client has access to SP’s private key, but we have other measures to minimise the impact of that.)
- Clients cannot reuse an attestation from a different job because the retrieval id in the attestation would not match the retrieval id derived from the job id.
- Clients must send the retrieval request to the SP defined by the job. Otherwise, the signature in the attestation would not match the public key in SP’s multiaddr.
- Clients must retrieve the given CID using the given protocol. Otherwise, the attestation payload would show different values than expected by the verifier.
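
The three checks above can be sketched as a small verifier. This is only an illustration: the `RetrievalJob` type, the `claim_mismatches` helper, and the rule "retrieval id = SHA-384 of the job id" are assumptions made for the example, not part of the proposal. The payload field names follow the JWT payload described later in this issue, and signature verification is assumed to happen separately.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalJob:
    """Hypothetical job definition as issued by the orchestrator."""
    job_id: str
    provider_peer_id: str
    cid: str
    protocol: str


def expected_retrieval_id(job: RetrievalJob) -> str:
    # Assumption: the retrieval id is the SHA-384 hex digest of the job id.
    # The actual derivation used by SPARK is not specified in this issue.
    return hashlib.sha384(job.job_id.encode("utf-8")).hexdigest()


def claim_mismatches(job: RetrievalJob, payload: dict) -> list:
    """Return the names of payload fields that do not match the job."""
    expected = {
        "iss": job.provider_peer_id,
        "retrv_rid": expected_retrieval_id(job),
        "retrv_cid": job.cid,
        "retrv_proto": job.protocol,
    }
    return [name for name, value in expected.items() if payload.get(name) != value]
```

An empty result means the attestation payload matches the job; a non-empty result names the fields that differ, e.g. `["retrv_rid"]` when an attestation from another job is reused.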

How this can help other retrieval clients

I feel the proposal is generic enough to support different usages. Since the retrieval id is a hash of arbitrary data, it’s possible to pack literally anything into the retrieval id, get the SP to sign that, and later verify that the SP signed the expected field values.

Ideally, we would like to have “Proof of Retrieval”. Unfortunately, such proof is still an open problem. We think that Retrieval Attestation can get us somewhat closer to that ideal.

For example, I can imagine that a browser service worker retrieving content from a dCDN like Saturn could use the retrieval attestation to attribute credit to the specific SP that provided the content, allowing content providers to reward SPs based on how many retrievals they helped to serve.

The proposed format based on JWT can be extended to support signature chains, e.g. the outer attestation token created by an untrusted gateway can wrap an inner attestation token produced by the SP from which the gateway retrieved the content.

Technical details: retrieval id

The implementation should support arbitrary formats of retrieval ids. However, we recommend all clients use a SHA-384 hash of the original retrieval identifier.

Why a hash:
- We want requests coming from different clients to be indistinguishable from each other. This way, storage providers cannot prioritise specific classes of clients (like SPARK and Reputation Bot) to provide better performance for the clients and artificially improve their reputation scores.
- In decentralised networks, mapping a retrieval id (like GUID) to request properties is often not feasible. Instead, we must compose the retrieval id from the properties needed for subsequent verification. For example, if we want to verify that a retrieval was performed by a peer with a given peer_id using the DRAND seed from the epoch N, we can compose the retrieval id as N;peer_id, e.g. 539;12D3KooWRH71QRJe5vrMp6zZXoH4K7z5MDSWwTXXPriG9dK8HQXk. Now if we send this string as the retrieval id, then the remote party can inspect the format of the string to guess what software is making the request. Additionally, the payload can be too large for the underlying protocol. Hashing the original id solves both issues.
Why SHA-384:
- SHA-256 is vulnerable to length extension attacks. I don’t have any particular attack vector in mind; just being cautious.
- Blake3 would be a great solution, but it’s a hot new thing and thus not supported natively by browsers yet.
- SHA-384 seems to be a good compromise - it’s not vulnerable to length extension attacks, and it’s widely supported: Go, Rust and WebCrypto API in browsers and Node.js.
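
As a minimal sketch of the recommendation above, the composed identifier `N;peer_id` can be hashed with SHA-384 from the Python standard library. The function name `derive_retrieval_id` and the hex encoding of the digest are assumptions for illustration; the proposal only says that clients should send a SHA-384 hash of the original identifier.

```python
import hashlib


def derive_retrieval_id(drand_round: int, peer_id: str) -> str:
    """Compose "N;peer_id" and return its SHA-384 digest as lowercase hex."""
    original_id = f"{drand_round};{peer_id}"
    return hashlib.sha384(original_id.encode("utf-8")).hexdigest()


rid = derive_retrieval_id(539, "12D3KooWRH71QRJe5vrMp6zZXoH4K7z5MDSWwTXXPriG9dK8HQXk")
print(len(rid))  # 96 hex characters, i.e. 384 bits
```

The digest has a fixed length regardless of how much data went into the original identifier, and it does not reveal the format of that identifier to the server.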

Technical details: attestation string

I propose using JWT for the attestation string. JWT is a widely used format with good support in many programming languages. It’s used by other projects in the Web3 space, too - most notably UCAN.

In its compact form, JSON Web Tokens consist of three parts separated by dots (.), which are:

Header
Payload
Signature

Therefore, a JWT typically looks like this: Header.Payload.Signature
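
To make the compact form concrete, here is a small sketch that builds and then splits a `Header.Payload.Signature` string using unpadded base64url, as JWT does. The helper names are made up for this example, and the signature is a 64-byte placeholder rather than a real signature.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url without trailing '=' padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the padding that JWT strips, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def split_compact_jwt(token: str):
    """Split a compact JWT into (header dict, payload dict, signature bytes)."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))
    return header, payload, b64url_decode(signature_b64)


# Placeholder token: the signature bytes are zeros, not a valid signature.
token = ".".join([
    b64url_encode(json.dumps({"alg": "EdDSA", "typ": "JWT"}).encode("utf-8")),
    b64url_encode(json.dumps({"retrv_proto": "http"}).encode("utf-8")),
    b64url_encode(b"\x00" * 64),
])
header, payload, signature = split_compact_jwt(token)
```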

JWT Header

{
  "alg": "EdDSA",
  "typ": "JWT",
  "rav": "0.1.0"
}

This is a standard JWT header, plus the extra rav field.

alg - the signature algorithm used to sign the attestation
typ - the type of the token; this will always be ‘JWT’
rav - “Retrieval Attestation version” (so we can tell which version of the format the attestation was issued in)

JWT Payload

{
  "iss": "12D3KooWRH71QRJe5vrMp6zZXoH4K7z5MDSWwTXXPriG9dK8HQXk",
  "retrv_rid": "38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b",
  "retrv_cid": "bafybeib36krhffuh3cupjml4re2wfxldredkir5wti3dttulyemre7xkni",
  "retrv_proto": "graphsync"
}

iss - “Issuer” ID of who created the attestation - the public key from the libp2p identity of the peer serving the retrieval. This field is defined by the JWT standard.
retrv_rid: the retrieval id provided by the client
retrv_cid: the CID retrieved
retrv_proto: the protocol used - graphsync, bitswap or http

We expect more fields will be added in the future. For example, when a retrieval request specifies an IPLD selector, the attestation payload can include retrv_selector field describing what subset of the Merkle tree was requested.

For the initial version, we want to introduce only the fields needed by SPARK.

JWT Signature

Quoting from JWT Introduction

To create the signature part you have to take the encoded header, the encoded payload, a secret, the algorithm specified in the header, and sign that.

For example if you want to use the HMAC SHA256 algorithm, the signature will be created in the following way:

HMACSHA256(
  base64UrlEncode(header) + "." +
  base64UrlEncode(payload),
  secret)

The signature is used to verify the message wasn't changed along the way, and, in the case of tokens signed with a private key, it can also verify that the sender of the JWT is who it says it is.

Of course, we will use a different algorithm than HMAC SHA256. Maybe Ed25519? The algorithm will most likely depend on the algorithm used by the libp2p identity key-pair.
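
For reference, the quoted HMAC SHA256 construction can be sketched with the Python standard library alone. This is not the algorithm proposed here: an actual attestation would replace the HMAC step with a signature made by the libp2p identity key (e.g. Ed25519), which needs a third-party library in Python. The function names and the sample secret are assumptions for this example.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(data: bytes) -> str:
    """Base64url without trailing '=' padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(header: dict, payload: dict, secret: bytes) -> str:
    """Build Header.Payload.Signature using HMAC SHA256 over the first two parts."""
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    mac = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return signing_input + "." + b64url_encode(mac)


def verify_hs256(token: str, secret: bytes) -> bool:
    """Recompute the MAC over Header.Payload and compare in constant time."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return hmac.compare_digest(b64url_encode(expected).encode("utf-8"), signature.encode("utf-8"))


token = sign_hs256({"alg": "HS256", "typ": "JWT"}, {"retrv_proto": "http"}, b"demo-secret")
```

With a public-key algorithm the structure stays the same: the signer signs the `Header.Payload` string with the private key and the verifier checks it against the public key identified by `iss`.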

Tagging @juliangruber, @patrickwoodhead and @willscott for visibility.

willscott commented 1 year ago

Once we settle on something for boost, we should consider updating the FRC to include these semantics so that other markets also know what is expected.
For small downloads of an individual CID, signing will cost more than delivering the CID. Do you have thoughts on the overhead being introduced, and if this is a reasonable burden to incur?
Introducing this in bitswap / graphsync seems ‘hard’, as those protocols do not already provide a clear point of extension to add this function. Adding it with headers in HTTP seems like the most plausible option.
Is there a reason you don’t ask for the attestation to include the response data itself?

dirkmc commented 1 year ago

The proposal sounds good to me 👍 As Will points out, probably the easiest place to add retrieval attestations is in the HTTP protocol. Another advantage of HTTP is that it is layered. You can build an http server that provides retrieval attestation, that sits in front of booster-http. That way your team won't get blocked by the Boost team's availability.

juliangruber commented 1 year ago

I wonder how far HTTP level attestation is going to get us. I agree that from a technical perspective this is the way to go. However, the main purpose of SPARK is to collect data on retrievability, and I have two concerns:

Testing all of the protocols is important. I think with just HTTP attestation we can still create a meaningful flow, but it would be great to get a feeling for a timeline for implementing attestation for graphsync and bitswap too, which I think we will need eventually. Is it out of the question, is it a short/long issue, etc.
Do enough SPs run an HTTP gateway? This is a PMF question now - if SPs will be SPARK's clients, we can tell them hey in order to participate please run an HTTP gateway. If however other parties will be clients, they will likely be interested in retrievability for more protocols

TLDR: If we start with HTTP I think we will have a good iteration platform. A timeline feel for attestation of other protocols will be useful.

willscott commented 1 year ago

@juliangruber Several of the other efforts have decided to rally around HTTP so that does seem like the place to focus at the moment - see also https://www.notion.so/Project-HTTP-UP-7a3daf6633214ae6b31c5a67b2ac17f0 if you haven't yet.

juliangruber commented 1 year ago

This is great! Do you think non-plus SPs will follow along? Or do you think it's fine to target the level of FIL+?

willscott commented 1 year ago

I think the bulk of SPs offering any form of retrieval will prefer HTTP as the protocol, as it's easiest to manage / control from their end.

bajtos commented 1 year ago

Hi folks, thank you for the constructive feedback and discussion. We had many discussions about this proposal in the last few days and need to change course slightly.

Our plan is to support HTTP retrievals only, as that seems to be the direction for the future of Filecoin retrievals. That does not mean these attestations cannot be implemented for Graphsync and Bitswap, just that it's not something SPARK is interested in.
The JWT-based attestation tokens would consume too much bandwidth. With the sample payload I showed above, the attestation token is ~500 bytes. I'll explore different options with a more efficient representation.
Creating a new signature for each retrieval request adds a non-negligible CPU cost. We need to measure the impact of these signatures on booster-http performance and document the implications so that SPs know what to expect.
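
The bandwidth estimate can be reproduced with a rough calculation. The sketch below assumes compact JSON serialisation, unpadded base64url, and a 64-byte signature (the size of an Ed25519 signature); the helper names are made up for this example, and the real token size depends on the final format.

```python
import json


def b64url_len(num_bytes: int) -> int:
    """Length of unpadded base64url text for num_bytes of input."""
    return (num_bytes * 4 + 2) // 3


def compact_jwt_size(header: dict, payload: dict, signature_bytes: int = 64) -> int:
    """Estimate the size of Header.Payload.Signature in bytes."""
    json_sizes = [
        len(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    # Three base64url parts plus the two '.' separators.
    return sum(b64url_len(n) for n in json_sizes + [signature_bytes]) + 2


sample_header = {"alg": "EdDSA", "typ": "JWT", "rav": "0.1.0"}
sample_payload = {
    "iss": "12D3KooWRH71QRJe5vrMp6zZXoH4K7z5MDSWwTXXPriG9dK8HQXk",
    "retrv_rid": "38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b",
    "retrv_cid": "bafybeib36krhffuh3cupjml4re2wfxldredkir5wti3dttulyemre7xkni",
    "retrv_proto": "graphsync",
}
size = compact_jwt_size(sample_header, sample_payload)
print(size)  # roughly 500 bytes for this sample
```

Most of the size comes from the payload, where the hex-encoded retrieval id and the text-encoded CID and peer ID are expanded again by base64url.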

I'll post more updates as we get more clarity about what SPARK needs and what is feasible to implement.

One new feature we have already identified:

We need booster-http to create its own key-pair (identity) and add the public key to the records it advertises to IPNI, following the approach already used by advertisements for Graphsync and Bitswap. (@willscott, please correct me if I got this wrong.)

bajtos commented 1 year ago

Update: after more discussions, we have settled on an extra content-type parameter allowing clients to request an additional metadata block to be appended after the CAR stream response. I opened an IPIP to discuss the details: https://github.com/ipfs/specs/pull/431

bajtos commented 1 year ago

Let's continue the discussion in https://github.com/filecoin-project/boost/issues/1610

I am closing this issue as superseded.

filecoin-project / boost

Retrieval Attestation #1597
