filecoin-project / boost

Boost is a tool for Filecoin storage providers to manage data storage and retrievals on Filecoin.

QueryAsk v2 Protocol #666

Open hannahhoward opened 2 years ago

hannahhoward commented 2 years ago

Goals

Enable discoverability of HTTP retrieval support

The primary changes for the v2 protocol are:

Query:

Response:

For now, the newly designed protocol makes all additional information protocol-specific.

How

The schema for the QueryAsk v1 request / response (in IPLD schema -- it's encoded as DAG CBOR) is as follows:

# QueryParams - V1 - indicate what specific information about a piece that a retrieval
# client is interested in, as well as specific parameters the client is seeking
# for the retrieval deal
type QueryParams struct {
    PieceCID nullable Link  # optional, query if miner has this cid in this piece. some miners may not be able to respond.
}

# Query is a query to a given provider to determine information about a piece
# they may have available for retrieval
type Query struct {
    PayloadCID  Link 
    QueryParams QueryParams
}

# QueryResponseStatus indicates whether a queried piece is available
type QueryResponseStatus enum {
    # QueryResponseAvailable indicates a provider has a piece and is prepared to
    # return it
    | QueryResponseAvailable ("0")

    # QueryResponseUnavailable indicates a provider either does not have or cannot
    # serve the queried piece to the client
    | QueryResponseUnavailable ("1")

    # QueryResponseError indicates something went wrong generating a query response
    | QueryResponseError ("2")
} representation int

# QueryItemStatus (V1) indicates whether the requested part of a piece (payload or selector)
# is available for retrieval
type QueryItemStatus enum {
    # QueryItemAvailable indicates requested part of the piece is available to be
    # served
    | QueryItemAvailable ("0")

    # QueryItemUnavailable indicates the piece either does not contain the requested
    # item or it cannot be served
    | QueryItemUnavailable ("1")

    # QueryItemUnknown indicates the provider cannot determine if the given item
    # is part of the requested piece (for example, if the piece is sealed and the
    # miner does not maintain a payload CID index)
    | QueryItemUnknown ("2")
} representation int

type Address Bytes
type TokenAmount Bytes

# QueryResponse is a miner's response to a given retrieval query
type QueryResponse struct {
    Status        QueryResponseStatus
    PieceCIDFound QueryItemStatus # V1 - if a PieceCID was requested, the result

    Size Int

    PaymentAddress             Address # address.Address to send funds to -- may be different than miner addr
    MinPricePerByte            TokenAmount
    MaxPaymentInterval         Int
    MaxPaymentIntervalIncrease Int
    Message                    String
    UnsealPrice                TokenAmount
}

The proposed v2 schema is:


# QueryKind specifies whether you are interested in a payload or a piece
type QueryKind enum {
    | Piece ("pc")
    | Payload ("pl")
} representation string

# Query is a query to a given provider to determine information about a piece
# they may have available for retrieval
type Query struct {
    Kind QueryKind
    # If kind is Payload, response will error if not included
    PayloadCID optional Link
    # If kind is Piece, response will error if not included
    PieceCID optional Link 
}

# QueryResponseStatus indicates whether a queried piece is available
type QueryResponseStatus enum {
    # QueryResponseAvailable indicates a provider has a piece and is prepared to
    # return it
    | QueryResponseAvailable ("0")

    # QueryResponseUnavailable indicates a provider either does not have or cannot
    # serve the queried piece to the client
    | QueryResponseUnavailable ("1")

    # QueryResponseError indicates something went wrong generating a query response
    | QueryResponseError ("2")
} representation int

# QueryItemStatus (V1) indicates whether the requested part of a piece (payload or selector)
# is available for retrieval
type QueryItemStatus enum {
    # QueryItemAvailable indicates requested part of the piece is available to be
    # served
    | QueryItemAvailable ("0")

    # QueryItemUnavailable indicates the piece either does not contain the requested
    # item or it cannot be served
    | QueryItemUnavailable ("1")

    # QueryItemUnknown indicates the provider cannot determine if the given item
    # is part of the requested piece (for example, if the piece is sealed and the
    # miner does not maintain a payload CID index)
    | QueryItemUnknown ("2")
} representation int

type Address Bytes
type TokenAmount Bytes

type ProtocolName enum {
   | GraphsyncFilecoinV1 ("graphsync/v1")
   | HTTPFilecoinV1 ("http/v1")
} representation string

# QueryResponse is a miner's response to a given retrieval query
type QueryResponse struct {
    Status        QueryResponseStatus
    Protocols {ProtocolName:Any}
}

# Structure of Graphsync Protocol Data
type GraphsyncFilecoinV1Response struct {
    Size Int
    PaymentAddress             Address # address.Address to send funds to -- may be different than miner addr
    MinPricePerByte            TokenAmount
    MaxPaymentInterval         Int
    MaxPaymentIntervalIncrease Int
    Message                    String
    UnsealPrice                TokenAmount
}

# Structure of HTTP V1 Response
type HTTPFilecoinV1Response struct {
      URL String
      Size Int
}
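
As a rough illustration of how a client might work with the v2 schema, the sketch below checks a Query against the rule stated in the schema comments (the CID that matches Kind must be present) and picks a retrieval protocol from the Protocols map of a response. It uses plain Python dicts in place of DAG-CBOR encoding. The helper names `validate_query` and `pick_protocol`, the preference order, and all sample values are assumptions made for illustration and are not part of the proposal.

```python
from typing import Any, Dict, Optional, Tuple

# String and int representations taken from the proposed v2 schema.
KIND_PIECE = "pc"
KIND_PAYLOAD = "pl"
STATUS_AVAILABLE = 0  # QueryResponseAvailable


def validate_query(query: Dict[str, Any]) -> Optional[str]:
    """Return an error message if the query breaks the schema rules, else None."""
    kind = query.get("Kind")
    if kind == KIND_PAYLOAD and query.get("PayloadCID") is None:
        return "Kind is Payload but PayloadCID is missing"
    if kind == KIND_PIECE and query.get("PieceCID") is None:
        return "Kind is Piece but PieceCID is missing"
    if kind not in (KIND_PIECE, KIND_PAYLOAD):
        return f"unknown Kind: {kind!r}"
    return None


def pick_protocol(
    response: Dict[str, Any],
    preferred: Tuple[str, ...] = ("http/v1", "graphsync/v1"),  # assumed order
) -> Optional[Tuple[str, Any]]:
    """Return (name, data) for the first preferred protocol offered, or None."""
    if response.get("Status") != STATUS_AVAILABLE:
        return None
    protocols = response.get("Protocols", {})
    for name in preferred:
        if name in protocols:
            return name, protocols[name]
    return None


# Placeholder values; a real query would carry CID links, not strings.
query = {"Kind": KIND_PIECE, "PieceCID": "example-piece-cid"}
response = {
    "Status": STATUS_AVAILABLE,
    "Protocols": {"http/v1": {"URL": "https://sp.example/piece", "Size": 1024}},
}

print(validate_query(query))
print(pick_protocol(response))
```

Because the per-protocol data sits behind the Protocols map, a client that does not recognize a protocol name can skip that entry without failing to decode the rest of the response.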

jacobheun commented 2 years ago

Overall this looks reasonable. I don't think we need QueryKind though, we can infer this from the Params:

In your v2 proposal you removed the usage of QueryItemStatus but not its declaration. Is that intentional? It does seem unnecessary to have, but I don't know what the V1 intent was for distinguishing QueryItemStatus from QueryResponseStatus. They seem redundant.

ribasushi commented 1 year ago

If only PieceCid is provided, provide a response for the piece retrieval with available options. (http only)

Just want to flag not to hardcode this http only limitation as part of the protocol itself. A piece is itself a tree-hashed payload, and there could very well be a near future where the piece blob itself is retrievable over graphsync/bitswap.

cc @mikeal @rvagg

jacobheun commented 1 year ago

Just want to flag not to hardcode this http only limitation as part of the protocol itself.

Totally agree, the parentheses were meant to capture the current state of support, not future support. We should return all available options.

s0nik42 commented 1 year ago

I have a general concern about query-ask and the retrieval/storage success rate. It's a bit wider than the scope of this issue, but I think it's worth discussing here.

Storage Deals

In V1, a client looking to store data (take Estuary as an example) does the following:

  1. get-ask
  2. Send a storage-proposal accordingly

Only at that point is the deal filter triggered. The deal can be refused based on many conditions (pipeline is full, price, size, blacklisted addresses). The ability for an SP to manage per-client pricing, deal acceptance conditions, and deal flow is what we provide with CIDgravity.

This asymmetry drastically reduces the success rate for clients and the reputation of SPs.

Retrieval Deals

  1. Same concern as for Storage Deals
    • When a retrieval proposal is priced at 0, the proposal is not signed by any Filecoin address. That is fine for public data, but for private data, giving clients the ability to sign any proposal would allow authentication on retrieval, which is especially important for enterprise business (like the Shoah Foundation).

Putting all of this together:

  1. I think the current deal-filter or a get-ask filter should be triggered during the get-ask with more parameters (the same parameters as the storage-proposal). Doing so should greatly increase the success rate of deal proposals, giving more satisfaction to all participants (SPs and clients). This point was already discussed with the arg team about a year ago; @brendalee knows more about it.

  2. Clients should be able to sign a retrievalProposal to authenticate it, so that the SP can apply the correct pricing, dealmaking conditions, and access control. Today the only way to do that is based on peerID, and it's really not reliable.

hannahhoward commented 1 year ago

@s0nik42

Thanks for these awesome suggestions.

The biggest software challenge with running the filters in the query phase currently is that the interface exposed to run the deal filter, at least at the level of the markets software, takes an actual deal proposal. So we'd essentially have to synthesize one to run the filters, or create a new DealFilterParams type struct to capture all the levers a provider might want to apply when deciding whether to take a deal. We'd probably also need to add some parameters to the ask protocol (on the storage side at least) to synthesize an accurate representation of what the deal is likely to look like (for example, whether it is an offline deal, which is not yet obvious in the current setup). When @dirkmc is back he can probably think through this in more depth.
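
To make the DealFilterParams idea a little more concrete, here is one possible shape for such a struct, plus a filter that could run on it during the query phase. It is a minimal sketch: the field names, `ProviderPolicy`, `run_filter`, and the specific checks are all assumptions drawn from the conditions mentioned earlier in this thread (price, size, blocked addresses, offline deals), not the existing markets interface.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class DealFilterParams:
    """Hypothetical inputs a provider could filter on before a proposal exists."""
    client_address: str
    piece_size: int        # bytes
    price_per_epoch: int   # attoFIL
    is_offline: bool


@dataclass(frozen=True)
class ProviderPolicy:
    """Hypothetical provider-side settings the filter checks against."""
    min_price_per_epoch: int
    max_piece_size: int
    blocked_clients: FrozenSet[str] = frozenset()
    accept_offline: bool = True


def run_filter(params: DealFilterParams, policy: ProviderPolicy) -> Tuple[bool, str]:
    """Return (accepted, reason); reason is empty when the deal would be accepted."""
    if params.client_address in policy.blocked_clients:
        return False, "client address is blocked"
    if params.piece_size > policy.max_piece_size:
        return False, "piece too large"
    if params.price_per_epoch < policy.min_price_per_epoch:
        return False, "price below minimum"
    if params.is_offline and not policy.accept_offline:
        return False, "offline deals not accepted"
    return True, ""


# Sample values only; addresses and prices are made up.
policy = ProviderPolicy(
    min_price_per_epoch=10,
    max_piece_size=32 * 1024**3,
    blocked_clients=frozenset({"f1blocked"}),
    accept_offline=False,
)
params = DealFilterParams("f1client", 1024**3, 20, is_offline=True)
print(run_filter(params, policy))
```

The same params value could be built either from a real deal proposal or from the extra fields of an ask request, which would let one filter serve both phases.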

I definitely agree on signing retrieval deals, though in the case of retrieval we can keep it optional for the purposes of public data.

For the purpose of query ask v2, I think the forward-thinking feature I can add, to avoid another breaking protocol change, is to support signatures when you send the query.

dirkmc commented 1 year ago

@s0nik42 would it be sufficient for the ask protocol to return a boolean indicating whether it's a public vs private endpoint?

I'm imagining something like:

  1. Client sends query ask to several SPs
  2. SPs respond with ask (including boolean indicating public / private)
  3. Client sends data request to SP that is public and meets client's price conditions
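
The flow above could look roughly like this on the client side. This is a sketch under assumptions: the `AskResponse` shape, the `is_public` field (standing in for the proposed boolean), and `choose_provider` are hypothetical names, the responses are hard-coded rather than fetched over the network, and picking the cheapest eligible SP is just one possible policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AskResponse:
    provider: str            # SP identifier (sample values below)
    is_public: bool          # the proposed public/private boolean
    min_price_per_byte: int  # attoFIL


def choose_provider(
    responses: List[AskResponse], max_price_per_byte: int
) -> Optional[AskResponse]:
    """Pick the cheapest public SP whose price is within the client's limit."""
    candidates = [
        r for r in responses
        if r.is_public and r.min_price_per_byte <= max_price_per_byte
    ]
    return min(candidates, key=lambda r: r.min_price_per_byte, default=None)


# Steps 1 and 2: asks collected from several SPs (made-up values).
responses = [
    AskResponse("f01000", is_public=False, min_price_per_byte=0),
    AskResponse("f02000", is_public=True, min_price_per_byte=5),
    AskResponse("f03000", is_public=True, min_price_per_byte=2),
]

# Step 3: the data request would go to the chosen SP.
chosen = choose_provider(responses, max_price_per_byte=3)
print(chosen.provider if chosen else "no suitable provider")
```

Note that the private SP is skipped even though it is the cheapest, since a client without credentials would expect its request to be refused there.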

nicobao commented 1 year ago

I think we should consider using the concept of W3C DID (https://www.w3.org/TR/did-core/) and UCAN (https://ucan.xyz/) for auth and access control for private data.

Let me explain the use case through a user story (example usage of DID/UCAN is approximate and remains to be defined).

In order for Filecoin to stay censorship resistant, it should always be possible for SPs to accept retrieval deals from unauthorized Clients. However, serving data to unauthorized entities is likely to affect the SP's reputation negatively, depending on context.

Advantages:

Drawbacks:

cc @gobengo @bmann @expede you may be interested in joining this discussion. I may have made mistakes in the way DIDs/UCANs should be used, as I have not yet had the opportunity to experiment with them, so feel free to correct me. I'll also notify Patrick Woodhead who is, I believe, interested in introducing DIDs and Verifiable Credentials for reputation within the Retrieval market.

bmann commented 1 year ago

Thanks @nicobao! Yes, we are proposing to do a proof of concept that would be WNFS (our encrypted file system) end-to-end private data on Filecoin, to open up all private use cases.

At the very least, a standard way to combine DIDs, private keys, and UCANs for access control in such a way that one entity can place data, meant for another entity to retrieve.

Sidetree is not needed, and we will also be working on did:fil (or did:pkh, which is broadly used for any EOA blockchain keys).

Happy to talk more about this -- I think there are folks from DAGHouse cc @mikeal who would be interested in this.

mikeal commented 1 year ago

i personally feel a little ill-equipped to try and “define a Filecoin DID protocol” that uses UCAN, right now.

we’re doing a lot with UCAN right now, and we’re exploring transport protocols w/ it https://purrfect-tracker-45c.notion.site/fast-ptp-368da03e9c91460f9dcb3da080f439d2 and can iterate pretty quickly with one that is in production and servicing a lot of large data reads between large providers.

i want to have that experience before trying to define something this big, that cuts across so many use cases and concerns. i know the “large provider” problem pretty well, and we are still finding better ways to leverage UCANs every week to solve those problems. it’s exciting stuff, but definitely changing fast and we’re still finding best practices.

nicobao commented 1 year ago

@bmann is there a public repo for the proof-of-concept you work on?

I suppose for now we can leave auth outside of Filecoin until it appears clearer how to introduce it.

@s0nik42 would it be sufficient for the ask protocol to return a boolean indicating whether it's a public vs private endpoint?

I'm imagining something like:

1. Client sends query ask to several SPs

2. SPs respond with ask (including boolean indicating public / private)

3. Client sends data request to SP that is public and meets client's price conditions

As access control granularity is all implemented off-chain for now, the boolean you mention is fine to me. @s0nik42 what do you think?

As @s0nik42 said, it would be nice if Clients send their Filecoin address in the retrieval deal proposal.

willscott commented 1 year ago

We have a space for protocol metadata in the indexing announcements https://github.com/filecoin-project/index-provider/blob/main/metadata/graphsync_filecoinv1.ipldsch

It would be great if the indexer presence already implicitly indicates 'public' over 'private', and we can extend those advertisements with price conditions the provider would be willing to offer retrieval at.

if we can do that, then we can avoid the additional negotiation round-trip, and have a client much more likely to be able to go directly to an SP it can be successful in making a retrieval with.

dirkmc commented 1 year ago

@willscott agreed - that was going to be my next suggestion: instead of an ask that returns public/private, just don't advertise private cids

expede commented 1 year ago

UCANs are still work-in-progress and aren't standardized yet

To perhaps clarify, there is a standard at https://github.com/ucan-wg/spec, but we're still releasing new versions every few months.

we are still finding better ways to leverage UCANs every week to solve those problems

I just wanted to echo this as well. Aside from the standardization process, there's some pattern discovery happening in the community right now. We can pull a lot from the eRights and SPKI worlds, but there's lots of interesting experimentation happening.

We're also going to be exploring topics related to UCAN+Filecoin pretty heavily as part of the IPVM working group.

hannahhoward commented 1 year ago

Closing until we find a new time to finish the design.

hannahhoward commented 1 year ago

NM, I'm going to leave it open simply for discussion at a later point, but for now our immediate needs for HTTP retrieval are resolved, so this is now an open design thread.