AltDA API: support failover 503 and partial success 201 responses #434

Open samlaf opened 1 month ago

samlaf commented 1 month ago

Is your feature request related to a problem? Please describe.

We are thinking about altda->ethda failover scenarios: if altda is down for whatever reason, the op-batcher could switch back to ethda for some period of time. This issue describes potential approaches to achieving this failover. They would require changes to the altda server API, which currently describes only happy-path (200) responses.

503 error code

Our current thinking is to keep the op-batcher implementation simple and push the decision of when to fail over into the altda-server. Different teams can then experiment with different approaches, potentially specific to their own DA layer. The op-batcher would simply submit its blobs to the altda-server, and if it ever receives a 503 (Service Unavailable), it would fail over and start submitting frames via ethda. There are two potential ways to decide when to retry submitting to altda:

  1. time-based: the 503 response could return a RETRY_AFTER value, and the op-batcher could start resubmitting to altda after that many seconds
  2. channel-based: channels would have a "da_method" state. Upon receipt of a 503, op-batcher would change the state of the channel the frames belong to, to "da_method: ethda", and then fall back to the current auto logic of deciding between blobs and calldata at submission time.

We prefer option 2, as it seems cleanest and would (we think) require minimal changes to the current op-batcher implementation. We will submit a PR shortly, but would appreciate any comments or criticisms that could help improve this approach.
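
To make the channel-based option concrete, here is a minimal sketch in Python (the op-batcher itself is written in Go, so this is only an illustration). The names `Channel`, `DAMethod`, `submit_channel`, `put_altda`, and `submit_ethda` are hypothetical and do not come from the op-batcher codebase. The sketch assumes `put_altda` returns the HTTP status code of the POST to the altda-server.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class DAMethod(Enum):
    ALTDA = "altda"
    ETHDA = "ethda"


@dataclass
class Channel:
    """Hypothetical stand-in for an op-batcher channel."""
    frames: List[bytes]
    da_method: DAMethod = DAMethod.ALTDA  # every new channel starts on altda
    submitted: List[str] = field(default_factory=list)  # where each frame went


def submit_channel(
    channel: Channel,
    put_altda: Callable[[bytes], int],      # assumed: returns the HTTP status code
    submit_ethda: Callable[[bytes], None],  # assumed: existing blobs/calldata path
) -> None:
    for frame in channel.frames:
        if channel.da_method is DAMethod.ALTDA:
            status = put_altda(frame)
            if status == 200:
                channel.submitted.append("altda")
                continue
            if status == 503:
                # Fail over: this frame and the rest of the channel go to ethda.
                channel.da_method = DAMethod.ETHDA
            else:
                raise RuntimeError(f"unexpected altda status {status}")
        submit_ethda(frame)
        channel.submitted.append("ethda")
```

Because the state lives on the channel, no timer is needed: the next channel is created with `da_method` set to altda again, so the batcher probes the altda-server once per channel.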

201 status code

We would also like to support another status code, which is useful for ensuring that fallback storage has been written to. See the discussion here for full details, but to summarize: for maximum assurance, some rollups want to make sure that blobs are written not only to EigenDA but also to a secondary "fallback" storage like S3, such that they have a 100% guarantee that they can retrieve the blob at a later time.

Basically, this means that a POST to /put/ returns 200 if and only if it writes to both main (EigenDA) and secondary/fallback (S3) storage. If S3 is down for whatever reason or the write fails, then we return a 201 to ask the client to retry.
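
From the client's side, the 200/201 distinction could be handled with a bounded resend loop. The following is a minimal sketch under stated assumptions: `put_until_fully_stored`, `post_put`, and `max_attempts` are hypothetical names, and `post_put` is assumed to return the status code together with the already-decoded commitment.

```python
from typing import Callable, Tuple


def put_until_fully_stored(
    post_put: Callable[[bytes], Tuple[int, bytes]],  # assumed: (status, commitment)
    blob: bytes,
    max_attempts: int = 3,
) -> Tuple[bytes, bool]:
    """Resend the same blob until the server reports a full write (200).

    Returns (commitment, fully_stored). fully_stored is False if every
    attempt ended in 201, i.e. only main storage is known to hold the blob.
    """
    commitment = b""
    for _ in range(max_attempts):
        status, body = post_put(blob)
        if status == 200:
            return body, True
        if status == 201:
            # Written to main storage only: remember the commitment and resend.
            commitment = body
            continue
        raise RuntimeError(f"unexpected status {status}")
    return commitment, False
```

Whether a caller should treat `fully_stored == False` as a failure or proceed with the main-storage commitment is a policy decision this sketch leaves open.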

Describe the solution you'd like

We created an OpenAPI spec to precisely describe what we would need: https://app.swaggerhub.com/apis/SAMLAF92/op_altda_server/1.0.0

openapi: 3.0.0
info:
  title: OP AltDA Server API
  version: 1.0.0
  description: API for storing and retrieving preimages with hex-encoded commitments (see https://specs.optimism.io/experimental/alt-da.html for more details)

paths:
  /put:
    post:
      summary: Store a preimage on a blockchain-based DA layer and get a hex-encoded commitment.
      description: >
        Because commitments can include the block height, hash or depend on onchain data, the commitment cannot be computed before the preimage is submitted to the DA layer.
        If using a simple commitment scheme, use the /put/<hex_encoded_commitment> route instead.
      requestBody:
        required: true
        content:
          application/octet-stream:
            schema:
              type: string
              format: binary
      responses:
        '200':
          description: Successful operation - written to both main and (optionally) secondary storages
          content:
            application/octet-stream:
              schema:
                type: string
                format: binary
        '201':
          description: Partially successful operation - written to main storage only. Client should resend the same request to make sure it is successfully written to needed fallback storages.
          content:
            application/json:
              schema:
                type: object
                properties:
                  commitment:
                    type: string
                    description: Hex-encoded commitment
                  status:
                    type: string
                    enum: [partial]
                  message:
                    type: string
        '500':
          $ref: '#/components/responses/InternalServerError'
        '503':
          $ref: '#/components/responses/ServiceUnavailable'

  /put/{hex_encoded_commitment}:
    post:
      summary: Store a preimage with a pre-computed hex-encoded commitment on a content-addressable storage layer like IPFS or any S3-compatible storage
      parameters:
        - in: path
          name: hex_encoded_commitment
          required: true
          schema:
            type: string
          description: Hex-encoded commitment for the preimage
      requestBody:
        required: true
        content:
          application/octet-stream:
            schema:
              type: string
              format: binary
      responses:
        '200':
          description: Successful operation
        '400':
          description: Bad request - if the provided commitment doesn't match the preimage
        '500':
          $ref: '#/components/responses/InternalServerError'
        '503':
          $ref: '#/components/responses/ServiceUnavailable'

  /get/{hex_encoded_commitment}:
    get:
      summary: Retrieve a preimage by its hex-encoded commitment
      parameters:
        - in: path
          name: hex_encoded_commitment
          required: true
          schema:
            type: string
          description: Hex-encoded commitment of the preimage to retrieve
      responses:
        '200':
          description: Successful operation
          content:
            application/octet-stream:
              schema:
                type: string
                format: binary
        '404':
          description: Not found - if the commitment doesn't exist
        '500':
          $ref: '#/components/responses/InternalServerError'

components:
  responses:
    ServiceUnavailable:
      description: >
        Service unavailable. When received, clients should fall back and submit their blobs to Ethereum to be safe.
        They can try resubmitting blobs to altda via this server after <retry_after> seconds, if present.
      content:
        application/json:
          schema:
            type: object
            properties:
              error:
                type: string
              retry_after:
                type: integer
                description: Seconds until client should retry. This field is optional.
            required:
              - error
    InternalServerError:
      description: >
        Internal Server Error. This indicates a problem with the current (proxy) server. 
        The client should consider this request as failed and may retry immediately with the same or a different server instance.
      content:
        application/json:
          schema:
            type: object
            properties:
              error:
                type: string
                description: A message providing more details about the error.
            required:
              - error
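
As an illustration of how a client might act on the `/put` responses described in this spec, here is a minimal Python sketch. `PutOutcome` and `interpret_put_response` are hypothetical names, and the action labels ("done", "resend", "retry", "failover") are assumptions chosen for the example; only the status codes and body shapes come from the spec above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PutOutcome:
    action: str                        # "done", "resend", "retry" or "failover"
    commitment: Optional[str] = None   # hex-encoded commitment, when available
    retry_after: Optional[int] = None  # seconds, only set for 503 if provided


def interpret_put_response(status: int, body: bytes) -> PutOutcome:
    if status == 200:
        # 200 returns the raw commitment as application/octet-stream.
        return PutOutcome("done", commitment=body.hex())
    if status == 201:
        # 201 returns JSON; the commitment is already hex-encoded.
        payload = json.loads(body)
        return PutOutcome("resend", commitment=payload.get("commitment"))
    if status == 500:
        # Problem with this server instance; the request may be retried.
        return PutOutcome("retry")
    if status == 503:
        # retry_after is optional, so it may be absent from the body.
        payload = json.loads(body)
        return PutOutcome("failover", retry_after=payload.get("retry_after"))
    raise ValueError(f"status {status} is not described by the spec")
```

For example, `interpret_put_response(503, b'{"error": "down", "retry_after": 30}')` yields a "failover" outcome with `retry_after` set to 30, while a 503 body without that field leaves it as `None`.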

samlaf commented 4 weeks ago

Thinking this through, we might also add a 400 error code (e.g., if the submitted blob is too large) to let op-batcher know that its settings are most likely wrong.