Writing to fallback storage fails without returning an error

adam-xu-mantle commented 1 week ago

Problem When eigenda-proxy put blob, if handleRedundantWrites fails, it only logs the error and does not return the error to the caller. This could lead to some blobs having an unavailable fallback, but it goes unnoticed.

Proposed Solution Is it possible to directly return the error to the caller when handleRedundantWrites fails? In this way, the caller can ensure that all blobs have fallback storage by retrying.

// Put ... inserts a value into a storage backend based on the commitment mode
func (r *Router) Put(ctx context.Context, cm commitments.CommitmentMode, key, value []byte) ([]byte, error) {
        // ...

    if r.cacheEnabled() || r.fallbackEnabled() {
        err = r.handleRedundantWrites(ctx, commit, value)
        if err != nil {
            log.Error("Failed to write to redundant backends", "err", err)
        }
    }

    return commit, nil
}

samlaf commented 1 week ago

Interesting question. You're basically questioning the semantic of a "transaction" on the proxy, which involves potentially many writes: 1) main store: eigenda or memstore 2) secondary stoers: caches and/or fallbacks

The semantics of a put to proxy is currently:

return success if main store write succeeds, regardless of secondary stores

You'd prefer it to be:

return success iff mainstore AND secondary stores succeed

I think I would like this semantic:

return 200 if all writes succeed
return 201 if main store write succeeds, but some secondary storage write fails (include detailed info in body)
return 500 if main store write fails

Now regarding how to deal with secondary storage write failures, there are 2 alternatives:

client-side driven: we add a new put endpoint to put to secondary storage only, and get client to retry on this endpoint on 201s
server-side driven: when a secondary storage errors, we get some background process to retry until it succeeds (eventual consistency)

Do you have any preference here? What is your intended use case for this feature?

adam-xu-mantle commented 1 week ago

Client-side driven is the preferred option. It is easier to implement, and the current op stack treats non-200 status codes as errors and will retry PUT blob. For partially successful, using 206 status code might be better.

samlaf commented 1 week ago

@adam-xu-mantle client retries on 201/206 would not be ideal right now because we don’t have idempotent requests, so you would be resending the same blob to eigenDA a second time, and paying for it, just to fill the fallback cache.

we are planning a small rearchitecturw to prevent double writing of the same blog but not sure on timeline atm.

adam-xu-mantle commented 6 days ago

@samlaf Can this be a configuration? This way, users who are more concerned about data availability can retry after a 201/206 error.

samlaf commented 6 days ago

@adam-xu-mantle you mean making returning 201/206 instead of 200/500 configurable? I personally think I’d prefer sticking to one protocol. I’m fine with returning 201/206 and letting client decide what they want to do.

@epociask thoughts?

epociask commented 4 days ago

I like this idea but worry that we may be breaking the OP Alt-DA client<-->server spec by returning 201 vs 200: https://github.com/ethereum-optimism/optimism/blob/v1.7.6/op-plasma/daclient.go#L99-L101

we could add configuration but more flags - larger namespace - higher cognitive complexity.

@adam-xu-mantle @samlaf what if we instead just added observability via metrics around secondary storage interactions vs client side logging?

adam-xu-mantle commented 4 days ago

@epociask If secondary store failed is just observed via metrics, it's still difficult to fix the secondary store data.

samlaf commented 4 days ago

@adam-xu-mantle thinking through this some more, to see whether there isn’t another solution that could satisfy you. You’re main worry is eigenDA disperser going down, and then trying to retrieve a blob which wasn’t written to fallback db right?

Would #90 solve your problem?

epociask commented 4 days ago

@adam-xu-mantle why are metrics insufficient? IIUC in the current proposed solution you'd understand a secondary storage failure via grep'ing op-batch/op-node logs which would already require some peripheral wrapping to make the error capture loud.

samlaf commented 3 days ago

@epociask metrics are not meant for this purpose. They are just meant to be parsed and periodically and shipped to some db and then visualized with grafana/datadog/etc. Would be an anti-pattern I feel to have logic depend on the output of a metric endpoint. Plus it's super inefficient beaucoup a /metrics endpoint typically dumps all metrics in a string like pattern, so you'd need to parse it etc. Just doesn't feel like the right tool here.

adam-xu-mantle commented 3 days ago

@samlaf #90 does not resolve this issue. The EigenDA operator node going down could also result in failures to obtain blobs (if an unexpected bug occurs). This highlights the secondary store's value. However, if the EigenDA backend becomes temporarily unavailable and the blobs in the secondary store are not accessible, the actual value of the secondary store is diminished."

samlaf commented 1 day ago

@adam by "The EigenDA operator node going down" you seem to be saying that you are only talking to a single node, which is not the case. This might be a typo, but just in case I'll explain; eigenda works differently from celestia; blobs are encoded with RS encoding, and so you need to read any fraction of the data to be able to reconstruct the original data. You would need the entire operator set to go down (or at least 30%ish of the network depending on security and encoding parameters) to not be able to retrieve from da nodes.

But agree that for max security it's best to also make sure the fallback store is written to.

epociask commented 18 hours ago

The preconditions for this failure scenario are rather unlikely:

EigenDA disperser is unavailable AND
majority operator set is unavailable AND
secondary storage target failed to write to backend due to non-recorded error

If expressed and a blob truly become unavailable then there could certainly be adverse network effects; e.g:

bridge safety: An honest node fails to reference pre-image contents due to data unavailability when assigned a one step proof during a dispute claim dissection.
liveness: nodes syncing from confirmed or final chain views could halt due to inaccessibility of batch contents. If the final-confirmed delta continues expanding over some max time then the sequencer's backlog tx contents could be evicted causing a reorg to any rollup node syncing from an unsafe sequencer head.

If we wanna further quantify it could be fair to say:

likelihood=very low (1)
impact=very high (5)

so somewhere around a medium-low using the 5 x 5 matrix. @samlaf why do you feel like metrics are insufficient? Is it not possible to observe when secondary insertions fail and manually replay them - analogous to web2 kafka where dlqs are managed and drained? This risk feels low enough where counter-measures can optionally be supported via a semi-peripheral flow. It's also worth noting that we are working on optimizing the dispersal latency and have begun to decouple primary write execution from secondary write execution - meaning that capturing a secondary backend write failure via server response likely won't be possible in the future.

samlaf commented 15 hours ago

@ethen you can use metrics for this, but I just don't think it's the right tool.

Re "have begun to decouple primary write execution from secondary write execution", what does this mean? That writing to secondary storage will be done asynchronously? Can we hold off on this, I feel like the request is valid. I think ultimately we should get rid of fallback, and each cache enabled should have a parameter to decide whether to write to it synchronously or asynchronously. Writing synchronously and failing means the request failed, and the client can then retry sending, as adam-xu is asking here.

Layr-Labs / eigenda-proxy

Writing to fallback storage fails without returning an error #173