celestiaorg / celestia-node

Celestia Data Availability Nodes
Apache License 2.0
895 stars 872 forks source link

bug: blob.getAll returns ERROR when no blob is found, but is not error #3292

Open jcstein opened 1 month ago

jcstein commented 1 month ago

Celestia Node version

v0.13.2

OS

macOS

Install tools

docs

Others

No response

Steps to reproduce it

make a blob.getAll request from a height and namespace which has no blobs, the CLI, for example, returns:

$ celestia blob get-all 42069 0x426942694269426942 --node.store $HOME/.celestia-light-mocha-4
{
  "result": "getting blobs for namespace(0000000000000000000000000000000000000000426942694269426942): blob: not found\nblob: not found"
}

Expected result

from @distractedm1nd: blob.GetAll should not return an error when there is no data at that height under that namespace. Its not an error. It is just empty data

Actual result

The logs for light node show an ERROR, along with WARN and INFO logs for the request:

Relevant log output

2024-04-03T14:16:05.630-0400    INFO    blob    blob/service.go:261 requesting blob {"height": 42069, "namespace": "0000000000000000000000000000000000000000426942694269426942"}
2024-04-03T14:16:05.630-0400    WARN    rpc go-jsonrpc@v0.3.1/handler.go:444    error in RPC call to 'blob.GetAll': getting blobs for namespace(0000000000000000000000000000000000000000426942694269426942): blob: not found
blob: not found
2024-04-03T14:16:05.630-0400    ERROR   rpc go-jsonrpc@v0.3.1/handler.go:484    error and res returned  {"request": {"jsonrpc":"2.0","id":1,"method":"blob.GetAll","params":[42069,["AAAAAAAAAAAAAAAAAAAAAAAAAABCaUJpQmlCaUI="]],"meta":{"SpanContext":"AAAWWTEZ3qbnKkPLLH7zpAsLAUblHeR8KD4/AgA="}}, "r.err": "getting blobs for namespace(0000000000000000000000000000000000000000426942694269426942): blob: not found\nblob: not found", "res": []}

Notes

No response

vgonkivs commented 1 month ago

Duplicate https://github.com/celestiaorg/celestia-node/issues/3192.

We had an internal discussion and agreed that returning both empty value and error is not a go idiomatic way, bc the issue is mainly motivated by Rust and based on the Rust semantics. As our implementation was done using golang, we will stick to the rules that have been defined. Any short-term workarounds can lead to more issues and misunderstandings.

joroshiba commented 1 month ago

Issue has nothing to do with any programming language, it's an API semantics issue.

Semantically the correct response requesting an object that does not exist should be either a success empty response (this is common for search style requests), or more commonly an error code which indicates explicitly not found. Ie an HTTP 404 or a gRpc NOT_FOUND 4, unfortunately JSON RPC offers no such error code.

JSON RPC provides very few error codes, although Celestia could choose to define sets of internal server errors that could define this, although I'd argue this isn't an internal error (as indicated by not being a 5XX HTTP code in a restful API). You could argue this is a -32007 Invalid Params instance, but the parameters aren't invalid the data doesn't exist. An instance of invalid params could be requesting blob data at a height which doesn't exist yet. That parameter makes no sense. I'm doing a search on a block that exists for a set of data and it came up empty. Given no valid error code an error should not be returned to the user.

Especially in the instance of GetAll if any of the namespaces I request data from return data it's not malformed, the parameters should not be considered invalid. If GetAll were to work this way it should consider ANY missing data an invalid request. That's not particularly useful. I can see an argument in a simple Get request that a namespace not existing in the block is no different than the block not existing. Again there is an information asymmetry problem though. I get an error, why?

The core problem here is that the error messages currently must be relied upon, making any changes to the string messages is a breaking change. Error codes, when adequate, provide a much more robust and standardized way to handle common errors. Unfortunately JSON RPC is a substandard spec in this regard. This could be solved by utilizing the data field on errors to provide structured information. This structure could be standardized across node API errors and have codes used to augment the lacking JSON RPC error codes. This is also commonly done in REST APIs where you can pass an internal error code which provides user more information through the data field. Generally: Message fields on API errors are intended to be human readable information, and they are being relied upon to pattern match to specific. I consider this a strong API anti-pattern, and it's one you are forcing all users of the API to make.

The design of the API isn't about semantically correct Go, it's about users being able to build robust applications on top of it in whatever language they choose.

vgonkivs commented 1 month ago

I don't see any anti-pattern here. ErrBlobNotFound indicates that the blob under the requested namespace and height was not found(because there are no blobs under this namespace in the requested header). Yes, JSON RPC has weak support for the status codes, but the difference between 404 and ErrBlobNotFound is negligible, purely semantical and non-functional.

joroshiba commented 1 month ago

Is ErrBlobNotFound returned via the API any way other than a string message? If this is passed via some code on the error data field I'd be happy with that.

Looking through the codebase it doesn't seem to be ErrBlobNotFound is an internal variable, with a fixed string message.

Perhaps this is a broader issue: the problem is there are no documented errors, there's no way to know them without scouring the code base. In order to parse as a consumer I must do string matching. JSON RPC provides a mechanism via data on errors to provide structural useful errors which can be documented for API consumers. This is not utilized by celestia-node. My immediate issue would be solved by GetAll not returning an error, but the actual problem is that errors are undocumented, and require string parsing because they are inadequately coded and speced.

Wondertan commented 1 month ago

I agree that's a documentation issue in the first place. If we did a better job documenting this and other errors, you would not have been so frustrated with this I believe.

SuperFluffy commented 1 month ago

What is Celestia-Node's recommendation for client handling of blobs not being found? Is error code 1 only used for blobs not found? (IIRC no, the error code is used for other errors)

Should clients match on the error message? Will Celestia-Node provide stability guarantees for the blob-not-found error message?

Extending on this: if a client issues a single request to fetch blobs under 2 namespaces foo and bar, but only foo contains blobs, should this return an error or a successful result? Both design choices are in principle valid, but IMO this too should be formalized.

jcstein commented 1 month ago

I still do not think an empty value should return an error, as it is not an error. There is just no share at that namespace and that height.

SuperFluffy commented 1 month ago

I actually ran into this scenario:

  1. blob.GetAll(namespace: [foo]) => 1 blob,
  2. blob.GetAll(namespace: [bar]) => Error(blob: not found),
  3. blob.GetAll(namespace: [foo, bar]) => Error(blob: not found).

The Celestia Node RPC docs state the following (https://node-rpc-docs.celestia.org/?version=v0.13.3#blob.GetAll):

GetAll returns all blobs at the given height under the given namespaces.

This is a bit confusing.

jcstein commented 1 month ago

So @SuperFluffy in this case, at least foo's blob should be returned, and it appears that no blob for bar results in an error when querying [foo, bar]?

vgonkivs commented 1 month ago

After thinking about this problem for some time, I think it makes sense to return both empty values in GetAll request. Motivation:

Before opening any PR, let's agree on all cases that may happen during GetAll request: 1) Positive case: blobs were found under requested namespaces -> ([]blob.Blob, nil) 2) Negative case: blobs were not found under requested namespaces -> ([]blob.Blob{}, nil) 3) Mixed case: a) a part of the blobs was found during GetAll request within multiple namespaces -> ([]blob.Blob, nil) b) internal errors happen for some namespaces -> ([]blob.Blob, err*), where err contains the err itself + namespace.

renaynay commented 4 weeks ago

a part of the blobs was found during GetAll request within multiple namespaces -> ([]blob.Blob, nil)

I think this part needs to be re-thought as well. IMO, it was a mistake to allow GetAll to take multiple namespaces. If we are already going to "break" this method, I would prefer to break this method entirely and limit it to one namespace so we don't have to handle this ugly case.

Regarding the error itself, I still support returning ErrNotFound in the case there are no blobs found under that namespace. The caller should know whether it expects blobs to exist at that height or not.

Callers that do not know whether blobs exist under that namespace (e.g. rollup full nodes that may be looking for txs censored by sequencer) would still expect that err for every height for which there should not be blobs for the namespace. It's cleaner than checking that both return values are nil.

ErrNotFound should just mean "made a successful network request and the blob / namespace does not exist for that block"

SuperFluffy commented 4 weeks ago

Maybe the following Google guidelines for API design could be an inspiration:

  1. https://google.aip.dev/132 (list methods)
  2. https://google.aip.dev/231 (batch methods)

Specifically the last point under guidance in AIP 231 is interesting here:

  • The operation must be atomic: it must fail for all resources or succeed for all resources (no partial success). For situations requiring partial failures, List (AIP-132) methods should be used.
    • If the operation covers multiple locations and at least one location is down, the operation must fail.

So blob.GetAll could either be batch-like (blobs under all desired namespaces must be present, error otherwise), or list-like (blobs could be present for some namespaces and absent for others; errors are ignored, present blobs are returned).

Right now blob.GetAll is batch-like, which as I said before is in principle acceptable.

I think most crucial is not necessarily evolving the API (and introducing breaking changes), but to sharpen the docs and give examples that document expected behavior. Right now https://node-rpc-docs.celestia.org/?version=v0.13.3#blob.GetAll only documents the simplest happy case: blob exists for the given namespace. But it should probably also mention a) the error if the blob does not exist, b) the happy case for 2 namespaces, c) the case where two namespaces are queried but one namespace has resources while the other doesn't.

Maybe the reported error could also have more information, noting which namespaces failed/had no blobs.

Wondertan commented 4 weeks ago

I agree that returning an error is a semantically correct approach in Golang. The most common example in stdlib would be io.EOF error. Returning both values as nil is an anti-practice in Go world.

Our API is designed with Golang semantics. Celestia-node clients in other languages that want an idiomatic feel would have to abstract Golang flavorings away or use different protocol implementations in the language of interest, like lumina for Rust. I would argue that other protocol implementations should similarly optimize APIs for their language of choice, like celestia-node is doing for Go.

Our duty as celestia-node team though is to document our API well, which is not the case and is the root cause here.

Wondertan commented 4 weeks ago

I don't have a strong opinion on GetAll batching, and I think that's a tangential problem discussed here that should be discussed elsewhere. Also, I agree with @SuperFluffy that both are acceptable.

Wondertan commented 4 weeks ago

@SuperFluffy, @joroshiba, would improving the API documentation be an acceptable solution for the Astria team?

SuperFluffy commented 4 weeks ago

@Wondertan better docs are definitely desirable.

Even better would be a dedicated (and documented, stable, supported) JSONRPC error code for blob: not found and potentially a structured error in the JSONRPC data field.

The specific semantics of your implementation are orthogonal to this at the end of the day. If I were to write a batch-like API in Rust I would certainly return a Result::Err variant to short circuit the fetch.

On the other hand, if I write a celestia-node client using Golang, how should I match on the error? The JSONRPC spec that your API intends to follow has error codes for reporting what exactly failed (well, that and the data field). These API specs exist precisely because they are supposed to transcend the semantics of the particular language you are using.

distractedm1nd commented 4 weeks ago

I don't know why our API should be strictly adhering to golang semantics. If people want to use the golang errors, just import node and call the methods on the services directly, there is no need for an RPC in that case.

Wondertan commented 4 weeks ago

On the other hand, if I write a celestia-node client using Golang, how should I match on the error?

There are two options:

Our Golang API users are also expected to import the error to handle such a case. If this case was well documented, it would have been straightforward to use and implemented new clients.

SuperFluffy commented 4 weeks ago

Where does the JSON-RPC API define this error?

I'm a bit confused as to which API you're talking about.

Celestia-Node officially documents a jsonrpc API, which is the one we are interacting with.

I'm asking for JSONRPC semantics, while you keep referring to your go library. Is the Celestia-Node go library the only way one is supposed to interact with Celestia-Node?

Also, is Celestia-Node commiting to a stable message field then?

liamsi commented 4 weeks ago

This is issue is really painful to read.

@Wondertan @vgonkivs Golang semantics have little to do with JSON-RPC APIs.

Please, help the team with either 1) how to handle the error properly (if string matching that means committing to not changing that err string and also making it part of the docs; ideally with an example) or 2) return an empty result instead of an error here. The latter could even happen at the JSON-RPC code layer (if err == ErrBlobNotFound return empty slice instead of passing the error to that layer/the resulting json).

Thank you @joroshiba for the analysis here and the general feedback @SuperFluffy @jcstein and @distractedm1nd.

Also this is really confusing and deserves its own issue. Can someone open that please?

Wondertan commented 4 weeks ago

@SuperFluffy

Where does the JSON-RPC API define this error?

It doesn't, and that's the API documentation gap I'm talking about.

I'm a bit confused as to which API you're talking about.

I mean the API of the BlobModule, with this error defined next to the interface.

I'm asking for JSONRPC semantics, while you keep referring to your go library. Is the Celestia-Node go library the only way one is supposed to interact with Celestia-Node?

Any API fundamentally has to adhere to some semantics consistently. The JSON RPC is very minimal and doesn't have any. You can apply anything over it. You can even pass protobufs over it if you really want to. When you talk about JSON RPC, you have to have some additional structure over it, and this is why there is a connection with Golang and there is something to do with it.

Also, is Celestia-Node commiting to a stable message field then?

We are not v1, but there is no reason for this error message to be changed.

Wondertan commented 4 weeks ago

Golang semantics have little to do with JSON-RPC APIs.

Unfortunately, they are. That's the nature of JSON-RPC. It's so minimal and unopinionated that you have to have some additional structure over it for things like error handling in this case.

Wondertan commented 4 weeks ago

The decision to adhere to Golang semantics can be changed though. Back in the day, we saw Golang users like Rolkit as our primary users, and we were writing Golang, so it's easy to stick to it.

We can revisit this decision over Interfaces WG and discuss alternative formats to run over JSON RPC or move to a different API altogether.

Wondertan commented 4 weeks ago

@joroshiba, @SuperFluffy, we just discussed internally the notion of Canonical API. The CIP standardized API that every node/consensus client/node has to support. That API would be defined using IDL with strong cross-language support, like proto3/gRPC. This layer on top will hide the internals of the native celestia-node API like we are facing in this issue.

We would love your thoughts on this idea and any design requests from the beginning. Let's design a better way of interfacing with celestia together.

Wondertan commented 4 weeks ago

Besides, I got feedback that my communication here was somewhat dismissive, and I want to apologize for that. Reading through the threads, I think I can get a sense of it. Naturally, I am not the best person when it comes to empathy, but I do appreciate it when people deeply care about technology, and you are the perfect example of what caring is, even about such small low-level issues.

I hope that the long-term solution above will address your concern by addressing the cross-language barrier.

joroshiba commented 4 weeks ago

Appreciate the understanding and excited to move forward with improving the API. Related to this wanted to respond to this idea, because I think it's important as we rework the API to consider what the user journeys are, and how we want the node to be utilized. The API is a product, the APIs we build fill usecases, but also will push people towards specific design decisions. Even if technically workable, the choice to return an error implies the user is doing something wrong.

As pointed out by @renaynay the choice to return an error here suggests that users should know what heights data are posted at.

Regarding the error itself, I still support returning ErrNotFound in the case there are no blobs found under that namespace. The caller should know whether it expects blobs to exist at that height or not.

In this example: I don't think the caller should have to know what heights it expects blobs to exist at. If I have adequate DA to know what height my data got posted at, why not just use that DA to give me the actual data? Instead I can follow and check "do I have data" at each block height instead of relying on alternative DA to tell us where the data is. It seems that is not an intended use case of the current API, but it seems to me as though this is how people should use DA. Things to consider as we move towards the next iteration.

Wondertan commented 4 weeks ago

Even if technically workable, the choice to return an error implies the user is doing something wrong.

I think this is exactly where we are desynchronized. In Golang world it's its normal to use errors as a form of signalling to the user. The simplest example would be io.EOF. It's returned every time you finish reading a file and it's not an error per se.

But that's a Golang quirk and having it across language barriers isn't optimal. The new Canonical API doesn't have to be consistent with Golang semantics and thus won't have this quirk.

ramin commented 2 weeks ago

I think this issue has meandered quite far from the original bug report which is making it hard make the issue actionable. I'd like to propose that

a) we collectively acknowledge that the current JSON-RPC api has been caught lacking and thus does not provide an adequate or pleasant developer experience b) the work that has been kickstarted by @liamsi and to be spoken about in the upcoming interfaces working group meeting to both define a canonical API and trigger a rework is the idealized resolution

Then we can draw a conclusion under some of the debate.

Subsequently, as the canonical API work will take time, I'd like to propose we close this issue, considering proper (it not perfect) action being to take @SuperFluffy's suggestion to more properly document the error strings (and potentially formalize them as errors as proposed in https://github.com/rollkit/go-da/issues/65) as the action as tracked in @liamsi's tracking issue

does this ok @jcstein @liamsi @joroshiba @SuperFluffy

distractedm1nd commented 2 weeks ago

I want to butt in before this error is closed, and say that I find it extremely unintuitive that an error is returned when no blobs are found on the passed height.

There are use-cases where you do not know in advance if data is going to be available at a height or not. The transparency dictionary rollup Sebastian is building is an example of this, and it's one of the first things being built natively on top of Celestia without a rollup framework. There will be plenty more. The reason this error is infuriating in particular is because instead of just noop-ing on a loop over the returned list (as its empty), you have to go into your error matching statement, just to string compare and then throw the error away. I don't know if this intuition comes across automatically or not, but it becomes quickly apparent when you use it for the first time. It just reads really badly.

And this isn't only an issue in Rust of course -- while writing code snippets for the golang client library tutorial today, there is this simple example that also becomes cumbersome:

func SubscribeHeaders(ctx context.Context, url string, token string) error {
    client, err := client.NewClient(ctx, url, token)
    if err != nil {
        return err
    }

    // create a namespace to filter blobs with
    namespace, err := share.NewBlobNamespaceV0([]byte{0xDE, 0xAD, 0xBE, 0xEF})
    if err != nil {
        return err
    }

    // subscribe to new headers using a <-chan *header.ExtendedHeader channel
    headerChan, err := client.Header.Subscribe(ctx)
    if err != nil {
        return err
    }

    for {
        select {
        case header := <-headerChan:
            // fetch all blobs at the height of the new header
            blobs, err := client.Blob.GetAll(context.TODO(), header.Height(), []share.Namespace{namespace})
            if err != nil {
                fmt.Println("Error fetching blobs: %v", err)
            }

            fmt.Println("Found %d blobs at height %d in 0xDEADBEEF namespace", len(blobs), header.Height())
        case <-ctx.Done():
            return
        }
    }
}

I can't actually leave the code like this, because of this issue. I would need to change the if err != nil to if err != nil && err != blob.ErrNotFound, which seems very silly.

In general, it seems quite obvious to me that returning an empty array when no data is found there is just the expected behavior. Given that this is probably the most important endpoint in the whole API, I feel very strongly about this behavior.

SuperFluffy commented 2 weeks ago

@distractedm1nd I believe one way forward would could be to provide for example blob.ListGetAll and blob.BatchGetAll, where the former covers your use case, while blob.BatchGetAll would cover the current functionality.

distractedm1nd commented 2 weeks ago

The point of the blob module was to have a simple, intuitive way to do things - which is why it is very opinionated. I find that splitting the method up defeats that purpose by providing multiple ways to do things inside of the module. And especially for such a trivial difference, I think it would just lead to additional confusion. This is the same reason that blob.Submit is not very parametrizable, and the alternative state.SubmitPayForBlobs exists. To be consistent here, my opinion is that we should pick the most intuitive option, and have the client handle the error explicitly for the non-intuitive case.

Splitting it would mean we have 3 ways to fetch blobs from the network through the RPC. This is a double-edged sword. More users would have the exact method they need without adding any special handling, but then developers also need to learn all 3 and the differences between them to determine the right one -- and it also puts the same burden on the readers of the code. Let's keep it as simple as possible

liamsi commented 2 weeks ago

Agreed that things should be kept simple and not introducing several similarly sounding API endpoints for the same tasks.

I would need to change the if err != nil to if err != nil && err != blob.ErrNotFound, which seems very silly.

This is actually not uncommon in golang (see the EOF case that Hlib brought up several times above). That said, while this is acceptable in go, it is not acceptable for the API anyways. And I do agree that we should change this to return an empty result instead of an error.

a26nine commented 1 week ago

I am blocked because of the same issue.

I am setting up infrastructure for a new OP-stack-based L2, Camp(aign). It uses Celestia as the DA layer. In order for op-node to sync, I need to supply it with a DA RPC, ie, Celestia light node endpoint.

So, I spun up and got mocha-4 light node synced to the tip and passed it to op-node through the da.rpc flag. However, op-node is unable to sync from genesis because it's receiving error from the upstream Celestia light node.

op-node:

May 04 03:17:11 camp-sepolia-ronald-full-ora-iad-a00 op-node[4388]: t=2024-05-04T03:17:11+0000 lvl=warn msg="Derivation pipeline is reset" err="engine stage failed: failed to open data source: reset: celestia: failed to resolve frame: blob: not found"

celestia-node:

2024-05-04T03:18:15.031Z        WARN    rpc     go-jsonrpc@v0.3.1/handler.go:444        error in RPC call to 'da.Get': blob: not found

Is there a way I can fix this issue? If not, I hope returning an empty response instead of an error will help fix this issue.

liamsi commented 1 week ago

I hope returning an empty response instead of an error will help fix this issue.

This is the correct solution here IMO. Agree with @distractedm1nd.