celestiaorg / celestia-node

Celestia Data Availability Nodes
Apache License 2.0

blob: can't get blobs in arabica #3572

Open vgonkivs opened 1 month ago

vgonkivs commented 1 month ago

It's not possible to retrieve the dah at height 1373869 on arabica-11.

Steps to reproduce:
1) Launch celestia light node on arabica-11: celestia light run --core.ip=consensus.celestia-arabica-11.com --p2p.network=arabica
2) Wait until the node syncs to the needed height.
3) Request blob: celestia blob get 1373869 0x8f8736b6ff9dc08065a6 0x9ede456e70b3c95ec779bd5b434d787911b1911b29629e88c80b5cc52736021b

Expected behavior: The blob's info is printed.

Actual behavior: The request hangs until a timeout is reached (in the case of a CLI request, it hangs forever).

Tested on the latest main.

NOTE: It happens only for this particular height.

vgonkivs commented 1 month ago

After debugging this issue with debug-level logging in both the blob and share services, I've found that the shrex client is constantly getting share.ErrNotFound:

[Screenshot 2024-07-17 at 15:57:25]

I have also found that it works fine if the request goes through the IPLD getter.

Wondertan commented 1 month ago

The other surprising thing/bug is that it works only when we disable shrex, while it should have worked with CascadeGetter switching to IPLDGetter after ShrexGetter has failed.

vgonkivs commented 1 month ago

> The other surprising thing/bug is that it works only when we disable shrex, while it should have worked with CascadeGetter switching to IPLDGetter after ShrexGetter has failed.

It does not switch to the IPLDGetter because there is no timeout in the context, so we are stuck in shrex forever. This is a bug, since we should be able to switch to another getter after N attempts if they are constantly failing. If the user does want to have a timeout in this call, then we should loop over all getters and not just one (the async getter fixes this issue too, btw).

vgonkivs commented 1 month ago

https://github.com/celestiaorg/celestia-node/issues/2501

Wondertan commented 1 month ago

Seems like the bug is that we never actually set the timeout on this operation

vgonkivs commented 1 month ago

Yeah, so setting minTimeout fixes this.

Wondertan commented 1 month ago

OK, we should still keep this or another issue around for ShrexGetter not working as expected.

srene commented 1 month ago

so is there any option i can set in the light client to retrieve the blob, or do i have to wait for a new release that fixes the bug? thanks

Wondertan commented 1 month ago

@srene, do you face this particular issue? Generally, blobs are retrievable.

srene commented 1 month ago

this particular case was reported by me. we are not able to get a specific blob for a rollapp in dymension testnet, and it is causing some issues with getting full nodes to sync.

Wondertan commented 1 month ago

Ah, I see. We can provide a quick fix branch for this issue.

vgonkivs commented 1 month ago

Hello, @srene. Do you need a test branch with the quick fix so you can test it?

srene commented 1 month ago

yes, that will help. thanks

vgonkivs commented 1 week ago

Hello, @srene. I made a fix, and it is part of the v0.15.0 release.