ethersphere / bee

Bee is a Swarm client implemented in Go. It’s the basic building block for the Swarm network: a private, decentralized, and self-sustaining network for permissionless publishing and access to your (application) data.
https://www.ethswarm.org
BSD 3-Clause "New" or "Revised" License

Bee 2.0 is still corrupting chunks on pinned resources #4653

Open tmm360 opened 2 months ago

tmm360 commented 2 months ago

Context

Bee 2.0.0 running in a Linux container

Summary

I'm uploading a lot of data to the node, with pinning and deferred upload enabled. I'm not using erasure coding yet. What I've noticed is that a lot of pinned resources have invalid and/or missing chunks.

I've run both the validate and validate-pin commands; these are the outputs: validation-log.txt address.csv

I've noticed that sometimes bee freezes connections (this is another issue that needs to be investigated), and after some time it starts serving requests again. This means that a failing request could be retried after a timeout, but that shouldn't leave an invalid state.

Expected behavior

No invalid or missing chunks should be left in local storage, even after a failed request (IF that is the cause...)

Actual behavior

Local storage is left with a lot of invalid and missing chunks from pinned content.

Steps to reproduce

I'm simply uploading a lot of data, hundreds of MB per file, with pinning and deferred uploads enabled. Because estimating a correct postage batch size is difficult, as a temporary workaround I'm using dynamic (mutable) postage batches and overestimating the required space. It is still possible that some content fills up a batch anyway. I don't know if this can be the cause, but in any case it shouldn't leave an inconsistent state.
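
For context, the kind of client-side estimate I mean is sketched below. This is my own rough approximation based on how postage buckets work (4096-byte chunks, 2^16 buckets, 2^(depth-16) slots per bucket), not a formula taken from bee itself; the safety factor and the minimum depth are placeholder assumptions.

```go
package main

import (
	"fmt"
	"math"
)

const (
	chunkSize    = 4096 // bytes per chunk
	bucketDepth  = 16   // a batch has 2^16 buckets
	safetyFactor = 4.0  // arbitrary margin for uneven (pseudo-random) bucket filling
)

// estimateBatchDepth returns a conservative batch depth for a payload of the
// given size, by sizing each bucket (2^(depth-16) slots) to hold several
// times the expected chunks per bucket. This is only an approximation.
func estimateBatchDepth(payloadBytes int64) int {
	chunks := float64(payloadBytes) / chunkSize
	perBucket := chunks / math.Exp2(bucketDepth) * safetyFactor
	slots := int(math.Ceil(math.Log2(math.Max(perBucket, 2))))
	depth := bucketDepth + slots
	if depth < 17 {
		depth = 17 // assumed minimum usable batch depth
	}
	return depth
}

func main() {
	fmt.Println(estimateBatchDepth(500 * 1024 * 1024)) // e.g. a 500 MB file
}
```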

Possible solution

Don't know.

ldeffenb commented 2 months ago

I believe 2.0 has a "feature" where even supposedly "immutable" stamps actually get chunks replaced. So if you are overflowing a stamp, the swarm will definitely be dropping chunks. But they SHOULD (IMHO, haven't verified in the source) still be available on the original node if you specified pinning. They just won't be available anywhere else.

ldeffenb commented 2 months ago

Try doing a /stamps/{stampid}/buckets and see if you're actually overflowing any of the buckets.
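
Something like the sketch below can count the full buckets. It assumes the node's API is on localhost:1633 and that the response carries the documented depth, bucketDepth, bucketUpperBound and buckets[].collisions fields; double-check both against your bee version and configuration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type bucket struct {
	BucketID   uint32 `json:"bucketID"`
	Collisions uint32 `json:"collisions"`
}

type bucketsResponse struct {
	Depth            uint8    `json:"depth"`
	BucketDepth      uint8    `json:"bucketDepth"`
	BucketUpperBound uint32   `json:"bucketUpperBound"`
	Buckets          []bucket `json:"buckets"`
}

func main() {
	batchID := "replace-with-your-batch-id" // hex batch ID of the stamp

	// GET /stamps/{batchID}/buckets on the local bee API.
	resp, err := http.Get("http://localhost:1633/stamps/" + batchID + "/buckets")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var br bucketsResponse
	if err := json.NewDecoder(resp.Body).Decode(&br); err != nil {
		log.Fatal(err)
	}

	// A bucket is "full" once its collisions reach the upper bound.
	full := 0
	for _, b := range br.Buckets {
		if b.Collisions >= br.BucketUpperBound {
			full++
		}
	}
	fmt.Printf("depth=%d bucketUpperBound=%d full buckets: %d/%d\n",
		br.Depth, br.BucketUpperBound, full, len(br.Buckets))
}
```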

tmm360 commented 2 months ago

Yes, I've verified, and I have a lot of full buckets

ldeffenb commented 2 months ago

Yes, I've verified, and I have a lot of full buckets

I can't explain why the original pinning node would have missing chunks in the pins, but this would 100% explain why the data may not be retrievable from the swarm. I haven't experimented with overflowing stamps, so I've never chased down the source paths that would deal with that during the original upload/pin activity.

At least, I assume you're using the swarm-pin: true header on the upload and not explicitly pinning the reference after a non-pinned upload?

The former pins on the uploading node immediately. The latter might actually drop chunks from the localstore as soon as they are pushed out, and then try to pull the chunks back from the swarm in order to pin them. If the stamp overflowed during the push, then the chunks may not even be available to pin.

But as I type this, that last bit I described would actually cause the pin to fail, not to have an existing pin with missing chunks in it. Unless, of course, there's another race condition in the stamp overflow code that I haven't noticed.
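
For reference, a minimal sketch of the former style, i.e. pinning at upload time with the Swarm-Pin: true header on /bzz (the host/port and batch ID are placeholders; deferred upload is set explicitly here):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	data := bytes.NewReader([]byte("hello swarm")) // payload to upload

	req, err := http.NewRequest(http.MethodPost, "http://localhost:1633/bzz", data)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/octet-stream")
	req.Header.Set("Swarm-Postage-Batch-Id", "replace-with-your-batch-id")
	req.Header.Set("Swarm-Pin", "true")             // pin on the uploading node
	req.Header.Set("Swarm-Deferred-Upload", "true") // deferred (spooled) upload

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body)) // expects {"reference":"..."} on success
}
```

If pinning worked, the returned reference should then show up under /pins on the same node, as far as I understand the API.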

tmm360 commented 2 months ago

Yes, I know the data might not be retrievable from the network, but this is the first time I'm actually looking into how many buckets are filled. At first we worked around it with pinning, thinking the data would persist, but now we are going to develop new tools to pre-calculate the exact postage batch size client-side, so that buckets don't fill up.

Anyway, buckets aside, data should persist if pinned with the header, and I confirm we are using it directly.

nikipapadatou commented 2 months ago

Looking into this but the team has not been able to reproduce it. Could you provide some more details please? cc @acha-bill

tmm360 commented 2 months ago

I don't know what kind of additional information I could provide. I don't know how to reproduce it programmatically. I only know that new content has been corrupted while running bee 2.0, and that sometimes the node freezes and is unable to provide any kind of response, probably, I suppose, while it digests uploads. After a while, it restores its functionality. Try investigating concurrent requests, maybe the same request being resent within a short time; I really don't know...

istae commented 1 month ago

The upload has to be a "valid" upload for the pinning to work, meaning it uses a batch that exists and has not been over-issued (in the case of an immutable batch). Can you confirm that this is the case?

tmm360 commented 1 month ago

Because this has been a workaround until now, we were using only dynamic postage batches. So pinning should work...

istae commented 1 month ago

I believe 2.0 has a "feature" where even supposedly "immutable" stamps actually get chunks replaced. So if you are overflowing a stamp, the swarm will definitely be dropping chunks. But they SHOULD (IMHO, haven't verified in the source) still be available on the original node if you specified pinning. They just won't be available anywhere else.

What actually happens is that if a previously stamped chunk gets restamped, the stamp is updated with a new timestamp. The idea is that in the case of a batch index collision, the reserves should simply store the chunk with the newer timestamp (regardless of batch type). You can see this in effect here: https://github.com/ethersphere/bee/blob/master/pkg/postage/stamper.go#L51

immutable batches do not replace chunks and do not allow for overflowing.
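
To illustrate the collision handling described above, here is a deliberately simplified sketch; the names and structure are hypothetical and not the actual pkg/postage implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrBucketFull = errors.New("bucket full")

type issuer struct {
	immutable        bool
	bucketUpperBound uint32
	buckets          map[uint32]uint32 // bucketID -> slots already issued
}

// stamp assigns an (index, timestamp) pair within the given bucket.
func (i *issuer) stamp(bucketID uint32) (index uint32, ts int64, err error) {
	used := i.buckets[bucketID]
	if used >= i.bucketUpperBound {
		if i.immutable {
			// Immutable batches refuse to over-issue: the upload fails here.
			return 0, 0, ErrBucketFull
		}
		// Mutable batches wrap around and reuse an existing index; the new
		// stamp carries a newer timestamp, so a reserve that sees both stamps
		// keeps the newer chunk and the older one is effectively replaced.
		index = used % i.bucketUpperBound
	} else {
		index = used
	}
	i.buckets[bucketID] = used + 1
	return index, time.Now().UnixNano(), nil
}

func main() {
	iss := &issuer{immutable: false, bucketUpperBound: 2, buckets: map[uint32]uint32{}}
	for n := 0; n < 3; n++ {
		idx, ts, err := iss.stamp(7)
		fmt.Println(idx, ts, err) // the third stamp reuses index 0 with a newer timestamp
	}
}
```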

istae commented 1 month ago

Because this has been a workaround until now, we were using only dynamic postage batches. So pinning should work...

We have some major improvements coming up in 2.1; let's see if you can reproduce this issue with it.

ldeffenb commented 1 month ago

immutable batches do not replace chunks and do not allow for overflowing.

Well, they're not SUPPOSED to replace chunks, but if an uploading node undergoes a db nuke, then all local knowledge of what buckets and indices are used on the stamp is lost. But the stamp itself is reloaded from the blockchain fresh and clean.

So, if the stamp is used to upload data in a different sequence than was originally done, chunks may be stamped with different indices in the same buckets as they originally were. When these newly uploaded and stamped chunks are pushed to their target neighborhood, the nodes there will replace existing chunks at that index with the new chunk, effectively evicting the original chunk.

I believe this is why there are so many missing data chunks in the OSM dataset since "the great expiration eviction" caused by the contract update.

I am actively working on an approach to re-stamp and re-push all available OSM tile chunks in all previous versions, but am going to hold off on doing this on mainnet swarm until 2.1.* is deployed and hopefully a bunch of corrupt reserves are cleaned up across the swarm.

istae commented 1 month ago

db nuke does not delete the stamperstore dir, which stores the stamp data.

btw, any good news regarding this after the 2.1 release?

tmm360 commented 1 month ago

I still haven't tested with new uploads @istae. I will check the local storage health status after the next big upload we perform.