/chunks/xxxxxxx?targets= causes crash

ldeffenb commented 3 years ago

Summary

Running a v1.0.0 bee node with global-pinning-enable: true. Requested a chunk with targets that the node did not have with:

curl http://192.168.10.185:8080/chunks/5e7132318b6f5482fb02e2ed48c293800f0774470387412fbd3db1a3366736bd?targets=f99db6

this returned the expected:

{"message":"chunk recovery initiated. retry after sometime.","code":202}

At this point, the bee node consumes copious amounts of CPU and then abruptly terminates with:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xe012ab]

goroutine 129928953 [running]:
github.com/ethersphere/bee/pkg/pss.(*pss).Send(0xc0006d70e0, 0x18e53e0, 0xc00003e0a0, 0xb848d6c7d44639f3, 0xd02c4e295a5b9716, 0x776eafbe3b931776, 0xcf20d5397b98a205, 0xc005b29f60, 0x20, 0x160, ...)
        github.com/ethersphere/bee/pkg/pss/pss.go:98 +0x14b
github.com/ethersphere/bee/pkg/recovery.NewCallback.func1(0xc005b29f60, 0x20, 0x160, 0xc109039d00, 0x1, 0x1)
        github.com/ethersphere/bee/pkg/recovery/repair.go:38 +0xd1
created by github.com/ethersphere/bee/pkg/netstore.(*store).Get
        github.com/ethersphere/bee/pkg/netstore/netstore.go:56 +0x2ec

Steps to reproduce

Run a bee node with global-pinning-enable: true and request a /chunks/xxxxxx?targets=yyyyyy where yyyyyy must be an even count of hex digits. Wait for the response and then wait even longer for the crash to occur. I believe the additional delay is caused by pss.Wrap inside pss.Send attempting to target the appropriate neighborhood.

Expected behavior

I would expect the recovery request to be issued successfully via pss.

Actual behavior

The application crashed.

I have tracked it down through the source code and it seems that pss.Send now requires a stamper, but the recovery repair callback specifies null. As soon as the stamper is used in pss.Send, the application terminates.

https://github.com/ethersphere/bee/blob/6f3a38292e5e0fa3cd4856bf7dc9f5ba27c6c293/pkg/pss/pss.go#L98

https://github.com/ethersphere/bee/blob/6f3a38292e5e0fa3cd4856bf7dc9f5ba27c6c293/pkg/recovery/repair.go#L38

https://github.com/ethersphere/bee/blob/6f3a38292e5e0fa3cd4856bf7dc9f5ba27c6c293/pkg/netstore/netstore.go#L57

acud commented 3 years ago

I will look into the segfault, but please keep in mind that targets specifies the preceding address bytes that the node will trying to mine. If you specify 1 byte (two hex characters) - it means that it will mine a chunk address with proximity order 16 to the specified target prefix; if you specify 2 bytes you're already at 32, and you are trying to mine an address with po 48, i.e. you're going to have to go through 281474976710656 chunks and hash them in order to get an address at po 48 (if you're lucky!)

acud commented 3 years ago

Sorry I actually made a mistake. Each byte is 8 PO gains, so 3 bytes is PO 24. That's 16777216 hashing operations.

istae commented 3 years ago

@acud As @ldeffenb mentioned, the recovery.NewCallback function that the /chunks api uses sends a PSS message with the RECOVERY topic to retrieve the chunk from the network. So it needs to stamp the payload before doing so, but a nil stamper is currently passed in. In order to stamp, we need a valid batch ID.

Is the correct implementation reading the batch ID from the request header?

ldeffenb commented 3 years ago

Any idea when this one will get attention? I agree with @istae that the request specifying the ?targets= should be required to include a batch for use in sending any required PSS messages for the actual recoveries.

acud commented 3 years ago

@ldeffenb this is not so high up on our priority list. there are some usability problems with the current implementation of global pinning that are a bit difficult to solve right now, and after an in-depth look they might even require protocol changes on some degree so that good UX can ensue.

ldeffenb commented 3 years ago

Disappointed, but understood. I'm still working to characterize the storage failures of the swarm (some small percentage of freshly pushed chunks being non-retrievable within a day or two) and the global pinning would allow me to develop a way to have the missing chunks automatically supplied when discovered.

I'll keep a watch on this and wait to see what direction it takes when it bubbles to the top of the priority list.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 5 days with no activity.

tmm360 commented 2 years ago

If this is not solved, please @acud open again.

acud commented 2 years ago

so... this is a multi-tiered problem... relates to the global pinning bad UX. the problem here is that you need to provide a postage stamp on a GET call which I personally find to be *shrug* at best. This is not on our prio right now, and global pinning needs a serious overhaul with a potential dedicated protocol if it is to be used by anyone

tmm360 commented 2 years ago

so, until an eventual protocol upgrade, global pinning is now a "not-maintained" feature? For understand if consider it on apps design

ldeffenb commented 2 years ago

But it seems that maybe it should be disabled, or at least patched so that any use of it doesn't actually cause the node to crash. Better to return an error or silently do nothing, than to crash with a hard-coded null stamper reference.

acud commented 2 years ago

so, until an eventual protocol upgrade, global pinning is now a "not-maintained" feature? For understand if consider it on apps design

It has significant UX problems and we don't have bandwidth to further develop it at this stage. We're also looking into ways to offload this to L2 services that could do the message interaction for you. Obviously, you can write an application that somehow tries to handle this correctly and attaches all necessary information to the request so that you can indeed use global pinning. However if the case is for an application that does not employ a custom gateway (that attaches the necessary postage stamp to the request), my assumption at this stage is that the requirement that users should have a postage stamp ready and attached to every HTTP GET seems not so user friendly and counter intuitive both for users and developers alike. So TL;DR: you can keep using it in its current form; it will probably get revamped at some point (in which case you'll have ample time to react); we're not gonna yank it out of the codebase for now; and indeed the panic should be resolved.

tmm360 commented 2 years ago

Thank you for explanation. Luckily it isn't a blocking feature for us, anyway now I know that its better if we try to find alternative solutions, if will need.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days.

ethersphere / bee