ethersphere / swarm

Swarm: Censorship resistant storage and communication infrastructure for a truly sovereign digital society
https://swarm.ethereum.org/
GNU Lesser General Public License v3.0

Fix 200 OK return EOF #373

Open nonsense opened 6 years ago

nonsense commented 6 years ago

When we upload a 50MB file, it appears that nodes respond with 200 OK before the file has finished syncing to the given node:

```
TRACE[03-30|19:06:24.640387|swarm-smoke/upload_and_sync.go:77] http get response                        ruid=1fd89f5b api=http://8509.testing.swarm-gateways.net hash=11b5487b5d307c10d940d9e4aa05643fa00c82a428d76ec9027245add31f834c code=200 len=50000000
WARN [03-30|19:06:28.841038|swarm-smoke/upload_and_sync.go:89] unexpected EOF                           ruid=1fd89f5b
```

For example, try running the following command, and notice how we get errors for ~1-2 minutes, but after that the file is retrieved successfully:

```
swarm-smoke --filesize 50 --bzz-api http://8504.testing.swarm-gateways.net c
```

zelig commented 6 years ago

So this is the correct behaviour: the Content-Length header is set to the expected body size and a 200 response is returned. If the transport fails to stream the response body, an unexpected EOF is expected and correct.

Curl for instance returns:

```
curl: (18) transfer closed with 45046592 bytes remaining to read
```
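
A Go client surfaces the same condition as an error from the body read once fewer bytes than Content-Length arrive. A minimal sketch (the gateway URL and hash are placeholders, for illustration only):

```go
package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	// Placeholder gateway URL and hash, for illustration only.
	resp, err := http.Get("http://localhost:8500/bzz:/0123456789abcdef/")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The server has already committed to 200 OK and a Content-Length,
	// so a failed retrieval only shows up as a short body.
	n, err := io.Copy(io.Discard, resp.Body)
	if err != nil {
		// Typically io.ErrUnexpectedEOF when the transfer is cut short.
		log.Fatalf("read %d of %d bytes: %v", n, resp.ContentLength, err)
	}
	log.Printf("read %d bytes, Content-Length was %d", n, resp.ContentLength)
}
```
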
nonsense commented 6 years ago

I am not sure the transport is supposed to fail so often. This should be a rare occurrence, not something we get on every request for a 50MB file.

However, I agree that this is probably a scenario that clients of Swarm must account for.

nonsense commented 6 years ago

Ultimately the problem is within the ChunkStore, which returns ErrChunkNotFound during Read called within serveContent. ~~I am not yet sure if this happens during syncing, or whether it can happen even after nodes are synced.~~

The problem appears to happen while the nodes are syncing. If we add a sleep proportional to the uploaded file size, the error occurs less often.

Intuitively, if we agree that Swarm is eventually consistent and queries are not expected to be handled until chunks are synced to the respective nodes, the correct behaviour would be to return 404 to the user while syncing takes place, not an unexpected EOF.

nonsense commented 6 years ago

The error we have to find is:

```
root@swarmtest-2:/home/swarmtest# docker logs swarm_8506 | grep ERROR
swarm_8506 ERROR[04-22|14:36:37.39143|swarm/storage/chunker.go:522]        lazychunkreader.join                     key=d58efca29d993f75600579500355635552b67f5212b908bea9366892279e047c err="chunk not found"
swarm_8506 ERROR[04-22|14:36:37.391577|swarm/storage/chunker.go:468]        lazychunkreader.readat.errc              err="chunk 393216-397312 not found; key: d58efca29d993f75600579500355635552b67f5212b908bea9366892279e047c"
swarm_8506 ERROR[04-22|14:36:37.391648|swarm/storage/chunker.go:542]        lazychunkreader.readat                   read=0 err="chunk 393216-397312 not found; key: d58efca29d993f75600579500355635552b67f5212b908bea9366892279e047c"
swarm_8506 ERROR[04-22|14:36:37.3915|swarm/storage/chunker.go:522]        lazychunkreader.join                     key=cbd6e4b746e4432c9e1fc36306520f176977dba425c3971be6b5a207376cf6ce err="chunk not found"
root@swarmtest-2:/home/swarmtest# docker logs swarm_8504 | grep ERROR
```

We have to trace the request for key d58efca29d993f75600579500355635552b67f5212b908bea9366892279e047c and understand why it results in a chunk not found error.

zelig commented 6 years ago

I think 404 is misleading. We can just say a document is found but cannot be fully retrieved. I think the behaviour is like reading a half-saved file on disk; clients (including maybe the CLI) should implement a loop of partial retrievals.

Plus, as I said, timed-out requests should be repeated until an overall timeout kicks in.
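
A rough sketch of such a client-side loop, assuming the gateway honours Range requests (plausible given the serveContent path mentioned above, but an assumption here); the gateway URL and hash are placeholders, not a recommended final design:

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// resumeDownload keeps retrying a download from the last received byte using
// Range requests until the whole body arrives or ctx expires.
func resumeDownload(ctx context.Context, url string) ([]byte, error) {
	var buf bytes.Buffer
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return nil, err
		}
		if buf.Len() > 0 {
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-", buf.Len()))
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return nil, err
		}
		if buf.Len() > 0 && resp.StatusCode != http.StatusPartialContent {
			buf.Reset() // server ignored the Range header; start over
		}
		_, copyErr := io.Copy(&buf, resp.Body)
		resp.Body.Close()
		if copyErr == nil {
			return buf.Bytes(), nil // transfer completed cleanly
		}
		// Truncated transfer (e.g. unexpected EOF): back off briefly and resume.
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("gave up after %d bytes: %w", buf.Len(), copyErr)
		case <-time.After(2 * time.Second):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	// Placeholder gateway URL and hash, for illustration only.
	data, err := resumeDownload(ctx, "http://localhost:8500/bzz:/0123456789abcdef/")
	fmt.Println(len(data), err)
}
```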

nonsense commented 6 years ago

I agree 404 might be misleading, but then again this is exactly what you'd get if you upload to one node and immediately try to retrieve from another node.

I think the behaviour should be similar to eventually-consistent databases, and then 404 starts to make sense: you just can't retrieve the latest version from such databases until changes have propagated.

I don't like the half-saved-file-on-disk analogy, as that is more of an error case.

zelig commented 6 years ago

I think the half-saved file is a very accurate analogy. Swarm can be considered the world hard disk.

nonsense commented 6 years ago

I think this is bad UX. How often do you read from disk and get back a corrupt (incomplete) file? Almost never, because this is abstracted properly at the kernel level.

This will be unexpected for users, and we will get issues and bug reports opened if we go live with it as is.

I agree that Swarm should be considered the world hard disk, but we should also provide an intuitive API which behaves more or less as users expect. Right now this is not the case with the HTTP GET API, in my opinion.

(Sorry hit the close button by mistake :( )

acud commented 6 years ago

This behaviour can also be addressed with the correct HTTP codes. A node receiving content that still has to be synced should respond with a 202 Accepted and not a 200 OK. If you want to serve partial content, you can do so with an HTTP 206 Partial Content, halting the transfer at a predetermined point and waiting for the other chunks to arrive. Ideally, if a chunk is missing, the transfer shouldn't be aborted or errored but halted, retrying just the missing/not-found chunks from the network. My 2 cents.

gbalint commented 6 years ago

IMHO the correct response code is 202: "The request has been accepted for processing, but the processing has not been completed" (https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.2.3). In the response body it can return something meaningful indicating that the content is not yet properly synced in Swarm.
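
A sketch of what that proposal could look like on the GET path; `isFullySynced` is a hypothetical helper, not an existing Swarm API, standing in for whatever check could tell the gateway that every chunk under the requested root is locally retrievable:

```go
package main

import (
	"fmt"
	"net/http"
)

// isFullySynced is a hypothetical stand-in for a check that every chunk under
// the requested root hash is locally retrievable; no such API is assumed to
// exist in Swarm as-is.
func isFullySynced(root string) (bool, error) {
	// ... not implemented in this sketch ...
	return false, nil
}

// handleBzzGet sketches the proposed behaviour: 202 while the content is
// still syncing, 200 only once it can be served in full.
func handleBzzGet(w http.ResponseWriter, r *http.Request) {
	root := r.URL.Path // simplified: treat the request path as the root hash

	ok, err := isFullySynced(root)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	if !ok {
		// 202 Accepted: the request is understood, but the network has not
		// finished processing (syncing) the content yet.
		w.WriteHeader(http.StatusAccepted)
		fmt.Fprintln(w, "content not fully synced yet; retry later")
		return
	}

	// Only now is it safe to promise 200 OK and a Content-Length.
	w.WriteHeader(http.StatusOK)
	// ... stream the reconstructed file here ...
}

func main() {
	http.HandleFunc("/bzz:/", handleBzzGet)
	_ = http.ListenAndServe(":8500", nil)
}
```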

acud commented 6 years ago

If there's a chunk error, you can also return a 102 Processing.

nonsense commented 6 years ago

It seems like we agree that this is bad UX and that we should somehow address the problem. I think 202 is not appropriate, because generally it is used when you push a request to a server for processing and it responds with 202, i.e. "I got the request, I am processing it". This is completely different from our use case (more details at https://stackoverflow.com/questions/4099869/is-it-wrong-to-return-202-accepted-in-response-to-http-get).

200 OK (the current situation) is also misleading.

I think 404 is the most appropriate, because until an item is synced within Swarm it is not available, i.e. it does not exist yet. As soon as we respond with 200, we should have very high confidence that the request will be served correctly.

zelig commented 6 years ago

I am not sure we can even differentiate between 404 (file/chunks missing that were never uploaded to Swarm) and 202 (file/chunks that we think were uploaded to Swarm but are currently not available for some reason).

@nonsense this is exactly right, glad you said it. For a third-party requestor no distinction is possible between 'not found', 'not retrievable' and 'not yet synced'; it is only possible to track syncing from the uploader. I think what you want, Anton, is not even compatible with streaming content: the problem here is that the source reader is read, i.e. the response body is written, as the requestor loads it. To figure out that there are missing chunks we would need to go through the reconstruction and verification before the response header is sent.
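
To make that constraint concrete: once net/http has committed the header (200 plus Content-Length), a read error from the underlying reader can only surface as a truncated body, never as a different status code. A standalone illustration, not Swarm code:

```go
package main

import (
	"errors"
	"io"
	"net/http"
	"strconv"
)

// failingReader stands in for the lazy chunk reader: it serves some bytes,
// then fails as if a chunk were not found.
type failingReader struct{ served int }

func (f *failingReader) Read(p []byte) (int, error) {
	if f.served >= 4096 {
		return 0, errors.New("chunk not found") // simulated mid-stream failure
	}
	n := copy(p, make([]byte, 4096-f.served))
	f.served += n
	return n, nil
}

func handler(w http.ResponseWriter, r *http.Request) {
	// The header, including 200 and Content-Length, is committed here,
	// before we know whether every chunk can actually be read.
	w.Header().Set("Content-Length", strconv.Itoa(50_000_000))
	w.WriteHeader(http.StatusOK)

	// When the copy hits the read error, the only thing left to do is stop
	// writing; the client sees a short body (curl: "transfer closed with
	// N bytes remaining", Go clients: unexpected EOF).
	if _, err := io.Copy(w, &failingReader{}); err != nil {
		// Too late to change the status code; we can only stop writing.
		return
	}
}

func main() {
	http.HandleFunc("/", handler)
	_ = http.ListenAndServe(":8080", nil)
}
```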

acud commented 6 years ago

> I think 404 is the most appropriate, because until an item is synced within Swarm it is not available, i.e. it does not exist yet. As soon as we respond with 200, we should have very high confidence that the request will be served correctly.

I disagree with a 404. A partial resource is not necessarily a 404; a 102 should be returned in that case (the network is still processing the request). 404 is too catastrophic in my eyes and would lead users to think that the content they pushed somehow got lost along the way, which is not the case.

acud commented 6 years ago

Btw @nonsense, the 202 should be returned when you upload the file (i.e. as a response to the POST), not when you download it.

zelig commented 6 years ago

@justelad as a requestor client you are free to handle multiple partial reads, but Swarm should respect the request timeout.

@nonsense I agree we should not be getting this error so often that people will submit reports. I am planning to take two measures against this:

acud commented 6 years ago

> @justelad as a requestor client you are free to handle multiple partial reads, but Swarm should respect the request timeout.

Then the response should come back as a 206 Partial Content.

nonsense commented 6 years ago

> Btw @nonsense, the 202 should be returned when you upload the file (i.e. as a response to the POST), not when you download it.

We are discussing what happens when you are downloading/issuing a GET request. But yeah, fixing the response code on the POST, if it is not 202, would probably be a good idea as well.

acud commented 6 years ago

> We are discussing what happens when you are downloading/issuing a GET request. But yeah, fixing the response code on the POST, if it is not 202, would probably be a good idea as well.

I am following the conversation :) The first part of an API consumer's understanding of how an API works is analysing its responses. 200 suggests that the content has been received and stored, period. 202 suggests an async operation that does not complete immediately; hence, an immediate request for the same data might result in an error or some other predefined behaviour (which we can obviously define and give attention to).

nonsense commented 6 years ago

> 200 suggests that the content has been received and stored, period.

Agreed, but we have no way of knowing right now whether the content is synced to the respective nodes.

> 202 suggests an async operation that does not complete immediately; hence, an immediate request for the same data might result in an error or some other predefined behaviour

This is something I would not expect, having worked with other APIs. It is not intuitive behaviour and probably goes against REST conventions.

nonsense commented 6 years ago

After internal discussion, we agreed to try increasing the search timeout and re-requesting chunks until the global context timeout is reached. This might solve the issue until we have proof of custody of chunks, etc.
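
A sketch of the shape of that retry; `fetchChunk` and the timeout values are hypothetical placeholders rather than the actual Swarm retrieval API. Each attempt gets its own search timeout, and attempts repeat until the request-scoped (global) context expires:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchChunk is a hypothetical stand-in for a single network retrieval
// attempt bounded by its own search timeout; the real Swarm API differs.
func fetchChunk(ctx context.Context, key string) ([]byte, error) {
	// ... one retrieval attempt ...
	return nil, errors.New("chunk not found")
}

// fetchWithRetry re-requests a chunk until it arrives or the request-scoped
// (global) context is done.
func fetchWithRetry(ctx context.Context, key string, searchTimeout time.Duration) ([]byte, error) {
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, searchTimeout)
		data, err := fetchChunk(attemptCtx, key)
		cancel()
		if err == nil {
			return data, nil
		}
		select {
		case <-ctx.Done():
			// Global timeout reached; only now surface the failure.
			return nil, fmt.Errorf("chunk %s: %w (global timeout reached)", key, err)
		case <-time.After(250 * time.Millisecond): // brief pause before retrying
		}
	}
}

func main() {
	// Hypothetical key and timeouts, for illustration only.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err := fetchWithRetry(ctx, "d58efca2...", time.Second)
	fmt.Println(err)
}
```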

holisticode commented 6 years ago

My opinion is that, looking at it from the "world computer" angle, things can be separated:

lmars commented 6 years ago

It seems like you landed on the suggestion I am about to make, but I'll make it anyway.

Considering the filesystem analogy, let's assume it is a network filesystem like NFS. If I open a file and the kernel tells me it exists, then I expect to be able to just sequentially read that file. If the kernel is slow at retrieving a chunk of that file, I end up just waiting until it has the chunk. I can set a timeout on the read calls if I want to, but maybe it is more important to just get the file regardless of how long it takes, rather than getting an error because the network is running a little slow.

So in Swarm, if a node finds a root chunk, it should assume the file exists and just try its best to get all the chunks until the client gives up and disconnects (basically until req.Context().Done() closes). I haven't looked at the latest code, but I was hoping we would adopt the context pattern throughout the API, which would make this the natural choice.

nonsense commented 6 years ago

@lmars thanks for the suggestion and the analogy. This is indeed what we are planning to do to fix this.

gbalint commented 6 years ago

Connected to https://github.com/ethersphere/go-ethereum/issues/409 and https://github.com/ethersphere/go-ethereum/issues/428

gbalint commented 6 years ago

@nonsense can we close this, now that the search timeout has been raised?

nonsense commented 6 years ago

I'd rather we keep it around. We don't have a good solution, so it is good to keep the history open in case other people encounter this and are confused just as we were.

skylenet commented 5 years ago

@nonsense close this?