nonsense opened this issue 6 years ago
so this is the correct behaviour. the `Content-Length` header is set to the expected body size and response 200 is returned. if transport fails to stream the response body, an unexpected EOF is expected and correct.
Curl for instance returns:

```
curl: (18) transfer closed with 45046592 bytes remaining to read
```
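For reference, a plain Go client sees the same failure mode as curl: the 200 status and `Content-Length` arrive fine, and the error only surfaces while draining the body. A minimal sketch (the gateway URL and hash are placeholders, not from this issue):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder URL; any bzz gateway path serving a large file would do.
	resp, err := http.Get("http://localhost:8500/bzz:/your-hash-here/file.bin")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The 200 status and Content-Length are already committed at this point.
	n, err := io.Copy(io.Discard, resp.Body)
	if err != nil {
		// Typically io.ErrUnexpectedEOF: the connection closed with fewer than
		// Content-Length bytes delivered, the same condition curl reports as
		// "transfer closed with N bytes remaining to read".
		fmt.Printf("read %d of %d bytes: %v\n", n, resp.ContentLength, err)
		return
	}
	fmt.Printf("read all %d bytes\n", n)
}
```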
I am not sure transport is supposed to fail so often. This should be a rare occurrence, not something we get on every request for a 50MB file.
However I agree that this is probably a scenario that clients of swarm must account for.
Ultimately the problem is within the `ChunkStore`, which returns `ErrChunkNotFound` during `Read`, called within `serveContent`. ~~I am not yet sure if this happens during syncing, or if it is possible to happen even after nodes are synced.~~
The problem appears to be happening while the nodes are syncing: if we add a sleep proportional to the uploaded file size, the error occurs less often.
Intuitively, if we agree that Swarm is eventually consistent and queries are not expected to be handled until chunks are synced to the respective nodes, the correct behaviour would be to return a 404 to the user while syncing takes place, and not an unexpected EOF.
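To make the mechanics concrete: once Go's `http.ServeContent` has written the status line and `Content-Length`, a later `Read` error can only truncate the body; it cannot be turned into a different status code. A self-contained sketch of that behaviour, where `faultyReader` and `errChunkNotFound` are stand-ins for the lazy chunk reader and `ErrChunkNotFound`, not the actual swarm types:

```go
package main

import (
	"errors"
	"io"
	"net/http"
	"strings"
	"time"
)

var errChunkNotFound = errors.New("chunk not found")

// faultyReader serves data in 4 KB "chunks" and then fails mid-stream,
// imitating a chunk that is not yet present in the local store.
type faultyReader struct {
	io.ReadSeeker
	served, failAt int64
}

func (f *faultyReader) Read(p []byte) (int, error) {
	if f.served >= f.failAt {
		return 0, errChunkNotFound // surfaces while the body is being copied
	}
	if int64(len(p)) > 4096 {
		p = p[:4096]
	}
	n, err := f.ReadSeeker.Read(p)
	f.served += int64(n)
	return n, err
}

func handler(w http.ResponseWriter, r *http.Request) {
	content := strings.NewReader(strings.Repeat("x", 8192))
	w.Header().Set("Content-Type", "application/octet-stream")
	// ServeContent seeks to learn the size, sends "200 OK" plus Content-Length: 8192,
	// and only then starts copying. The second Read fails, so the client receives a
	// 200 response that is 4096 bytes short -- exactly the unexpected EOF above.
	http.ServeContent(w, r, "file.bin", time.Time{}, &faultyReader{ReadSeeker: content, failAt: 4096})
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```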
The error we have to find is:

```
root@swarmtest-2:/home/swarmtest# docker logs swarm_8506 | grep ERROR
swarm_8506 ERROR[04-22|14:36:37.39143|swarm/storage/chunker.go:522] lazychunkreader.join key=d58efca29d993f75600579500355635552b67f5212b908bea9366892279e047c err="chunk not found"
swarm_8506 ERROR[04-22|14:36:37.391577|swarm/storage/chunker.go:468] lazychunkreader.readat.errc err="chunk 393216-397312 not found; key: d58efca29d993f75600579500355635552b67f5212b908bea9366892279e047c"
swarm_8506 ERROR[04-22|14:36:37.391648|swarm/storage/chunker.go:542] lazychunkreader.readat read=0 err="chunk 393216-397312 not found; key: d58efca29d993f75600579500355635552b67f5212b908bea9366892279e047c"
swarm_8506 ERROR[04-22|14:36:37.3915|swarm/storage/chunker.go:522] lazychunkreader.join key=cbd6e4b746e4432c9e1fc36306520f176977dba425c3971be6b5a207376cf6ce err="chunk not found"
root@swarmtest-2:/home/swarmtest# docker logs swarm_8504 | grep ERROR
```

We have to trace the request for key `d58efca29d993f75600579500355635552b67f5212b908bea9366892279e047c` and understand why it results in `chunk not found`.
i think 404 is misleading. we can just say a document is found but cannot be fully retrieved. I think the behaviour is like reading a half-saved file on disk; clients (including maybe the cli) should implement a retry loop for partial retrievals.
Plus, as i said, timed out requests should be repeated until an overall timeout kicks in.
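A sketch of what such a client-side loop could look like, assuming the gateway honours `Range` requests (the download path goes through `serveContent`, so resuming should work): retry the remainder of the file until it is complete or an overall timeout expires. The URL, hash and timeouts are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchWithResume keeps re-requesting the remainder of the file until it is
// complete or ctx (the overall timeout) expires.
func fetchWithResume(ctx context.Context, url string) ([]byte, error) {
	var buf []byte
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return nil, err
		}
		if len(buf) > 0 {
			// Resume from where the previous, truncated response stopped.
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-", len(buf)))
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return nil, err
		}
		part, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()

		switch resp.StatusCode {
		case http.StatusPartialContent:
			buf = append(buf, part...) // the requested suffix
		case http.StatusOK:
			buf = part // server sent the whole body from the start
		default:
			return nil, fmt.Errorf("unexpected status %s", resp.Status)
		}
		if readErr == nil {
			return buf, nil // body delivered in full this time
		}

		// Truncated again: back off briefly, then retry, bounded by ctx.
		select {
		case <-ctx.Done():
			return buf, ctx.Err()
		case <-time.After(time.Second):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	data, err := fetchWithResume(ctx, "http://localhost:8500/bzz:/your-hash-here/file.bin")
	fmt.Println(len(data), err)
}
```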
I agree 404 might be misleading, but then again this is exactly what you'd get if you upload to one node and immediately try to retrieve on another node.
I think behaviour should be similar to eventually-consistent databases, and then 404 starts to make sense - you just can't retrieve latest versions from such databases until changes are propagated.
I don't like the half-saved file on disk analogy, as it describes more of an error condition.
i think the half-saved file is a very accurate analogy. swarm can be considered the world hard disk.
I think this is bad UX. How often do you read from disk and get back a corrupt (incomplete) file? Almost never, because this is abstracted properly at the kernel level.
This will be unexpected for users and we will be getting issues and bugs opened if we go live with it as is.
I agree that swarm should be considered the world hard disk, but we should also provide an intuitive API, which behaves more or less as users expect. Right now this is not the case with the HTTP GET API in my opinion.
(Sorry hit the close button by mistake :( )
this behaviour can also be solved with the correct HTTP codes. a node receiving content that still has to be synced should respond with a `202 Accepted` and not a `200 OK`.
If you want to serve partial content, you can do so with an HTTP `206 Partial Content`, and halt the transfer at a predetermined point to wait for the other chunks to arrive. ideally, if a chunk is missing, the transfer shouldn't be aborted or errored, but halted while just the missing/not-found chunks are retried from the network.
2 cents.
IMHO the correct response code is 202: "The request has been accepted for processing, but the processing has not been completed" (https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.2.3). In the response body it can return something meaningful indicating that the content is not yet properly synced in swarm.
if there's a chunk error you can also return a `102 Processing`
It seems like we agree that this is bad UX and we should somehow address the problem. I think 202 is not appropriate, because generally it is used when you push a request to a server for processing and it responds with 202 - "I got the request, I am processing it". This is completely different from our use-case (more details at https://stackoverflow.com/questions/4099869/is-it-wrong-to-return-202-accepted-in-response-to-http-get)
200 OK (the current situation) is also misleading.
I think 404 is the most appropriate, because until an item is synced within Swarm, it is not available, i.e. it does not exist yet. As soon as we respond with 200, we should have a very high confidence that the request will be served correctly.
i am not sure we can even differentiate between 404 (file/chunks missing that were never uploaded to swarm) and 202 (file/chunks that we think were uploaded to swarm, and currently not available for some reason)
@nonsense this is exactly right, glad you said it. for a 3rd party requestor no distinction is possible between 'not found / not retrievable / not yet synced' - it is only possible to track syncing from the uploader. I think what you want, Anton, is not even compatible with streaming content: i.e. the problem here is that the response is written as the source reader is read, while the requestor loads it. to figure out there are missing chunks we would need to go through the reconstruction and verification before the response header is sent.
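Just to spell out what that pre-validation would imply: the handler would have to do a full dry read of the content before writing any header, so a missing chunk can still become an error status, at the cost of reading everything twice and delaying the first byte. A rough sketch, where `newLazyReader` is a hypothetical stand-in for whatever produces the chunked `io.ReadSeeker`:

```go
package main

import (
	"io"
	"net/http"
	"strings"
	"time"
)

// newLazyReader is a hypothetical stand-in for whatever constructs the chunked
// reader for a given key; it is not the real swarm API.
func newLazyReader(key string) io.ReadSeeker {
	return strings.NewReader("placeholder content")
}

func handler(w http.ResponseWriter, r *http.Request) {
	content := newLazyReader(r.URL.Path)

	// Dry read: reconstruct and verify the whole content before any header is
	// written, so a missing chunk can still become an error status instead of a
	// truncated 200.
	if _, err := io.Copy(io.Discard, content); err != nil {
		http.Error(w, "content not (yet) fully retrievable", http.StatusNotFound)
		return
	}
	if _, err := content.Seek(0, io.SeekStart); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Second pass actually streams the body: everything is read twice, which is
	// the price of committing to the status code up front.
	http.ServeContent(w, r, "file.bin", time.Time{}, content)
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```

The trade-off is obvious: for a 50MB file this doubles the reads and adds the full reconstruction latency before the first byte is sent.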
> I think 404 is the most appropriate, because until an item is synced within Swarm, it is not available, i.e. it does not exist yet. As soon as we respond with 200, we should have a very high confidence that the request will be served correctly.
i disagree with a 404. a partial resource is not necessarily a 404. `102` should be returned in that case (the network is still processing the request). `404` is too catastrophic in my eyes, and would lead users to think that the content they pushed somehow got lost along the way, which is not the case.
btw @nonsense, the `202` should be returned when you upload the file (e.g. as a response to the POST), not when you download it
@justelad as a requestor client you are free to handle multiple partial reads, but swarm should respect the request timeout
@nonsense i agree it should not happen so often that people will submit reports about this error. I am planning to take 2 measures against this:
> @justelad as a requestor client you are free to handle multiple partial reads, but swarm should respect the request timeout

then the response should come under a `206 Partial Content`
> btw @nonsense, the 202 should be returned when you upload the file (e.g. as a response to the POST), not when you download it
we are discussing what happens when you are downloading/issuing a GET request. but yeah, fixing the response code on the POST, if it is not 202, would probably be a good idea as well.
> we are discussing what happens when you are downloading/issuing a GET request. but yeah, fixing the response code on the POST, if it is not 202, would probably be a good idea as well.
i am following the conversation :) the first part of an API consumer's understanding of the intricate operations of how an API works is analysing its responses. `200` suggests that the content has been received and stored, period. `202` suggests an async operation that does not complete immediately; hence, an immediate request of the same data might result in an error or some other predefined behaviour (which we can obviously define and give attention to).
> 200 suggests that the content has been received and stored, period.

agree, but we have no way of knowing right now if the content is synced to the respective nodes.

> 202 suggests an async operation that does not complete immediately; hence, an immediate request of the same data might result in an error or some other predefined behaviour

this is something I would not expect, having worked with other APIs. it is not intuitive behaviour and is probably against REST conventions.
After internal discussion we agreed to try to increase the search timeout and re-request chunks until the global context timeout is reached. This might solve the issue, until we have proof of custody of chunks, etc.
My opinion, looking at it from the "world computer" angle, is that things can be separated:

- A client which uploads a resource needs to know if that operation worked. `200 OK` is correct if the upload succeeded; actually `201 Created` is fine too.
- A client requesting the file should get `200` if the file is available. If not, a `404` should be returned, whatever the reason; as I understand it, a client would, in our case above, be required to retry the request. So the only other appropriate response I could see is one whose body contains a "please retry" code.
It seems like you landed on the suggestion I am about to make, but I'll make it anyway.
Considering the filesystem analogy, let's assume it is a network filesystem like NFS. If I open a file and the kernel tells me it exists, then I expect to be able to just sequentially read that file. If the kernel is slow at retrieving a chunk of that file, I end up just waiting until it has the chunk. I can set a timeout on the read calls if I want to, but maybe it is more important to just get the file regardless of how long it takes, rather than getting an error because the network is running a little slow.
So in Swarm, if a node finds a root chunk, it should assume the file exists and just try its best to get all the chunks until the client gives up and disconnects (basically until `req.Context().Done()` closes). I haven't looked at the latest code, but I was hoping we would adopt the context pattern throughout the API, which would make this the natural choice.
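A sketch of that behaviour at the chunk level, using hypothetical names (`chunkStore`, `Get`, `errChunkNotFound` and the 10-second per-attempt search timeout are assumptions, not the actual swarm API): on a missing chunk, keep re-requesting from the network and only give up when the request context is cancelled.

```go
package retrieval

import (
	"context"
	"errors"
	"time"
)

var errChunkNotFound = errors.New("chunk not found")

// chunkStore is a stand-in for the real store; Get either returns the chunk
// or errChunkNotFound if it has not been synced/retrieved yet.
type chunkStore interface {
	Get(ctx context.Context, key []byte) ([]byte, error)
}

// getChunkBlocking keeps asking the network for a chunk until it shows up or
// the request context (req.Context()) is done -- the NFS-like behaviour above.
func getChunkBlocking(ctx context.Context, store chunkStore, key []byte) ([]byte, error) {
	for {
		// Bound each individual search, then retry; the outer ctx is the only
		// thing that aborts the read for good (client disconnect or deadline).
		attempt, cancel := context.WithTimeout(ctx, 10*time.Second)
		chunk, err := store.Get(attempt, key)
		cancel()

		if err == nil {
			return chunk, nil
		}
		if !errors.Is(err, errChunkNotFound) && !errors.Is(err, context.DeadlineExceeded) {
			return nil, err
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err() // client gave up / global timeout reached
		case <-time.After(500 * time.Millisecond):
			// back off a little, then ask the network again
		}
	}
}
```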
@lmars thanks for the suggestion and the analogy. This is indeed what we are planning to do to fix this.
@nonsense can we close this because the search timeout has been raised?
I'd rather we keep it around. We don't have a good solution, so it is good to keep the history open in case other people encounter this and are also confused just as we were.
@nonsense close this?
When we upload a 50MB file, it appears that nodes are responding with `200 OK` before the file has finished syncing on the given node.
For example, try running the following command, and notice how we get errors for ~1-2 minutes, but after that the file is retrieved successfully:
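(The original command did not survive in this thread. Purely as an illustration, a rough Go equivalent of that kind of retry loop, with a placeholder gateway URL and hash, polling until the first fully successful retrieval:)

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// tryOnce downloads the full file once; it fails with an unexpected EOF while
// chunks are still missing on the serving node.
func tryOnce(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("status %s", resp.Status)
	}
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	url := "http://localhost:8500/bzz:/your-hash-here/file.bin" // placeholder
	start := time.Now()
	for {
		if err := tryOnce(url); err != nil {
			fmt.Printf("%s: %v\n", time.Since(start).Round(time.Second), err)
			time.Sleep(5 * time.Second)
			continue
		}
		fmt.Printf("fully retrieved after %s\n", time.Since(start).Round(time.Second))
		return
	}
}
```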