cs3org / reva

WebDAV/gRPC/HTTP high performance server to link high level clients to storage backends
https://reva.link
Apache License 2.0

Decide on HTTP chunked-data transfer protocol #290

Open labkode opened 4 years ago

labkode commented 4 years ago

Compare different chunked data transfer protocols and choose the one that everyone should implement.

The protocol needs to support uploading data in chunks (files are split into small chunks) and must be resumable: if the transfer is cut, the client should be able to discover which chunks are still missing.

For the latter, should the client remember which chunks it has already uploaded, or should it ask the server for the list? Different protocols handle this differently.

Proposed protocols:

guruz commented 4 years ago

You forgot the one used by ownCloud ;-) ("chunking NG") https://github.com/cernbox/smashbox/blob/master/protocol/chunking.md (Not sure if this is the latest greatest documentation though, the desktop code is what counts)

labkode commented 4 years ago

@guruz I've updated the issue description with the two protocols that I remember, thanks for the heads up!

guruz commented 4 years ago

@labkode https://github.com/cernbox/smashbox/blob/master/protocol/chunking.md is chunking NG. The old chunking is https://github.com/owncloud/core/wiki/spec:-big-file-chunking and not worth putting on your list :-)

labkode commented 4 years ago

@guruz links updated :) We still use the old chunking ... and we definitely want to get rid of it with whatever we choose for the new APIs.

butonic commented 4 years ago

Yes, I have a strong opinion which one we should use.

labkode commented 4 years ago

The current data service will go away once we decide on the protocol. It is there for the MVP and it does the job: files can be uploaded and downloaded, and the current desktop sync client protocol is implemented in its own service. Once we decide on the direction we want to go, we'll change it; as you mentioned, it is not the most efficient way to transfer files and it is not resumable.

tus.io and ownCloud rely solely on HTTP headers while the rest rely on some encoded payload. @butonic prefers relying only on headers and I also share the same opinion.

I've done a quick reading of tus.io and I like that it is well documented and the intention is nice, with good examples, and it comes with extension support. It also offers data integrity via an extension; however, to be compatible with that extension you need to support at least SHA-1, which won't be the case for many users. Hence we'd need to create another extension if we want to be compliant with the protocol (motivation: you don't want to re-checksum 7PB of data with SHA-1 just to be compliant).

I've also found some major drawbacks compared with other protocols:

For non-chunked uploads: (not too much drawback)

For chunked and non-parallel uploads (not too much drawback):

For chunked and parallel uploads (major drawback): this is where the protocol is really inefficient and major consideration is needed for the sync client:

For every single chunk, you need to:

Once all the chunks have been uploaded, the client (which has to remember the URLs of all the previous chunks) needs to send a final POST to concatenate the chunks using the Concatenation extension:

POST /files HTTP/1.1
Upload-Concat: final;/files/a /files/b

Some side notes:

If a URL for a chunk looks like this: https://test.com/files/50dbde82-e104-44c2-ad92-14396580939a, which is about 60 bytes, and the default header limit for Apache is 8KiB, then the maximum number of chunks you can send before tuning the server configuration is about 135. With 5MB chunks you can upload only a ~675MB file before hitting 413 Request Entity Too Large ... and that means tuning all the proxies along the way ...
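A quick back-of-the-envelope check of these numbers, as a Go sketch. Assumptions: the whole `final;...` value must fit into one 8 KiB header field, chunk URLs are separated by single spaces, and the `Upload-Concat:` field name itself is not counted. The sample URL is 59 bytes, and this estimate lands at 136 chunks (~680 MB with 5 MB chunks), in line with the ~135 / ~675MB figures above.

```go
package main

import "fmt"

// maxChunkURLs estimates how many chunk URLs of urlLen bytes fit into one
// "final;<url> <url> ..." header value of at most limit bytes.
func maxChunkURLs(limit, urlLen int) int {
	const prefix = len("final;")
	if limit <= prefix {
		return 0
	}
	// n URLs need n*urlLen bytes plus (n-1) separating spaces:
	// n*(urlLen+1) - 1 <= limit - prefix
	return (limit - prefix + 1) / (urlLen + 1)
}

// maxFileSize is the largest file (in bytes) assembled from n chunks of chunkSize.
func maxFileSize(n int, chunkSize int64) int64 {
	return int64(n) * chunkSize
}

func main() {
	const (
		headerLimit = 8 * 1024 // 8 KiB, as stated above
		urlLen      = len("https://test.com/files/50dbde82-e104-44c2-ad92-14396580939a")
		chunkSize   = 5 * 1000 * 1000 // 5 MB chunks
	)
	n := maxChunkURLs(headerLimit, urlLen)
	fmt.Printf("URL length: %d bytes\n", urlLen)
	fmt.Printf("chunk URLs per header: %d\n", n)
	fmt.Printf("max file size: ~%d MB\n", maxFileSize(n, chunkSize)/1000/1000)
}
```

Relative upload paths (as in `/files/a`) would raise the count, but the limit stays linear in the number of chunks.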

It would be nice if more people could take a look at the protocol and give their opinion @guruz @moscicki @dragotin @diocas @glpatcern

butonic commented 4 years ago

It seems the tus.io page has not been updated with the latest protocol spec: https://github.com/tus/tus-resumable-upload-protocol/blob/master/protocol.md

For non-chunked uploads: (not too much drawback)

  • You first need to POST to declare that you want to upload a file (using the tus Creation extension); the server then provides the location where to upload the data.
  • You do a PATCH with the Termination extension, as you only have one chunk; the Termination extension is needed so the server knows the file is complete and can be assembled/closed.

The creation-with-upload extension allows sending a body with the initial POST.

That being said, we currently have an InitiateUploadRequest(). We can either use it to set up the upload and allow sending a payload for small files / the first chunk, in effect implementing the creation and creation-with-upload extensions with CS3, so the datasvc would not need to implement the creation* extensions. Or we have the datasvc implement creation and creation-with-upload. In the former case a client can upload files using a single request. In the latter case, a client needs two requests, as with today's PUT: an InitiateUploadRequest to CS3 and a PATCH to the datasvc.

Regarding `termination is an optional feature specified as an extension.`, see below.

For chunked and non-parallel uploads (not too much drawback):

  • POST to declare the upload

As above, creation-with-upload allows sending the first chunk with the creation request. Note that the chunk size is not limited: you can just try to upload everything. If it fails, you can resume by looking up the offset with a HEAD request.

  • PATCH multiple times to upload the different chunks, and a final PATCH with the Termination extension so the server knows the file is complete and can be assembled/closed.

Not quite, from the spec:

The Client SHOULD send all the remaining bytes of an upload in a single PATCH request, but MAY also use multiple small requests successively for scenarios where this is desirable.

The response to the PATCH contains an Upload-Offset header that the client can use to check whether all bytes have been received. The server will trigger an internal "upload finished" event once the Upload-Offset is equal to the Upload-Length, a header the client can send with the initial POST or any subsequent PATCH ... basically as soon as it knows the size. The quote from the spec also already mentions chunks and checksumming, just in case you do not want to rely on the built-in TCP checksumming. Different topic ... but they already thought of it and specified it as an optional extension. Termination is only needed if we want clients to be able to tell the server to free the resources, e.g. when someone hits cancel in the web UI. Termination is an optional feature specified as an extension.
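To make the completion rule concrete, a hypothetical in-memory model of how a data service could track one upload (not reva's datasvc code; all names are made up). Every PATCH must start at the current offset, otherwise tus answers 409 Conflict, and the "finished" event fires once the offset equals Upload-Length, even when the length only arrives after some bytes.

```go
package main

import (
	"errors"
	"fmt"
)

// upload models a single tus upload; a real service would write to storage.
type upload struct {
	length   int64 // Upload-Length, or -1 while the client defers it
	offset   int64 // Upload-Offset: bytes received so far
	data     []byte
	finished bool
}

var errOffsetMismatch = errors.New("Upload-Offset mismatch (tus answers 409 Conflict)")

// setLength records Upload-Length once the client knows it.
func (u *upload) setLength(n int64) error {
	if u.length >= 0 && u.length != n {
		return errors.New("Upload-Length cannot be changed once set")
	}
	if n < u.offset {
		return errors.New("Upload-Length is smaller than the bytes already received")
	}
	u.length = n
	u.maybeFinish()
	return nil
}

// patch appends chunk at offset and returns the new Upload-Offset.
func (u *upload) patch(offset int64, chunk []byte) (int64, error) {
	if offset != u.offset {
		return u.offset, errOffsetMismatch
	}
	if u.length >= 0 && u.offset+int64(len(chunk)) > u.length {
		return u.offset, errors.New("chunk would exceed Upload-Length")
	}
	u.data = append(u.data, chunk...)
	u.offset += int64(len(chunk))
	u.maybeFinish()
	return u.offset, nil
}

// maybeFinish fires the "upload finished" event once, when offset == length.
// Here it only sets a flag; a real service would start post-processing.
func (u *upload) maybeFinish() {
	if !u.finished && u.length >= 0 && u.offset == u.length {
		u.finished = true
	}
}

func main() {
	u := &upload{length: -1} // size not known yet (deferred length)
	off, _ := u.patch(0, []byte("hello "))
	fmt.Println("offset:", off, "finished:", u.finished)
	_ = u.setLength(11)
	off, _ = u.patch(off, []byte("world"))
	fmt.Println("offset:", off, "finished:", u.finished)
}
```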

For chunked and parallel uploads (major drawback): this is where the protocol is really inefficient and major consideration is needed for the sync client:

For every single chunk, you need to:

  • POST to declare a chunk, server will provide the location of the chunk.
  • PATCH to the chunk location as many times as the chunk is uploaded

As above, the creation-with-upload makes this a single request.

If a URL for a chunk looks like this: https://test.com/files/50dbde82-e104-44c2-ad92-14396580939a, which is about 60 bytes, and the default header limit for Apache is 8KiB, then the maximum number of chunks you can send before tuning the server configuration is about 135. With 5MB chunks you can upload only a ~675MB file before hitting 413 Request Entity Too Large ... and that means tuning all the proxies along the way ...

This has been discussed in the original issue. Think differently: you don't divide the file into blocks of equal size, but instead decide how many parallel uploads you want to execute. The chunk size is not limited and you can resume each chunk. That is why the authors did not see a use case for thousands of chunks.
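A sketch of that "decide on the number of parallel uploads" approach in Go: split the file into n byte ranges, each uploaded (and resumed) as its own partial upload, so the final Upload-Concat header only lists n URLs. `splitByCount` and `part` are illustrative names.

```go
package main

import "fmt"

// part is one byte range of the file, uploaded as its own partial tus upload.
type part struct {
	Offset, Length int64
}

// splitByCount divides size bytes into n contiguous parts whose lengths differ
// by at most one byte.
func splitByCount(size int64, n int) []part {
	if size <= 0 {
		return []part{{Offset: 0, Length: 0}}
	}
	if n < 1 {
		n = 1
	}
	if int64(n) > size {
		n = int(size) // never create empty parts
	}
	base, rem := size/int64(n), size%int64(n)
	parts := make([]part, 0, n)
	var off int64
	for i := 0; i < n; i++ {
		l := base
		if int64(i) < rem {
			l++ // spread the remainder over the first parts
		}
		parts = append(parts, part{Offset: off, Length: l})
		off += l
	}
	return parts
}

func main() {
	const fileSize = 10 << 30 // 10 GiB
	for _, p := range splitByCount(fileSize, 4) {
		fmt.Printf("offset=%d length=%d\n", p.Offset, p.Length)
	}
}
```

With four parallel parts, a 10 GiB file needs four URLs in the final concatenation instead of roughly 2000 with 5MB chunks.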

It is currently not allowed, but we could easily propose a specification that accepts a list of chunks, either in the body or inside a chunk. The protocol is open and we can extend it if necessary.

For our use case (syncing many files) I think concatenation is not a high priority. We can fill the bandwidth with parallel upload requests for other files. The client needs to detect stalled uploads and use HEAD+PATCH to resume where it left off anyway. Whether we do that with one chunk or multiple is an optimization (one we should use, of course). I do acknowledge that concatenating hundreds of chunks does not allow us to map the oc "chunking NG" algorithm 'nicely', because we have to send chunks in the correct order to accommodate different chunk sizes, which means we may have to store a few chunks. However, AFAICT the client does not send multiple chunks in parallel. So this is a corner case for people who implemented their own client.

A way to solve this on the protocol level would be a new concatenate-babushka extension that in addition to the current:

POST /files HTTP/1.1
Upload-Concat: final;/files/a /files/b

allows:

POST /files HTTP/1.1
Upload-Concat: babushka;/files/a /files/b

The current Upload-Concat: partial header, sent with the POST that creates each partial upload, tells the server not to trigger any processing events for chunks.

Or allow partial in the POST to also carry upload IDs ...

POST /files HTTP/1.1
Upload-Concat: partial;/files/a /files/b

HTTP/1.1 204 No Content
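For illustration, a Go sketch of a server-side parser for the Upload-Concat values discussed here. `final` and `partial` come from the tus concatenation extension; `babushka` and `partial` with a list of upload IDs are only the hypothetical extensions proposed above, and `parseUploadConcat` is a made-up name.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// concat is the parsed value of an Upload-Concat header.
type concat struct {
	Kind string   // "partial", "final", or the hypothetical "babushka"
	URLs []string // space-separated upload URLs after the ';', if any
}

func parseUploadConcat(v string) (concat, error) {
	kind, rest, hasList := strings.Cut(strings.TrimSpace(v), ";")
	switch kind {
	case "partial", "final", "babushka":
	default:
		return concat{}, fmt.Errorf("unknown Upload-Concat type %q", kind)
	}
	c := concat{Kind: kind}
	if hasList {
		c.URLs = strings.Fields(rest)
	}
	// Assembling something requires knowing what to assemble.
	if (kind == "final" || kind == "babushka") && len(c.URLs) == 0 {
		return concat{}, errors.New(kind + " requires at least one upload URL")
	}
	return c, nil
}

func main() {
	for _, v := range []string{"final;/files/a /files/b", "babushka;/files/a /files/b", "partial", "partial;/files/a /files/b"} {
		c, err := parseUploadConcat(v)
		fmt.Printf("%-28q -> %+v err=%v\n", v, c, err)
	}
}
```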

The beauty is that we can specify this part as an extension in their repo and maybe get feedback from @Acconut, @kvz or @evert who already gave comments on the protocol ;-)

felixboehm commented 4 years ago

Tus is the most advanced and our best option. We can provide a high quality of service with tus, while still providing another super simple endpoint for a PUT, which can be used by scripts, simple one-file uploads from an extension, ... without chunks & co.

Good research!

Acconut commented 4 years ago

If you need help with tus, we are happy to help you at https://tus.io/support.html!

jnweiger commented 4 years ago

All of the protocols discussed here so far assume that we want to assemble chunks on the server, so that the storage eventually represents the end user's file as a storage file. This (understandably) leads to complications with temp files and whatnot when range PUT or PATCH requests are not supported by the storage.

I'd like to consider an alternative:

Let's assume, we store chunks as individual chunk files on the storage.

Rough sketch here, please let me know what I am missing.

butonic commented 4 years ago

@jnweiger Something I long for, actually. Another benefit you did not mention explicitly is the Write Once Read Many (WORM) and deduplication property of chunk-based storage. It comes at the cost of maintaining the file-to-chunk relationship, which, if you think about it, ends up implementing a journaling file system, just with block sizes in the range of tens or hundreds of megabytes instead of kilobytes.

The main reason for not pursuing that direction was the fragility of the old php / request driven architecture. Introducing a chunk based storage implementation is possible there, but messing with the way the chunks are passed through the stack to leave them intact and implementing the chunk to file relationships on top of the old filecache ... is alienating.

Furthermore, if you break up files into individual chunks you lose the comparatively easy recoverability. With chunk-based storage you need to back up the file-to-chunk relationship, which is now metadata in addition to the bytes on disk.

I totally agree that when we can assume total control over the storage there are better / more flexible / interesting ways of storing bytes there than just as a file. However, we currently need to be able to trigger a workflow when a file has been uploaded, e.g. to encrypt it, scan it for viruses or send it to another storage via whatever API. For that we need a quarantine or ingress area where files that are being processed / filtered reside before they are allowed to move to the actual storage. Sometimes there are requirements that a file has to physically be kept on a different storage or even in RAM until it has been accepted by whatever workflow. Sometimes even a manual workflow.

That is one of the reasons why the CS3 API delegates file up and download to an external API. Which brings me back to tus.io. It allows us to do both:

  1. upload (and resume!!!) a file upload until it is complete (only Dropbox supports this), or
  2. upload chunks and trigger the assembly with a finalizing request (this is what all other apis implement).

The other reason is that we want to use all storage capabilities as best as possible. That allows us to use the WORM property of certain S3 implementations or other hardware appliances that internally store the files using chunks on disk.

Two things to keep in mind:

EOS has an API that allows range-based PUT requests: http://eos-docs.web.cern.ch/eos-docs/restapi/putrange.html which we can use to stream uploads that arrive via tus.

To summarize, I'd love to implement a storage with WORM and deduplication that is then mounted via FUSE ... but the main goal for this issue and CS3 is to agree on a protocol that is flexible enough to handle different storage backends, yet easy to implement. I think tus.io is best suited for that task: not only does it already support all the requirements we bolted on top of the ownCloud-style WebDAV uploads, like checksums or chunking, it also supports resuming, and extensions are built into the protocol. Depending on the underlying storage technology we could actually expose different storage capabilities as tus extensions, which fits the requirement of integrating whatever storage technology is used very well.

I hope that clarifies my reasoning.

davigonz commented 4 years ago

Wow, glad to read all the discussions here. I'm not an expert in chunked-data transfer protocols but with this issue I can, at least, get an overview of all the possible solutions we could implement.

I've seen that tus.io is one of the favourite candidates and after researching a bit, I found that it also has an Android client in their official GH repository. I would need to try it out but well, it looks good at first sight.

dragotin commented 4 years ago

I was reading through the whole discussion and as one of the authors of the Chunking NG proposal I take the freedom to comment here.

First, I think tus is a great approach. IMHO tus and Chunking NG share the same basic ideas, which is cool, but tus is much further along the way to becoming a real standard: it has many supporters, people who verified its concepts, wrote code and put it in production. That alone makes it the top candidate, and we should certainly contribute rather than implement our own.

I think ownCloud's requirements can mostly be covered with the current feature set of tus. However, there are a few scenarios where I currently cannot see a good solution:

When is the upload finished?

Think the following scenario: Client wants to upload a very big file to a place that is on an external storage which is slow.

  1. Client announces the upload. Server checks for policies, quota etc and allows the upload.
  2. Server now already might "reserve" the space on the external storage or in the quota.
  3. Client does many patch requests (in parallel) to upload content.

Now the tus page says in the FAQ that the upload is finished after the last PATCH has succeeded. That is not true for ownCloud, as it might want to do virus checks on the whole file, move it to the slow external storage, do versioning foo, whatever.

All that takes time, and I assume it would happen during the last PATCH request in the logic of tus. That is dangerous because the client cannot display a status during that time, the last PATCH takes an unpredictably longer time than all other PATCH requests, and finally it might time out, which leaves the upload behind in a bad state. We know that situation in ownCloud.

That is why Chunking NG has the final MOVE at the end, which tus does not have. Between the last PATCH and the final MOVE, the client could regularly send HEAD requests to the upload url to learn about the processing status on the server side. And once all the processing is done, the client should send the MOVE to make the file finally available.

That last move would actually not be needed, but it gives another chance to implement a decision if the client really wants to finalize the upload.

Upload as much as you can

tus says in the FAQ: "Ideally, this PATCH request should contain as much upload content as possible." Yes, of course. But as a client developer I can tell you that it is incredibly hard to decide on the right size. It depends on so many parameters outside of the client that I believe the client can't decide at runtime (except with super clever algorithms, maybe). So it is left to the client developer to make a decision that should work in most circumstances, which can have unpredictable results.

To avoid that, we decided to use a fixed size for the upload chunks (at least back in the day, maybe @ogoffart or @guruz changed that meanwhile). Ideally that size is returned from the server as a configuration value (in OPTIONS in tus).

Delta Sync

Delta sync also benefits greatly from fixed chunks, because they make it easier to do what is described in part three of the initial Chunking NG post.

Let other clients begin early

Something that is not so much about this part of the protocol, but should still be considered: During an upload of a huge file, other clients should already be able to start downloading the same file, well knowing that it is not yet valid.

Result

Again, I think tus is the way to go, but maybe these challenges should be discussed. Maybe @Acconut can comment on how to solve the "when is the upload finished" problem?

michaelstingl commented 4 years ago

to learn about the processing status on the server side

Latest experiments (https://github.com/owncloud/core/pull/31851, https://github.com/owncloud/core/pull/32576, https://github.com/owncloud/client/pull/6614)

maybe @ogoffart or @guruz changed that meanwhile

Dynamic chunk sizes (up to 100MB) introduced with 2.4.0 (https://github.com/owncloud/client/pull/5852)

dragotin commented 4 years ago

Ok, so the AI for dynamic chunk sizes is already there. Of course :smile:

butonic commented 4 years ago

@dragotin AWESOME input! Thx a lot. Let me try to explain how I understand the points you raised.

When is the upload finished?

There are multiple levels of a finished upload:

  1. all bytes are transferred
  2. background processing has happened
  3. the file is available for download

As @dragotin pointed out: in oc10 we learned that a slow storage might cause the 2nd step to take very long. We might run into timeouts when assembling chunks ... or when running the virus scanning. This is why we added async processing. In this case the clients can poll the server for the upload progress. Having a background process and a stable upload ID makes implementing this a lot clearer.

With tus.io we can rely on a dedicated service that can execute arbitrary workflows before making the file available in the storage.

Upload as much as you can

The main feature of tus IMO is resumable uploads. The clients can upload as much as possible without having to think about a chunk size first. If the upload fails (because a proxy in between only allows e.g. 40MB), the client can ask the server how many bytes it has received for the upload.

I don't know if the tus.io client libs already take these special cases into account, but it should be easy to detect when bytes don't reach the server and fall back to fixed-size chunking (which is specified as a tus extension).

This should cover all our use cases.

Delta Sync

We should specify this as a tus extension. Protocol first. Since tus specifies how PATCH can be used to write to a certain offset a lot of the necessary HTTP calls should be there.

Let other clients begin early

One thing I'd like to see is the status of an upload in the UI. The upload ID allows us to fetch the progress from the dataprovider service using the TUS protocol. The TUS spec does not contain any GET requests, so we could contribute an extension that allows clients to download the bytes that are available. However, downloads may have to be deferred until server processing has happened, e.g. to prevent downloading malicious content that has not yet been scanned for viruses. If the upload is encrypted, we could send the decryption key once post-processing has happened, but that is yet another trade-off that has to be designed. Again, I'd propose a tus extension for that.

ogoffart commented 4 years ago

The clients can upload as much as possible without having to think about a chunk size first. If the upload fails (because a proxy in between only allows e.g. 40mb) the client can ask the server how many bytes he has received for the upload.

That only works if the proxy did forward the first 40MB to the ownCloud backend.

Some proxies just buffer the request, and if the buffer is full, or if there is a disconnection or another error, the request is dropped, the ownCloud server never learns about it, and the client has to restart from scratch.

butonic commented 4 years ago

I 100% agree. AFAICT using a single PATCH request might break in various unpredictable ways that would prevent the upload from ever finishing. There actually is a PR for the tus.io protocol to let the server recommend a max chunk size: https://github.com/tus/tus-resumable-upload-protocol/pull/93

With this the server admin can configure a max chunk size, which is exposed to a tus client as the Tus-Max-Chunk-Size header. The client can then use subsequent PATCH requests to upload large files, using the resumable nature of tus ... or ... it uses the concatenation extension if the server supports it and uploads multiple chunks at once, followed by a final assembly. Very much like oc chunking v2. The difference is a properly specified and open protocol that is used by multiple clients.

dragotin commented 4 years ago

@ogoffart would it help to put a timeout on the answer of the first PATCH (or actually on all...) and if it was not answered properly within the time the upload is considered bogus and indeed started from scratch?

butonic commented 2 years ago

@dragotin @Hugo_Gonzalez_Labrador I think we can close this? I could add the resumable upload specified by msgraph, but we currently have tus.io implemented, so that would be an enhancement.

dragotin commented 2 years ago

Yes, the decision is that we use tus.

dragotin commented 2 years ago

Aaah, ok, maybe not my call ;-)