cs3org / reva

WebDAV/gRPC/HTTP high performance server to link high level clients to storage backends
https://reva.link
Apache License 2.0

Can you elaborate on out-of-band file transfer? #45

Closed: butonic closed this issue 5 years ago

butonic commented 5 years ago

I am struggling to rebase our changes on top of the review branch. In the review branch you are planning to move file upload and download out of the CS3 APIs. Can you elaborate on how you plan to do the actual file transfer?

We will need to send the file stream from the ocdavsvc service to the actual storage provider. Do you want to open another HTTP/2 connection for that, or use the existing one and multiplex binary chunks over it?

AFAIR we will always have the ocdavsvc or another gateway component in front of the actual storage provider ... so what is your vision on this?

butonic commented 5 years ago

Instead of initiating a file upload or download, I think it makes sense to allow passing in a reference, a chunk of a file, or a list of small files, similar to the Opaque property you introduced in other messages. A reference would work like the proposed initiate response, a chunk could be used for directly uploading chunks, and a list of files could be used to aggregate small files into one request (or send a single smaller file).
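
As a rough illustration of that idea, here is a hypothetical Go sketch of the three payload kinds, mirroring a protobuf "oneof". The names are placeholders only and are not part of the CS3 APIs.

```go
// Hypothetical sketch of the proposed upload request variants; these types
// are illustrative only and are not part of the CS3 APIs.
package transfer

// UploadPayload is implemented by each of the three proposed payload kinds,
// mirroring a protobuf "oneof".
type UploadPayload interface{ isUploadPayload() }

// Reference behaves like the proposed initiate response: the caller is
// handed a location to transfer the data out of band.
type Reference struct {
	URL   string // where to send the bytes
	Token string // short-lived authorization for the transfer
}

// Chunk carries one piece of a larger file directly in the request.
type Chunk struct {
	Path   string
	Offset int64
	Data   []byte
}

// SmallFiles aggregates several small files (or a single small file)
// into one request.
type SmallFiles struct {
	Files map[string][]byte // path -> content
}

func (Reference) isUploadPayload()  {}
func (Chunk) isUploadPayload()      {}
func (SmallFiles) isUploadPayload() {}
```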

moscicki commented 5 years ago

This requires some discussion indeed.

Let me explain the basic idea first: to start with, we could stay with data upload and download via gRPC for simplicity. With the future in mind, though, we know that gRPC is not great for transferring large files as (streamed) repeated messages; essentially the maximum reasonable payload is the size of a single gRPC message, which IIRC is 3 MB. Hence, for all data-intensive transfer workflows the idea is to redirect to an HTTP(S) endpoint. As a result of this call you'd get a URL with some constrained validity (e.g. you need to start the transfer within the next N seconds).
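
A minimal client-side sketch of that redirect flow, assuming a hypothetical initiate call that hands back a short-lived upload URL; the call name and fields are placeholders, not actual CS3 API definitions.

```go
// Minimal client-side sketch of the redirect flow: ask the gRPC service where
// to upload, then push the bytes to that HTTP(S) endpoint before the URL
// expires. initiateUpload stands in for whatever RPC the protocol ends up
// exposing; it is not a real CS3 call.
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"time"
)

// uploadTarget is what the hypothetical initiate call hands back.
type uploadTarget struct {
	URL      string        // constrained-validity HTTP(S) endpoint
	ValidFor time.Duration // the transfer must start within this window
}

// initiateUpload is a placeholder for the gRPC call; a real client would go
// through the generated stub instead.
func initiateUpload(ctx context.Context, path string) (uploadTarget, error) {
	return uploadTarget{URL: "https://example.org/data/abc", ValidFor: 30 * time.Second}, nil
}

func main() {
	const path = "file.bin"
	ctx := context.Background()

	target, err := initiateUpload(ctx, path)
	if err != nil {
		panic(err)
	}

	// The URL is only valid for a limited time, so bound the transfer by it.
	ctx, cancel := context.WithTimeout(ctx, target.ValidFor)
	defer cancel()

	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// The actual data transfer happens out of band over HTTP, not over gRPC.
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, target.URL, f)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("upload status:", resp.Status)
}
```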

I do see an opportunity to inline small files directly into the gRPC payload as an optimization. The question is how small is small; it should not be controversial to say that it makes sense for payloads in the KB range.

For things in the MB range we already enter the realm of higher-level chunking (as we know it in ownCloud).

Chunking itself is interesting: I think with the current usage by the sync client we could say that it mainly serves to provide resumable uploads, right? It looks to me that no standard HTTP mechanism for resumable uploads exists (https://stackoverflow.com/questions/20969331/standard-method-for-http-partial-upload-resume-upload#20978266). Another usage is parallel upload: is this really used, and does it make a difference? If yes, I would be in favour of providing a different gRPC call to cater for that use case (possibly the same API for both parallel and resumable uploads).
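
For illustration, here is one possible shape of a resumable upload in Go: query how many bytes the server already holds, then continue from that offset. The offset query and the `Upload-Offset` header are hypothetical placeholders, not an existing reva or ownCloud API, although this is essentially the pattern the tus resumable-upload protocol uses.

```go
// Sketch of one way to get resumable behaviour without relying on a
// (non-existent) standard HTTP mechanism: ask the server how many bytes it
// already has, then continue from that offset. The HEAD call and the
// Upload-Offset header are hypothetical placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strconv"
)

func resumeUpload(uploadURL, localPath string) error {
	// Ask the server for the current offset of this upload (hypothetical call).
	resp, err := http.Head(uploadURL)
	if err != nil {
		return err
	}
	resp.Body.Close()
	offset, _ := strconv.ParseInt(resp.Header.Get("Upload-Offset"), 10, 64)

	f, err := os.Open(localPath)
	if err != nil {
		return err
	}
	defer f.Close()

	// Skip the bytes the server already stored and send the rest.
	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPatch, uploadURL, f)
	if err != nil {
		return err
	}
	req.Header.Set("Upload-Offset", strconv.FormatInt(offset, 10))
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer res.Body.Close()
	fmt.Println("resume status:", res.Status)
	return nil
}

func main() {
	if err := resumeUpload("https://example.org/uploads/abc", "file.bin"); err != nil {
		panic(err)
	}
}
```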

The bulk operations (bundling many independent files) would definitely deserve a different approach, because these operations have a complex return status (some but not all files may fail to upload for various reasons). I think this requires careful consideration, but it can perhaps be treated as a second-order optimization with a different (set of) calls?
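
A small hypothetical sketch of why such bulk calls need per-file return statuses rather than a single success flag; the names are illustrative only and not part of the CS3 APIs.

```go
// Illustrative sketch: each file in a bundled upload can fail independently,
// so the result has to carry one status per file.
package transfer

// FileStatus reports the outcome for one file in a bulk request.
type FileStatus struct {
	Path string
	OK   bool
	Err  string // empty when OK
}

// BulkUploadResult aggregates per-file outcomes; callers must inspect every
// entry rather than a single success flag.
type BulkUploadResult struct {
	Statuses []FileStatus
}

// Failed returns the subset of files that did not make it.
func (r BulkUploadResult) Failed() []FileStatus {
	var out []FileStatus
	for _, s := range r.Statuses {
		if !s.OK {
			out = append(out, s)
		}
	}
	return out
}
```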

butonic commented 5 years ago

You had me worried there for a second, also with regard to listing large directories (100k files). But the default gRPC message size limit of 4 MB does not affect streams, only individual messages. Someone actually tested this (although on a loopback device): https://ops.tips/blog/sending-files-via-grpc/

His finding is that 1k seems to be a good chunk size, but plain HTTP/2 seems to be about twice as fast, albeit with a little more variance in latency.

Another, but older (2016) related post is https://andrewjesaitis.com/2016/08/25/streaming-comparison/
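
To make the chunked-streaming idea concrete, here is a hedged Go sketch of the client side: the file is cut into small messages (1 KiB, following the blog post's finding) and sent over a client-side stream. The `Chunk` message and the sender interface stand in for whatever the generated protobuf code would provide; this is not the actual reva API.

```go
// Sketch of a file travelling as a client-side gRPC stream of small messages:
// the 4 MB default limit applies to each message, not to the stream as a
// whole, so the file is cut into fixed-size chunks.
package main

import (
	"fmt"
	"io"
	"os"
)

// Chunk mirrors a hypothetical protobuf message carrying one piece of the file.
type Chunk struct {
	Data []byte
}

// chunkSender is the subset of a generated gRPC client-stream we need here.
type chunkSender interface {
	Send(*Chunk) error
}

// streamFile reads the file and sends it as a sequence of small messages.
func streamFile(path string, stream chunkSender, chunkSize int) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, chunkSize)
	for {
		n, err := f.Read(buf)
		if n > 0 {
			if sendErr := stream.Send(&Chunk{Data: buf[:n]}); sendErr != nil {
				return sendErr
			}
		}
		if err == io.EOF {
			return nil // whole file sent; caller closes the stream and reads the reply
		}
		if err != nil {
			return err
		}
	}
}

// printSender is a toy stand-in so the sketch runs without a gRPC connection.
type printSender struct{ total int }

func (p *printSender) Send(c *Chunk) error { p.total += len(c.Data); return nil }

func main() {
	s := &printSender{}
	if err := streamFile("file.bin", s, 1024); err != nil {
		panic(err)
	}
	fmt.Println("bytes sent:", s.total)
}
```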

Thinking ...

butonic commented 5 years ago

Ok, the best option for now would be to at least keep the gRPC-streaming-based file transfer, or to provide an example of how you would do it. While gRPC streaming might be suboptimal, it makes implementing the API a lot easier IMO. And I think we can iterate on it after we have sharing and search in the protocol. I would prefer a more feature-complete protocol rather than prematurely optimizing for performance.
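
For completeness, a matching server-side sketch of how a storage provider could drain such a stream into its backend. Again, the types are stand-ins for generated protobuf code, not the actual reva API.

```go
// Complementary server-side sketch: the handler drains the incoming stream
// and writes each chunk to the backing storage.
package storageprovider

import "io"

// Chunk mirrors a hypothetical protobuf message carrying one piece of the file.
type Chunk struct {
	Data []byte
}

// chunkReceiver is the subset of a generated gRPC server-stream we need here.
type chunkReceiver interface {
	Recv() (*Chunk, error) // returns io.EOF when the client closes the stream
}

// receiveFile drains the stream into dst and reports how many bytes arrived.
func receiveFile(stream chunkReceiver, dst io.Writer) (int64, error) {
	var written int64
	for {
		c, err := stream.Recv()
		if err == io.EOF {
			return written, nil // client finished sending
		}
		if err != nil {
			return written, err
		}
		n, err := dst.Write(c.Data)
		written += int64(n)
		if err != nil {
			return written, err
		}
	}
}
```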

butonic commented 5 years ago

@labkode I was under the impression that clients should be able to talk directly to the services using the CS3 APIs? Auth aside, shouldn't we then have some API call to stream bytes between services?

labkode commented 5 years ago

@butonic @moscicki the out-of-band file transfer mechanism is implemented in the review branch. It also includes checksum negotiation so the upload can be protected with the checksums offered by the server. I'll update the docs so it is easy for everyone to understand and test it.
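
A hedged sketch of what checksum protection of an out-of-band upload could look like from the client side: hash the file locally and attach the digest to the HTTP upload so the storage provider can verify what it received. The `X-Checksum` header name and the use of SHA-256 are placeholders, not necessarily what the review branch implements.

```go
// Sketch of protecting an out-of-band upload with a checksum: hash the file,
// then send the bytes together with the digest so the server can verify them.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
)

func uploadWithChecksum(uploadURL, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// First pass: compute the digest the server asked for (SHA-256 here).
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	sum := hex.EncodeToString(h.Sum(nil))

	// Second pass: rewind and send the bytes together with the digest.
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPut, uploadURL, f)
	if err != nil {
		return err
	}
	req.Header.Set("X-Checksum", "sha256:"+sum) // placeholder header name
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println("upload status:", resp.Status)
	return nil
}

func main() {
	if err := uploadWithChecksum("https://example.org/data/abc", "file.bin"); err != nil {
		panic(err)
	}
}
```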

butonic commented 5 years ago

examples helped me, thx