NCATSTranslator / Knowledge_Graph_Exchange_Registry

The Biomedical Data Translator Consortium site for development of Knowledge Graph Exchange Standards and Registry
MIT License

Efficient browser upload #53

Open jeffhhk opened 3 years ago

jeffhhk commented 3 years ago

When implementing streaming HTTP uploads of known size, the fastest performance I have achieved has been with the header "Content-Type: application/octet-stream", because it relieves the client and the server of any responsibility to encode or decode the stream.

application/octet-stream POSTs can be simulated using curl.

Here's the curl flag for doing a large POST with the Content-Type: application/octet-stream type:

  man curl:
   --data-binary <data>
          (HTTP) This posts data exactly as specified with no extra
          processing whatsoever.

          If you start the data with the letter @, the rest should be a
          filename. Data is posted in a similar manner as -d, --data does,
          except that newlines and carriage returns are preserved and
          conversions are never done.

Here's a whole example:

  curl -v -H "filename: $filename" \
          -H "Content-Type: application/octet-stream" \
          --data-binary @$filename -X POST $url

For a fuller example, coded against a Python server, see:

https://github.com/hugapi/hug/issues/474
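Independently of that hug example, the server side of such an upload can be sketched with only the Python standard library: the handler reads exactly Content-Length raw bytes off the socket, with no decoding or form parsing. (The handler class and payload below are illustrative, not taken from the KGE code base.)

```python
import http.server
import threading
import urllib.request

class OctetStreamHandler(http.server.BaseHTTPRequestHandler):
    """Accept a raw application/octet-stream POST body."""

    def do_POST(self):
        # Read exactly Content-Length raw bytes: no decoding, no form parsing.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # echo the bytes back, just for demonstration

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

# Bind to an ephemeral port and serve in a background thread.
server = http.server.HTTPServer(("127.0.0.1", 0), OctetStreamHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the stdlib equivalent of the curl command above.
payload = b"\x00\x01binary payload\xff"
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/",
    data=payload,
    headers={"Content-Type": "application/octet-stream"},
)
with urllib.request.urlopen(req) as resp:
    echoed = resp.read()
server.shutdown()
```

Because neither side inspects or transforms the body, the transfer cost is essentially just the socket copy.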

XMLHttpRequest in modern browsers has the ability to send application/octet-stream requests. Here is an example:

https://gist.github.com/inemtsev/a45daa46fbdcdd6f80a65eed693a0689

kennethbruskiewicz commented 3 years ago

Does this hold with binary/octet-stream as well?

Boto3 uses this as its default content type: https://github.com/boto/boto3/issues/548#issuecomment-296861474

jeffhhk commented 3 years ago

Yes, binary/octet-stream is treated as synonymous with application/octet-stream.

jeffhhk commented 3 years ago

I played with this code quite a bit yesterday. My original suggestion was about the browser to server connection bandwidth. However, in your code base, the far bigger issue is the server to S3 connection bandwidth.

It is admirable of you to try to stream straight to S3 without making any temporary files at all. However, in practice fast uploads to S3 can be difficult, especially trying to stream end to end.

I'm working on concrete recommendations for the tighter bottleneck.

kennethbruskiewicz commented 3 years ago

Thanks. For the direct URL upload, I've added an application/octet-stream header for smart_open when opening the source URL.

RichardBruskiewich commented 3 years ago

> I played with this code quite a bit yesterday. My original suggestion was about the browser to server connection bandwidth. However, in your code base, the far bigger issue is the server to S3 connection bandwidth.
>
> It is admirable of you to try to stream straight to S3 without making any temporary files at all. However, in practice fast uploads to S3 can be difficult, especially trying to stream end to end.
>
> I'm working on concrete recommendations for the tighter bottleneck.

Hi @jeffhhk, we may need to be mindful of the global architecture here: remember that the production server runs on EC2, which ought to have a privileged connection to S3. Running the code in development mode on one's own machine does, however, mean that the server-S3 connection will be much slower relative to the local file directory -> browser -> (local) server connection. The opposite is likely true of the production system running on EC2: there, the server-S3 connection will trump the user machine -> (remote) server connection. From my experience years ago, the AWS internal network between the various AWS components is blazingly fast. The mainstream internet, not so much.

The primary bottleneck in community use is therefore more likely to be the user machine to EC2 server segment of the upload flow, and less so the EC2 server to S3 bucket segment. Injecting a temporary file on the server to catch the upload is not likely to help a great deal.

That said, in a fashion, the existing code base (or at least some of its third-party libraries) is very much already playing the "local temp file" card (or a facsimile thereof: in-memory streaming using BytesIO file objects).
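That BytesIO facsimile can be sketched as follows (the helper name and chunk size here are illustrative, not the KGE code base's actual code): the incoming stream is copied chunk by chunk into an in-memory buffer, then rewound so a downstream uploader can consume it like a file.

```python
import io

CHUNK = 64 * 1024  # read granularity; an illustrative choice

def buffer_stream(source, chunk_size=CHUNK):
    """Copy an incoming byte stream into an in-memory BytesIO buffer,
    the 'local temp file' facsimile described above."""
    buf = io.BytesIO()
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        buf.write(chunk)
    buf.seek(0)  # rewind so a downstream uploader reads from the start
    return buf

incoming = io.BytesIO(b"x" * 200_000)  # stand-in for a request body stream
staged = buffer_stream(incoming)
```

The trade-off is straightforward: this avoids disk I/O but holds the whole file in memory, which is exactly why instance sizing (below) matters for large uploads.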

Kenneth and I were taking a closer look at this yesterday, wondering how to improve upon the situation. Jury still out. Our review, however, mainly focused on improving the efficiency of the post-processing (tar.gz archiving) step.
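For that archiving step, one way to keep the tar.gz work streaming rather than disk-bound is to write the archive through an in-memory file object. A hedged sketch using the standard library's tarfile (the helper and file names are hypothetical, not the KGE code base's actual archiving code):

```python
import io
import tarfile

def archive_to_stream(files):
    """Write a tar.gz archive of {name: payload} pairs to an in-memory
    stream. In a real upload flow, the output file object could instead
    wrap an open S3 multipart-upload writer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            # addfile() streams the member body from any file-like object.
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)  # rewind so the finished archive can be read back or uploaded
    return buf

archive = archive_to_stream({"nodes.tsv": b"id\tname\n", "edges.tsv": b"s\tp\to\n"})
```

Since tarfile accepts any file-like object via fileobj, the same code works whether the destination is memory, local disk, or a streaming upload wrapper.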

One last point I'd make at this time is that we should remember our degrees of freedom: we are developing KGE as an AWS application. Therefore, maybe we need to ponder what kind of EC2 instance configuration will get us where we want to go. For example, there are memory-optimized instances that also have NVMe SSD drives attached. Maybe we could leverage that in some manner to enhance performance for large files.

jeffhhk commented 3 years ago

Disregard this comment of mine: "the far bigger issue is the server to S3 connection bandwidth." I made a mistake when I set up some of my measurements.

The original suggestion to use application/octet-stream (aka binary/octet-stream) still stands. However, my estimate of its benefit has shrunk: having experimented with the code, I estimate it would improve performance in the range of 25%. (Before experimenting, I had guessed that the potential improvement was much larger.)