Open jeffhhk opened 3 years ago
Does this hold with binary/octet-stream as well?
Boto3 uses this as its default content type: https://github.com/boto/boto3/issues/548#issuecomment-296861474
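For reference, a minimal sketch of how the content type could be set explicitly on a boto3 upload instead of relying on the default. `upload_fileobj` and its `ExtraArgs={"ContentType": ...}` argument are part of boto3; the helper name `upload_with_content_type`, the bucket, and the key are hypothetical placeholders, not taken from this code base.

```python
import io


def upload_with_content_type(s3_client, fileobj, bucket, key,
                             content_type="application/octet-stream"):
    """Upload a file-like object to S3 with an explicit Content-Type.

    `s3_client` is expected to be a boto3 S3 client, i.e. boto3.client("s3").
    """
    s3_client.upload_fileobj(
        fileobj, bucket, key,
        ExtraArgs={"ContentType": content_type},
    )


# Usage (assumes boto3 is installed and AWS credentials are configured;
# the bucket and key names are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   upload_with_content_type(s3, io.BytesIO(b"payload"), "my-bucket", "uploads/data.bin")
```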
Yes, binary/octet-stream is synonymous with application/octet-stream.
I played with this code quite a bit yesterday. My original suggestion was about the browser to server connection bandwidth. However, in your code base, the far bigger issue is the server to S3 connection bandwidth.
It is admirable of you to try to stream straight to S3 without making any temporary files at all. However, in practice fast uploads to S3 can be difficult, especially trying to stream end to end.
I'm working on concrete recommendations for the tighter bottleneck.
Thanks. For the direct URL upload, I've added an application/octet header for smart_open when opening the source URL.
Hi @jeffhhk, we may need to be mindful of the global architecture here: remember that the production server runs on EC2, which ought to have a privileged connection to S3. Running the code in development mode on one's own machine does, however, mean that the server-to-S3 connection will be much slower than the local file directory -> browser -> (local) server connection. The opposite is likely true of the production system running on EC2: the server-to-S3 connection will outpace the user machine -> (remote) server connection. In my experience, from years ago, the AWS internal network between the various AWS components is blazingly fast. The mainstream internet, not so much.
The primary bottleneck in community use is more likely to be the segment of the upload flow from the user's machine to the EC2 server, and less likely the segment from the EC2 server to the S3 bucket. Injecting a temporary file on the server to catch the upload is not likely to help a great deal.
That said, in a fashion, the existing code base (or at least some of its third-party libraries) is already very much playing the "local temp file" card (or a facsimile thereof: in-memory streaming using BytesIO file objects).
Kenneth and I were taking a closer look at this yesterday, wondering how to improve upon the situation. The jury is still out. Our review, however, mainly focused on improving the efficiency of the post-processing (tar.gz archiving) step.
One last point I'd make at this time is that we should remember our degrees of freedom: we are developing KGE as an AWS application. Therefore, maybe we need to ponder what kind of EC2 instance configuration will get us where we want to go. For example, there are memory-optimized instances that also have NVMe SSD drives attached. Maybe we could leverage that in some manner to enhance performance for large files.
Disregard this comment of mine: "the far bigger issue is the server to S3 connection bandwidth." I made a mistake when I set up some of the measurements I did.
The original suggestion to use application/octet-stream (aka binary/octet-stream) still stands. However, my estimate of its benefit has decreased. Having experimented with the code, I estimate it would improve performance in the range of 25%. (Before experimenting, I was guessing that the potential improvement was much larger.)
When implementing streaming HTTP uploads of known size, the fastest performance I have achieved has been with the header "Content-Type: application/octet-stream", because it relieves the client and the server of the responsibility to encode and decode the stream.
application/octet-stream POSTs can be simulated using curl.
Here's the curl flag to do a large POST with Content-Type: application/octet-stream:
Here's a whole example:
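The original command is not preserved in this thread. As a hedged sketch only, a typical curl invocation for this kind of request combines `--data-binary @<file>` (which sends the file bytes unmodified and implies POST) with an explicit `Content-Type` header. The snippet below assembles such a command in Python; the helper name, file name, and URL are hypothetical placeholders.

```python
import shlex


def curl_octet_stream_command(file_path, url):
    """Build a curl command that POSTs a file as application/octet-stream."""
    return [
        "curl",
        "--header", "Content-Type: application/octet-stream",
        # --data-binary sends the bytes as-is (no newline stripping) and implies POST.
        "--data-binary", f"@{file_path}",
        url,
    ]


# Print a copy-pasteable shell command; the file name and URL are placeholders.
print(shlex.join(curl_octet_stream_command("big.bin", "http://localhost:8080/upload")))
```

The list form can also be passed directly to `subprocess.run` if curl is installed.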
With context, coded against a Python server:
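The original server code is not preserved here either. The following is a minimal, self-contained sketch (standard library only) of what such a setup might look like: a server that reads a raw application/octet-stream body in chunks using the declared Content-Length, and a client that posts bytes of known size. The handler, the `/upload` path, and the helper names are assumptions made for illustration, not the code from this project.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

CHUNK = 64 * 1024  # read the request body in 64 KiB pieces


class OctetStreamHandler(BaseHTTPRequestHandler):
    """Accepts a raw application/octet-stream POST and counts its bytes."""

    def do_POST(self):
        if self.headers.get("Content-Type") != "application/octet-stream":
            self.send_error(415, "expected application/octet-stream")
            return
        remaining = int(self.headers.get("Content-Length", 0))
        received = 0
        while remaining > 0:
            chunk = self.rfile.read(min(CHUNK, remaining))
            if not chunk:
                break  # client closed the connection early
            # A real server would forward each chunk onward (e.g. to S3) here.
            received += len(chunk)
            remaining -= len(chunk)
        body = str(received).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def post_octet_stream(host, port, payload, path="/upload"):
    """POST raw bytes with an explicit size; return the server's byte count."""
    conn = http.client.HTTPConnection(host, port)
    conn.request("POST", path, body=payload, headers={
        "Content-Type": "application/octet-stream",
        "Content-Length": str(len(payload)),
    })
    response = conn.getresponse()
    result = int(response.read())
    conn.close()
    return result


def demo(size=1_000_000):
    """Start a throwaway local server, upload `size` zero bytes, return the count."""
    server = HTTPServer(("127.0.0.1", 0), OctetStreamHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return post_octet_stream("127.0.0.1", server.server_port, b"\0" * size)
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` performs one round trip against the local server and returns the number of bytes the server read from the request body.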
XMLHttpRequest in modern browsers has the ability to send application/octet-stream requests. Here is an example: