bottlepy / bottle

bottle.py is a fast and simple micro-framework for python web-applications.
http://bottlepy.org/
MIT License

Allow for zero-copy file uploads. #288

Open jtolio opened 12 years ago

jtolio commented 12 years ago

Right now, Bottle directly calls cgi.FieldStorage to parse incoming data.

cgi.FieldStorage has a 'make_file' method defined with the following comment:

def make_file(self, binary=None):
    """Overridable: return a readable & writable file.

    The file will be used as follows:
    - data is written to it
    - seek(0)
    - data is read from it

    The 'binary' argument is unused -- the file is always opened
    in binary mode.

    This version opens a temporary file for reading and writing,
    and immediately deletes (unlinks) it.  The trick (on Unix!) is
    that the file can still be used, but it can't be opened by
    another process, and it will automatically be deleted when it
    is closed or when the current process terminates.

    If you want a more permanent file, you derive a class which
    overrides this method.  If you want a visible temporary file
    that is nevertheless automatically deleted when the script
    terminates, try defining a __del__ method in a derived class
    which unlinks the temporary files you have created.

    """

It would be nice if Bottle made it easier to provide your own FieldStorage class implementation, or perhaps specifically allowed for optionally providing a make_file implementation.

My use case is that I would like to have Bottle upload file data directly to its final storage location, instead of being uploaded to a tempfile and then being copied again to its final destination.

jtolio commented 12 years ago

see https://github.com/defnull/bottle/pull/289

Nbelles commented 2 years ago

I would love to see this implemented if possible. There is still interest in this use case. I tried implementing the pull request on my local machine but there have been so many changes since this was suggested that it was returning some errors. If it is possible to have this option implemented, that would be awesome. I am happy to help with any testing!

knro commented 2 years ago

This is 10 years old, but not yet implemented? I have an issue where very large file uploads (2GB+) fail on a Raspberry Pi. I also want to save directly to the final destination and skip the temporary file altogether.

defnull commented 2 years ago

If large files fail, then that's probably because the default temp directory is a size-constrained tmpfs. As a quick fix, you may simply use a temp directory on the target file system instead. See https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir

Nbelles commented 2 years ago

I haven't had any issues with large file uploads failing at all, but I have been running it on a fairly beefy computer with lots of RAM and available swap and storage space. I too want to save a file directly to the final destination, skipping the temporary file altogether. Not only is transferring the file to its destination undesirable, but the time it takes to transfer from the place in memory to the final destination is the part that I really would like to see resolved. With large files, it now takes the time for the file to transfer across the network plus the time for the file to be copied from memory to disk before the file can be handled/processed and a response can be sent to the user. I tried switching my save destination to a solid state drive and it still takes a really long time for large files.

Nbelles commented 2 years ago

@defnull Could you describe in greater detail what you mean by "use a temp directory on the target file system instead"? Maybe with a short code example of how it would work? I am not currently seeing how this would resolve the issue with large files.

From my testing with bottle, when a post is sent to a bottle app, the form data is not sent until the line where the bottle.request variable is referenced in the route function. Then the bottle server allows the client to start sending the file so that it can parse the form data and get the full bottle.request variable. This also triggers some slower file transfer or something in the background that doesn't allow the request variable to be accessible until it is done moving the file. This is the thing that is currently holding back large files from being handled quickly. Once the file is accessible through the variable, using the f.save() function quickly moves the file from its previous location to the desired location and the rest of the response can be processed.

I will be doing some more testing soon to see if I can quantify exactly what causes the slowdown somewhere in the middle, but in the meantime, it would be nice to see some more support for handling large file uploads, especially with zero-copy capabilities.

Nbelles commented 2 years ago

I put together a short little example script to help demonstrate the part that takes a long time. The line with the for loop is where the time delay is happening for me. After the client has finished sending the file to the server, there is still a delay of a couple of seconds before the next line "Form data transfer complete" appears on the screen. From what I can tell from my very limited viewpoint of CPU/memory/disk activity, it seems as though the script is writing something to disk but in a very slow manner (not nearly the speeds it was able to handle when receiving the file over the network).

import bottle
import time

host, port = "127.0.0.1", 8000

app = bottle.Bottle()

@app.route("/upload", method="GET")
def serve_upload():
    return """
    <form enctype="multipart/form-data" action="/upload" method="POST">
        Choose a file to upload: <input name="file" type="file" multiple/><br />
        <input type="submit" value="Send" />
    </form>
    """

@app.route("/upload", method="POST")
def receive_upload():
    print("Endpoint hit, no form data transmitted yet.")

    time.sleep(10)

    print("Starting form data transfer from client to server...")
    for f in bottle.request.files.getall("file"):
        print("Form data transfer complete.")

        time.sleep(10)

        print("Saving file to path...")

        f.save("/tmp/server/" + f.filename)

        print("File now available in filesystem at provided path.")

app.run(host=host, port=port)

defnull commented 2 years ago

We are getting a little off-topic or into too much detail here, but I'll try to explain anyway. For science \o/

> From my testing with bottle, when a post is sent to a bottle app, the form data is not sent until the line where the bottle.request variable is referenced in the route function. Then the bottle server allows the client to start sending the file so that it can parse the form data and get the full bottle.request variable. This also triggers some slower file transfer or something in the background that doesn't allow the request variable to be accessible until it is done moving the file. This is the thing that is currently holding back large files from being handled quickly. Once the file is accessible through the variable, using the f.save() function quickly moves the file from its previous location to the desired location and the rest of the response can be processed.

So, first of all, bottle is a WSGI framework, which means it sits on top of an HTTP server (e.g. nginx) and a WSGI server (e.g. gunicorn) and does not parse HTTP itself. It receives the request as a dict full of already parsed headers and other metadata. Large requests with a request body are usually only parsed up until headers are available. The HTTP server then just stops reading from the socket and waits for the application (bottle) to request more data. Once the network buffers are full, the HTTP client won't be able to write more data to the socket and also has to wait.

Bottle exposes a file-like object that allows an application to read the request body as request.body. For small requests (<100k) this is a memory buffer. For large requests, this is just a wrapper around the actual socket. So, reading from request.body will free up space in the internal buffers of the HTTP server and the client can send more data.
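To illustrate the pull model described above, here is a minimal sketch in plain WSGI terms (PEP 3333). It is not Bottle's actual code: the helper name drain_body and the 64 KiB chunk size are made up for illustration, and an in-memory BytesIO stands in for the socket-backed wsgi.input stream that a real server would provide. The point is only that the body moves when the application asks for it, one chunk at a time.

```python
import io

CHUNK_SIZE = 64 * 1024  # assumed value, chosen only for illustration


def drain_body(environ, sink, chunk_size=CHUNK_SIZE):
    """Copy the request body from environ['wsgi.input'] into sink, chunk by chunk."""
    remaining = int(environ.get("CONTENT_LENGTH") or 0)
    stream = environ["wsgi.input"]
    copied = 0
    while remaining > 0:
        # Each read() frees buffer space on the server side, which in turn
        # lets the client push more data onto the connection.
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # stream ended before CONTENT_LENGTH bytes arrived
            break
        sink.write(chunk)
        copied += len(chunk)
        remaining -= len(chunk)
    return copied


# Stand-in for the environ dict a WSGI server would pass to the application.
payload = b"x" * 200_000
environ = {"CONTENT_LENGTH": str(len(payload)), "wsgi.input": io.BytesIO(payload)}
sink = io.BytesIO()
copied = drain_body(environ, sink)
```

Nothing is read until drain_body runs, which matches the observation that the client only gets to send the file once the handler touches the request data.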

Form data requests have a body that must also be parsed. Bottle uses the cgi.FieldStorage implementation from Python. This parses the entire body in one go, but has to put the large file uploads somewhere while parsing. To do that, it creates unnamed temporary files using tempfile.TemporaryFile. These files are placed in the system's temp directory by default. You can change that by setting environment variables (e.g. TMPDIR). Parsing multipart/form-data is annoyingly slow, because of how this old protocol is designed. It was meant for emails, not multi-GB file uploads. But that is all we have today.

After everything is parsed, your application usually calls FileUpload.save() to copy the uploaded file to its destination. This is a copy operation because cgi.FieldStorage creates unnamed files and Python does not expose OS functionality to copy or move unnamed files efficiently. This PR wants to allow users to switch tempfile.TemporaryFile for tempfile.NamedTemporaryFile in some cases so the temp file has a path and can be moved (instead of copied) to its destination. Nice idea, I only struggle with the implementation and the complexity it adds for a rather narrow use case. It also requires that your temp directory is on the same file system as your destination, or a file move is just copy+delete again. You cannot move files between file systems without copying.
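Whether a move can be a cheap rename can be estimated up front. The following is a rough sketch, assuming a POSIX-like system where os.stat() reports a meaningful device ID. The helper name same_filesystem is hypothetical, and a fresh mkdtemp() directory stands in for the real upload destination. Treat the result as a heuristic: some setups (bind mounts, for example) can still refuse a rename even when the device IDs match.

```python
import os
import tempfile


def same_filesystem(path_a, path_b):
    """Return True if both existing paths report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


temp_dir = tempfile.gettempdir()  # where unnamed temp files go by default
upload_dir = tempfile.mkdtemp()   # stand-in for the real destination directory

# True: a rename is likely possible. False: a move degrades to copy+delete.
can_rename = same_filesystem(temp_dir, upload_dir)
```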

tl;dr: If the time required by request.files is your bottleneck, then this PR won't help you. Parsing multipart requests is slow, we have to live with that. If you want to make file.save(...) faster, then maybe this PR helps a bit. The last copy operation could be saved, but is not the bottleneck in most applications, so I'm unsure if it's worth it.

Nbelles commented 2 years ago

@defnull Thank you for explaining in such great detail.

I think if we are unable to decrease the time taken to parse the multipart requests, then it would be nice to decrease the time required to move the files to their final resting place.

To make that happen, it sounds like the ideal scenario would be using the TMPDIR method to set the directory where the file will be written, combined with a tempfile.NamedTemporaryFile such that the file can easily be renamed to its final name without needing to be moved again. This would ensure that you are writing to the destination file system when parsing the file and then instead of using file.save(...) the designer would just have to rename the file instead of moving the file anywhere. Does this make sense?

Is there a place where I can find more info about using the TMPDIR environment variable method to change the location where the file is placed? I hadn't heard of doing something like this before you mentioned it. Thanks!

defnull commented 2 years ago

> Is there a place where I can find more info about using the TMPDIR environment variable method to change the location where the file is placed? I hadn't heard of doing something like this before you mentioned it. Thanks!

https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir
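As a rough sketch of what that could look like from inside a script: the tempfile module reads TMPDIR (then TEMP and TMP) the first time a temp directory is needed and caches the result in the module-level variable tempfile.tempdir, which can also be assigned directly. The helper name use_temp_dir and the ".tmp" subdirectory are assumptions made for this example, and a fresh mkdtemp() directory stands in for the real upload root.

```python
import os
import tempfile


def use_temp_dir(path):
    """Make `path` the default directory for new temporary files."""
    os.makedirs(path, exist_ok=True)
    # tempfile caches its choice in the module-level `tempdir` variable, so
    # assigning it directly also works after gettempdir() was already called.
    tempfile.tempdir = path
    return tempfile.gettempdir()


# Hypothetical layout: uploads end up below `target_root`, so the temp
# directory is placed right next to them on the same file system.
target_root = tempfile.mkdtemp(prefix="uploads-")
active = use_temp_dir(os.path.join(target_root, ".tmp"))

# Unnamed temp files created from now on land in the new directory.
with tempfile.TemporaryFile() as spool:
    spool.write(b"payload")
```

Setting the environment variable instead has to happen before the first temp file is created, for example when starting the process from the shell.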

The problem with tempfile.NamedTemporaryFile is that these are not removed automatically. Bottle should be aware of that and delete them after the request if they were not moved by the application.
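To make the trade-off concrete, here is a minimal sketch of the rename approach outside of Bottle. It assumes the temp file is created with delete=False in the destination's own directory, so the final step is a rename on the same file system. The function name spool_then_rename is hypothetical, and the cleanup branch is the part an application (or Bottle) would have to take care of itself.

```python
import os
import tempfile


def spool_then_rename(chunks, destination):
    """Write chunks to a named temp file next to `destination`, then rename it."""
    dest_dir = os.path.dirname(os.path.abspath(destination))
    # delete=False: the file must outlive close() so that it can be renamed.
    tmp = tempfile.NamedTemporaryFile(dir=dest_dir, delete=False)
    try:
        with tmp:  # closes the file on exit, but does not delete it
            for chunk in chunks:
                tmp.write(chunk)
        # Same directory, hence same file system: a rename, not a copy.
        os.replace(tmp.name, destination)
    except BaseException:
        # Nothing else will clean this up, so remove the leftover here.
        if os.path.exists(tmp.name):
            os.unlink(tmp.name)
        raise
    return destination


workdir = tempfile.mkdtemp()  # stand-in for the upload destination directory
final_path = spool_then_rename([b"hello ", b"world"], os.path.join(workdir, "upload.bin"))
with open(final_path, "rb") as fh:
    content = fh.read()
leftovers = [name for name in os.listdir(workdir) if name != "upload.bin"]
```

Note that os.replace() silently overwrites an existing destination, so a real handler would need its own policy for name collisions.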

Nbelles commented 2 years ago

When bottle parses the form data and generates a temporary file using tempfile.TemporaryFile(), the temporary file is generated in the directory that is returned by tempfile.gettempdir(). You are suggesting that I change the TMPDIR environment variable (which updates tempfile.gettempdir()) so that the file always ends up being written to the desired file system? I'm not super familiar with environment variables in python, is this something that can be set within the python script?

Nbelles commented 2 years ago

I see how files not being removed automatically could be a problem. I might be alone in this but I am willing to accept the risk of personally having to ensure the deletion of any files coming in if it means I can get the performance gain of not waiting for the file to be copied again. This is what I think the original poster was hoping for as well.