ImperialCollegeLondon / django-drf-filepond

A Django app providing a server implementation for the FilePond file upload library
BSD 3-Clause "New" or "Revised" License

high memory usage caused by chunks being combined in-mem #64

Closed trondhindenes closed 2 years ago

trondhindenes commented 2 years ago

On large file uploads we're seeing heavy memory usage, roughly 2x the size of the upload while it is being finalized. After a lot of digging, the culprit seems to be how django-drf-filepond pieces the chunks together using an in-memory BytesIO object. The "2x" usage seems to be related to copies of that BytesIO object.

We've created a PoC that uses Python tempfiles instead of BytesIO and moves files rather than copying them, in order to avoid slowing down the upload finalizer. We're prepared to spend time cleaning this up and creating a PR for it, but before doing so I was wondering whether we should attempt to make the choice of in-memory or tempfile configurable using a setting? Doing so would add some extra complexity, so it would take us some extra time, but it would be safer in terms of backwards compatibility, I guess. For smaller uploads I guess the in-memory method is slightly faster too. Maybe we could even make it so that small uploads are pieced together in memory while larger ones are built using tempfiles.

In any case, it would be good to get some feedback from the maintainers on this.
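
For illustration, a rough sketch of what a size-based switch might look like; the setting name and helper function here are hypothetical, not anything that currently exists in django-drf-filepond:

    import tempfile
    from io import BytesIO

    from django.conf import settings

    # Hypothetical setting (not part of django-drf-filepond): uploads larger
    # than this many bytes would be assembled in a temp file on disk.
    TEMPFILE_THRESHOLD = getattr(
        settings, 'DJANGO_DRF_FILEPOND_TEMPFILE_THRESHOLD', 32 * 1024 * 1024)

    def open_assembly_target(expected_size):
        """Return a writable file-like object to assemble chunks into."""
        if expected_size <= TEMPFILE_THRESHOLD:
            # Small uploads: keep the current in-memory behaviour.
            return BytesIO()
        # Large uploads: a named temp file that can later be moved into
        # place rather than copied.
        return tempfile.NamedTemporaryFile(delete=False)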

jcohen02 commented 2 years ago

Thanks for highlighting this issue, @trondhindenes, and taking the time to investigate it. When django-drf-filepond was originally created, I have to say that the handling of very large file uploads was not something that was really considered in detail. However, I now see from this and a couple of other recent issues that this is an important use case, so it would be great to resolve this.

I wonder if the "2x" memory usage that you're seeing is coming from the fact that the chunks are loaded into the BytesIO object and then this is used to create an InMemoryUploadedFile object in uploaders.py (L345). I haven't had a chance to investigate yet but I wonder if the creation of the InMemoryUploadedFile object is taking a copy of the data into a new object rather than just wrapping the provided BytesIO object? If so, maybe this is something that can be worked around.
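
For reference, the pattern under discussion is roughly the following (a simplified sketch rather than the exact code in uploaders.py; chunk_paths and upload_name are placeholder names):

    from io import BytesIO

    from django.core.files.uploadedfile import InMemoryUploadedFile

    # Every chunk is read back from disk and appended to an in-memory buffer...
    file_data = BytesIO()
    for chunk_path in chunk_paths:
        with open(chunk_path, 'rb') as cf:
            file_data.write(cf.read())

    # ...and the buffer is then wrapped in an InMemoryUploadedFile that is
    # passed as the FileField content for the TemporaryUpload object. If any
    # step here copies the buffer rather than just wrapping/consuming it,
    # peak memory usage roughly doubles.
    upload_file = InMemoryUploadedFile(
        file_data, None, upload_name, 'application/octet-stream',
        file_data.getbuffer().nbytes, None)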

More generally, this approach was originally taken so that a suitable file object can be passed in the creation of the TemporaryUpload object since this then uses the underlying file storage object to save the file to the location configured in the settings. We'd need to ensure that any fix made here will retain this general workflow since I'm hoping to add support for using remote storage (e.g. via django-storages) and this will rely on taking the data passed when creating the TemporaryUpload object and sending it to the temporary storage location configured in the settings.

For chunked uploads, as an alternative to the use of BytesIO and the creation of an InMemoryUploadedFile object, I wonder if we could just use the equivalent of cat or a similar system command to join the chunks together on disk into a single large file and then wrap this with a TemporaryUploadedFile object, and use this to pass to TemporaryUpload? Is this effectively what your tempfile-based poc does?
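
Something along these lines, perhaps (a rough sketch only; assembled_path, chunk_paths and upload_name are placeholders, not code from the project):

    import shutil

    from django.core.files import File

    # Stream the chunk files on disk into one assembled file, without
    # ever holding the whole upload in memory.
    with open(assembled_path, 'wb') as assembled:
        for chunk_path in chunk_paths:
            with open(chunk_path, 'rb') as cf:
                shutil.copyfileobj(cf, assembled)

    # Wrap the assembled on-disk file so it can be passed as the FileField
    # content when creating the TemporaryUpload record; Django then streams
    # it to the configured temporary storage location.
    assembled_file = File(open(assembled_path, 'rb'), name=upload_name)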

I think the general aim, where possible, was to avoid writing files to disk, for both performance and security reasons. However, for large uploads that need to be chunked, the primary aim was to make it possible to resume an upload that fails due to poor connectivity or other issues, rather than wasting time/bandwidth re-uploading data that had already been transferred successfully, since the files are ultimately going to be stored on disk anyway when the TemporaryUpload/StoredUpload objects are created.

If you'd like to open a PR to resolve this issue, please feel free to go ahead; all contributions are much appreciated. Equally, if you'd like me to make the updates, I can do this, but there will be a bit of a delay due to other work.

trondhindenes commented 2 years ago

Hi, thanks for responding. I think your assumption about the 2x memory usage is correct.

I'm not familiar enough with the codebase to understand every nuance and consequence, but I can tell you what we've done to get around it short-term:

Instead of loading a BytesIO object, we use a temporary file:

        # Load each of the file parts into a tempfile on disk
        # (os and tempfile are from the standard library; tuc is the chunked
        # upload tracking record and storage is the Django file storage)
        chunk_dir = os.path.join(storage.base_location, tuc.upload_dir)
        fd, temp_file_path = tempfile.mkstemp()
        with os.fdopen(fd, 'wb') as file_data:
            for i in range(1, tuc.last_chunk+1):
                chunk_file = os.path.join(chunk_dir, '%s_%s' % (tuc.file_id, i))
                if not os.path.exists(chunk_file):
                    raise FileNotFoundError('Chunk file not found for chunk <%s>'
                                            % (i))

                with open(chunk_file, 'rb') as cf:
                    file_data.write(cf.read())
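
(A small aside on the loop above: each `cf.read()` holds one full chunk in memory at a time, which is bounded by the chunk size; if even that were a concern, the chunk could be streamed in fixed-size blocks instead, e.g.:)

    import shutil

    with open(chunk_file, 'rb') as cf:
        # Copy the chunk into the temp file in 64 KiB blocks rather
        # than reading the whole chunk at once.
        shutil.copyfileobj(cf, file_data, length=64 * 1024)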

We don't create an InMemoryUploadedFile at all. We then move the constructed temp file to the "chunk folder":

        assembled_tempfile_pth = os.path.join(chunk_dir, tuc.upload_id)
        shutil.move(temp_file_path, assembled_tempfile_pth)

This required us to make a few changes to api.py and the model as well, so I wouldn't consider it clean enough for a pull request. I'm not sure how much of the behavior (use a tempfile or an in-memory object) you'd want to be controllable via Django settings, and it feels like that would affect the shape of the PR. If you'd rather do this yourself to get it right (in your own time) then we'd totally understand, of course, and we'd continue running our own "hacked" version for the time being.

jcohen02 commented 2 years ago

Hi @trondhindenes, apologies for the long delay in implementing a comprehensive fix for this issue but I've now finished implementing a new DrfFilepondChunkedUploadedFile object that replaces the use of InMemoryUploadedFile when handling chunked file uploads.

As you observed when originally reporting this issue, for a chunked upload the separate file chunks stored on disk were previously being reconstituted into a complete file in memory, and the resulting InMemoryUploadedFile object was then being passed as the content for the FileField in the TemporaryUpload object.

The new DrfFilepondChunkedUploadedFile object is a wrapper around the chunk files stored on disk and is passed directly to the FileField in the TemporaryUpload object. When Django handles the saving of the file by calling the read function on the provided file object, the DrfFilepondChunkedUploadedFile passes through the data from the relevant chunk files on disk. This ensures that we don't need to reconstitute the chunks into a separate file in memory first. The memory usage of the new approach will depend on the internal Django implementation, but it ensures that the original problem of memory usage of roughly 2x the file size no longer occurs.
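
For anyone reading along, the general pattern is something like the sketch below; this is only an illustration of a chunk-streaming file wrapper, not the actual DrfFilepondChunkedUploadedFile implementation, and ChunkStreamingFile is a made-up name:

    import os

    from django.core.files.base import File

    class ChunkStreamingFile(File):
        """Illustrative wrapper presenting a set of on-disk chunk files as a
        single read-only file, so the upload never has to be assembled in
        memory before being saved to its temporary storage location."""

        def __init__(self, chunk_paths, name):
            self.chunk_paths = chunk_paths
            self.name = name
            self._size = sum(os.path.getsize(p) for p in chunk_paths)

        @property
        def size(self):
            return self._size

        def multiple_chunks(self, chunk_size=None):
            return True

        def chunks(self, chunk_size=64 * 1024):
            # Django's storage backends iterate over chunks() when saving a
            # file, so memory usage stays bounded by chunk_size rather than
            # by the total upload size.
            for path in self.chunk_paths:
                with open(path, 'rb') as f:
                    while True:
                        data = f.read(chunk_size)
                        if not data:
                            break
                        yield data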

The new functionality has just been merged in #82 and will be released in a release candidate version of 0.5.0 in the next couple of days. If you're still working with this, feel free to go ahead and test the code in main (or try 0.5.0rc1 when it's available) and provide any feedback.

trondhindenes commented 2 years ago

thanks for letting me know!

jcohen02 commented 1 year ago

Just to confirm that this feature has been included in release v0.5.0, which has just been made available on PyPI.