Closed by trondhindenes 2 years ago
Thanks for highlighting this issue, @trondhindenes, and taking the time to investigate it. When django-drf-filepond was originally created, I have to say that the handling of very large file uploads was not something that was really considered in detail. However, I now see from this and a couple of other recent issues that this is an important use case, so it would be great to resolve it.
I wonder if the "2x" memory usage that you're seeing comes from the fact that the chunks are loaded into the `BytesIO` object, which is then used to create an `InMemoryUploadedFile` object in uploaders.py (L345). I haven't had a chance to investigate yet, but I wonder if the creation of the `InMemoryUploadedFile` object takes a copy of the data into a new object rather than just wrapping the provided `BytesIO` object? If so, maybe this is something that can be worked around.
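As a minimal, standalone illustration of where a second full copy can appear (a sketch of the pattern, not the actual uploaders.py code — the chunk contents here are made up): assembling chunks into a `BytesIO` holds the whole payload in memory, and any subsequent call that materialises the buffer's contents as `bytes` (such as `getvalue()`) produces a second complete copy.

```python
import io

# Simulate joining upload chunks into an in-memory buffer, as the
# chunked-upload handler does (chunk data is fabricated for the demo).
chunks = [b'a' * 1024, b'b' * 1024, b'c' * 1024]

buf = io.BytesIO()
for chunk in chunks:
    buf.write(chunk)

# Materialising the buffer's contents as bytes (e.g. to hand the data
# to another object) creates a second full copy of the payload:
data = buf.getvalue()

# At this point both `buf` and `data` hold the complete file -> ~2x memory.
```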
More generally, this approach was originally taken so that a suitable file object can be passed in when creating the `TemporaryUpload` object, since this then uses the underlying file storage object to save the file to the location configured in the settings. We'd need to ensure that any fix made here retains this general workflow, since I'm hoping to add support for using remote storage (e.g. via django-storages) and this will rely on taking the data passed when creating the `TemporaryUpload` object and sending it to the temporary storage location configured in the settings.
For chunked uploads, as an alternative to the use of `BytesIO` and the creation of an `InMemoryUploadedFile` object, I wonder if we could just use the equivalent of `cat` or a similar system command to join the chunks together on disk into a single large file, then wrap this with a `TemporaryUploadedFile` object and pass it to `TemporaryUpload`? Is this effectively what your tempfile-based PoC does?
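For what it's worth, a pure-Python equivalent of the `cat` idea can be built on `shutil.copyfileobj`, which streams in fixed-size blocks so peak memory stays constant regardless of total file size. A hedged sketch — the `assemble_chunks` helper and its arguments are invented for illustration, not part of the library:

```python
import shutil

def assemble_chunks(chunk_paths, dest_path):
    """Join chunk files on disk into a single file, streaming block by
    block so memory use is bounded by the copy buffer, not file size."""
    with open(dest_path, 'wb') as out:
        for path in chunk_paths:
            with open(path, 'rb') as cf:
                # copyfileobj streams in ~64 KiB blocks by default
                shutil.copyfileobj(cf, out)
```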
I think the general aim, where possible, was to avoid writing files to disk, for both performance and security reasons. However, for large uploads that need to be chunked, if the upload fails due to poor connectivity or other issues, it was decided that providing the ability to resume an upload, rather than waste time/bandwidth re-uploading all the data that had already been successfully transferred, was a primary aim, since files are ultimately going to be stored on disk anyway when the `TemporaryUpload`/`StoredUpload` objects are created.
If you'd like to open a PR to resolve this issue, please feel free to go ahead; all contributions are much appreciated. Equally, if you'd prefer that I make the updates, I can do this, but there will be a bit of a delay due to other work.
Hi, thanks for responding. I think your assumption about the 2x memory usage is correct.
I'm not familiar enough with the codebase to understand every nuance and consequence, but I can tell you what we've done to get around it short-term:
Instead of loading a `BytesIO` object, we use a temporary file:
```python
import os
import tempfile

# Load each of the file parts into a tempfile
chunk_dir = os.path.join(storage.base_location, tuc.upload_dir)
fd, temp_file_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as file_data:
    for i in range(1, tuc.last_chunk + 1):
        chunk_file = os.path.join(chunk_dir, '%s_%s' % (tuc.file_id, i))
        if not os.path.exists(chunk_file):
            raise FileNotFoundError('Chunk file not found for chunk <%s>' % i)
        with open(chunk_file, 'rb') as cf:
            file_data.write(cf.read())
```
We don't create an `InMemoryUploadedFile` at all.
We then move the constructed temp file to the "chunk folder":
```python
assembled_tempfile_pth = os.path.join(chunk_dir, tuc.upload_id)
shutil.move(temp_file_path, assembled_tempfile_pth)
```
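One caveat worth noting with this pattern (a hedged aside; the paths and filename below are hypothetical stand-ins): `tempfile.mkstemp()` with no `dir` argument creates the file in the system temp directory, which may sit on a different filesystem from the chunk folder, in which case `shutil.move` falls back to a full copy-and-delete. Creating the temp file inside the destination directory makes the move a cheap, atomic rename:

```python
import os
import shutil
import tempfile

# Hypothetical stand-in for the real chunk folder.
chunk_dir = tempfile.mkdtemp()

# dir=chunk_dir keeps the temp file on the same filesystem as the
# destination, so shutil.move degrades to an atomic os.rename rather
# than a byte-by-byte copy.
fd, temp_file_path = tempfile.mkstemp(dir=chunk_dir)
with os.fdopen(fd, 'wb') as f:
    f.write(b'assembled upload data')

dest = os.path.join(chunk_dir, 'upload123')  # stands in for tuc.upload_id
shutil.move(temp_file_path, dest)
```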
This required us to make a few changes to api.py and the model as well, so I wouldn't consider it clean enough for a pull request. I'm not sure how much of the behavior (use a tempfile or an in-memory object) you'd want to be controllable via Django settings, and it feels like that would affect the shape of the PR.
If you'd rather do this yourself to get it right (in your own time) then we'd totally understand, of course, and we'll continue running our own "hacked" version in the meantime.
Hi @trondhindenes, apologies for the long delay in implementing a comprehensive fix for this issue, but I've now finished implementing a new `DrfFilepondChunkedUploadedFile` object that replaces the use of `InMemoryUploadedFile` when handling chunked file uploads.
As you observed when originally reporting this issue, previously, for a chunked upload, the separate file chunks stored on disk were being reconstituted into a complete file in memory, and this `InMemoryUploadedFile` object was then being passed as the content for the `FileField` in the `TemporaryUpload` object.
The new `DrfFilepondChunkedUploadedFile` object is a wrapper around the chunk files stored on disk and is passed directly to the `FileField` in the `TemporaryUpload` object. When Django handles the saving of the file by calling the `read` function on the provided file object, the `DrfFilepondChunkedUploadedFile` handles passing the data from the relevant chunk files on disk. This ensures that we don't need to reconstitute the chunks into a separate in-memory file first. The memory usage of the new approach will depend on the internal Django implementation, but it ensures that the original problem of memory usage of 2x the file size no longer occurs.
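The idea of a file-like wrapper that reads lazily from chunk files on disk can be sketched roughly as follows. This is an illustrative simplification of the concept, not the actual `DrfFilepondChunkedUploadedFile` implementation; the class name and structure here are invented:

```python
import io

class ChunkedFileReader(io.RawIOBase):
    """Illustrative file-like wrapper over on-disk chunk files.

    Serves read() calls directly from the chunk files, one at a time,
    so the full payload is never assembled in memory.
    """

    def __init__(self, chunk_paths):
        self._paths = list(chunk_paths)
        self._index = 0      # which chunk file we are currently reading
        self._fh = None      # open handle for the current chunk, if any

    def readable(self):
        return True

    def read(self, size=-1):
        if size is None or size < 0:
            # Read everything remaining, in bounded blocks.
            parts = []
            while True:
                part = self.read(65536)
                if not part:
                    break
                parts.append(part)
            return b''.join(parts)
        # Serve up to `size` bytes from the current chunk, advancing to
        # the next chunk file when the current one is exhausted.
        while self._index < len(self._paths):
            if self._fh is None:
                self._fh = open(self._paths[self._index], 'rb')
            data = self._fh.read(size)
            if data:
                return data
            self._fh.close()
            self._fh = None
            self._index += 1
        return b''  # all chunks exhausted
```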
The new functionality has just been merged in #82 and will be released in a release candidate version of 0.5.0 in the next couple of days. If you're still working with this, feel free to go ahead and test the code in main (or try 0.5.0rc1 when it's available) and provide any feedback.
thanks for letting me know!
Just to confirm that this feature has been included in release v0.5.0, which has just been made available on PyPI.
On large file uploads we're seeing heavy memory usage, roughly 2x the size of an upload, while finalizing the upload. After a lot of digging, it turns out that the culprit seems to be how django-drf-filepond pieces together chunks using an in-memory `BytesIO` object. The "2x" usage seems to be related to copies of that `BytesIO` object.

We've created a PoC that uses Python tempfiles instead of `BytesIO`, and moves files rather than copying them, in order to avoid slowing down the upload finalizer. We're prepared to spend time cleaning this up and creating a PR for it, but before doing so I was wondering whether we should attempt to make the choice of in-mem or tempfile configurable via a setting? Doing so would add some extra complexity, so it would take us some extra time, but in terms of backwards compatibility it would be safer, I guess. For smaller uploads the in-mem method is probably slightly faster, too. Maybe we could even make it so that small uploads are pieced together in-mem while larger ones are built using tempfiles.

In any case, it would be good to get some feedback from the maintainers on this.
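On the "small uploads in memory, large uploads on disk" idea: the standard library already offers exactly this trade-off in `tempfile.SpooledTemporaryFile`, which buffers in memory up to a size threshold and then transparently rolls the data over to a real temp file on disk. A minimal sketch (the threshold value here is arbitrary, purely for illustration):

```python
import tempfile

# Buffers in memory up to max_size bytes; on the write that exceeds the
# threshold, the data is transparently moved to a temp file on disk.
buf = tempfile.SpooledTemporaryFile(max_size=5 * 1024 * 1024)
buf.write(b'small payload')  # well under the threshold: stays in memory
buf.seek(0)
data = buf.read()
```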