CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0

Inefficient usage of requests.get for very large videos causes event gather to fail #235

Closed whargrove closed 1 year ago

whargrove commented 1 year ago

Describe the Bug

Using requests to fetch the video and then writing the response content to the open destination file (a BufferedWriter) loads the entire response body into memory first, so memory usage grows with the size of the video file being copied from the remote origin.

For example, consider the mp4 at https://sg001-media.sliq.net/00309/Media2/2023/04/27/House%20Floor%20Session-House%20Chamber-080505_46177_1113.mp4...

curl -Ss -I https://sg001-media.sliq.net/00309/Media2/2023/04/27/House%20Floor%20Session-House%20Chamber-080505_46177_1113.mp4 | grep ^content-length
content-length: 8068641879

The video is ~8 GB in size, which uses too much memory for the n1-standard-4 machine type we use by default in Google Cloud, especially when the Whisper model is already loaded.

Expected Behavior

Downloading video files from a remote origin should use a pattern that constrains memory usage in the Python process, regardless of the size of the file.
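
One pattern that would satisfy this is to stream the response and write it to disk in fixed-size chunks. The sketch below is only an illustration, not necessarily the fix adopted in cdp-backend: the helper names save_chunks and download_to_file, the 1 MiB chunk size, and the 60 second timeout are all assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, Union


def save_chunks(chunks: Iterable[bytes], dst: Union[str, Path]) -> int:
    """Write an iterable of byte chunks to dst and return the bytes written.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    written = 0
    with open(dst, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += f.write(chunk)
    return written


def download_to_file(
    url: str,
    dst: Union[str, Path],
    chunk_size: int = 1024 * 1024,  # assumed value: 1 MiB per chunk
) -> int:
    """Stream url to dst without loading the whole body into memory.

    Hypothetical helper; returns the number of bytes written.
    """
    # Imported lazily so save_chunks can be used without requests installed.
    import requests

    # stream=True defers reading the body until iter_content is consumed.
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        return save_chunks(response.iter_content(chunk_size=chunk_size), dst)
```

With this approach, peak memory use for the download should be roughly bounded by the chunk size rather than by the size of the video, for example download_to_file(video_url, "movie.mp4").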

Reproduction

import requests

URL = "https://sg001-media.sliq.net/00309/Media2/2023/04/27/House%20Floor%20Session-House%20Chamber-080505_46177_1113.mp4"

# Accessing .content buffers the entire response body in memory before f.write() runs.
with open("movie.mp4", "wb") as f:
    f.write(requests.get(URL).content)

Running this downloads the 8 GB file and copies it to the output file. Memory usage is unconstrained because accessing .content makes requests load the entire response body into memory before anything is written to disk.

Environment