akunohomu closed this issue 1 year ago
Hey, thanks for opening this issue ticket.
Could you remove max_workers=1
and replace the current process_vid()
function with
def process_vid(filePath):
    h = hashlib.md5()
    with open(filePath, 'rb') as f:
        # read the file in 1 MiB chunks so it is never held in memory whole
        while (part := f.read(1_048_576)):
            h.update(part)
    recent_videobyte_hashes.append(h.hexdigest())
and report back if that solves your issue?
After replacing that, it would be surprising for my code to eat 25 GB of RAM again, unless your images are huge, which is unlikely since fansly applies compression to each of them.
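As a purely illustrative sketch of how the replacement could be driven from the existing pool once max_workers=1 is removed (video_paths is a made-up name; the real call site in fansly_scraper.py differs):

import concurrent.futures

# Hypothetical call site: video_paths stands in for whatever list of
# downloaded file paths the scraper already builds before hashing.
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(process_vid, video_paths)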
Your solution appears to be working, peaking at 4 GB during the download process now. Ideally downloads would also be chunked/streamed, but since it now works I think this issue can be closed out.
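For reference, a minimal sketch of what a chunked/streamed download could look like with requests (url and out_path are placeholder names; this is not the scraper's current code):

import requests

def download_streamed(url, out_path, chunk_size=1_048_576):
    # Stream the response so only one chunk is held in memory at a time.
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)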
(While not relevant to this issue, some way to entirely skip download of already-present files would be nice, to save on time/data usage)
I'm open to ideas and pull requests 🙂
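As one possible shape for that idea, a rough sketch of a pre-download check (out_path is a placeholder; the scraper's real naming scheme may differ):

import os

def should_download(out_path):
    # Skip anything that already exists on disk with a non-zero size;
    # a stricter check could compare hashes against recent_videobyte_hashes.
    return not (os.path.isfile(out_path) and os.path.getsize(out_path) > 0)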
These lines
https://github.com/Avnsx/fansly/blob/642259d03da67b52ab188a7532cb2ab0d2afa2c8/fansly_scraper.py#L180-L182
read the entire video into memory, which might be multiple GB. Files should instead be read in chunks and the hasher updated with each chunk.
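To make the contrast concrete, a rough sketch of both patterns (file_path is a placeholder, not the exact code at those lines):

import hashlib

# Problematic: the whole video is materialised in memory before hashing.
with open(file_path, 'rb') as f:
    digest = hashlib.md5(f.read()).hexdigest()

# Better: feed the hasher fixed-size chunks, so memory stays around 1 MiB per file.
h = hashlib.md5()
with open(file_path, 'rb') as f:
    while (chunk := f.read(1_048_576)):
        h.update(chunk)
digest = h.hexdigest()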
This area
https://github.com/Avnsx/fansly/blob/642259d03da67b52ab188a7532cb2ab0d2afa2c8/fansly_scraper.py#L208-L224
does this concurrently. On Python 3.8+ the default number of threads is min(32, os.cpu_count() + 4). I attempted to update a recent download and the Python process used >25 GB of RAM, causing my OS to fail and require a reboot because system processes ran out of memory.
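A quick sketch of how that default works out in practice (assuming Python 3.8+, where that formula is the documented ThreadPoolExecutor default):

import os

# On an 8-core machine this evaluates to 12 workers.
default_workers = min(32, (os.cpu_count() or 1) + 4)
print(default_workers)

# With each worker reading a whole video into memory, peak usage is roughly
# default_workers * size of the largest videos, which lines up with the
# >25 GB peak reported above.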
As a hotfix,
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
appears to work (probably only as long as none of the files to be hashed exceeds the total free RAM). Memory usage peaked at ~4 GB.

Multi-threaded I/O is generally only faster on SSDs; it's slower on hard drives.
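Spelled out as a full, hypothetical snippet, with video_paths standing in for the scraper's actual list of files:

import concurrent.futures

# One worker means at most one whole video resides in memory at a time,
# which is consistent with the peak dropping from >25 GB to ~4 GB.
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    futures = [executor.submit(process_vid, p) for p in video_paths]
    concurrent.futures.wait(futures)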