akunohomu closed this issue 1 year ago
Hey, thanks for opening this issue ticket.
Could you remove max_workers=1
and replace the current process_vid()
function with
def process_vid(filePath):
    h = hashlib.md5()
    with open(filePath, 'rb') as f:
        # read the file in 1 MiB chunks so it is never held in memory whole
        while (part := f.read(1_048_576)):
            h.update(part)
    recent_videobyte_hashes.append(h.hexdigest())
and report back if that solves your issue?
After replacing that, it would be surprising for my code to eat 25 GB of RAM again, unless your images are huge, which is unlikely since fansly applies compression to each of them.
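As a purely illustrative sketch of how the replacement could be driven from the existing pool once max_workers=1 is removed (video_paths is a made-up name; the real call site in fansly_scraper.py differs):

import concurrent.futures

# Hypothetical call site: video_paths stands in for whatever list of
# downloaded file paths the scraper already builds before hashing.
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(process_vid, video_paths)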
Your solution appears to be working, peaking at 4 GB during the download process now. Ideally downloads would also be chunked/streamed, but since it now works I think this issue can be closed out.
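For reference, a minimal sketch of what a chunked/streamed download could look like with requests (url and out_path are placeholder names; this is not the scraper's current code):

import requests

def download_streamed(url, out_path, chunk_size=1_048_576):
    # Stream the response so only one chunk is held in memory at a time.
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)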
(While not relevant to this issue, some way to entirely skip download of already-present files would be nice, to save on time/data usage)
I'm open to ideas and pull requests 🙂
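As one possible shape for that idea, a rough sketch of a pre-download check (out_path is a placeholder; the scraper's real naming scheme may differ):

import os

def should_download(out_path):
    # Skip anything that already exists on disk with a non-zero size;
    # a stricter check could compare hashes against recent_videobyte_hashes.
    return not (os.path.isfile(out_path) and os.path.getsize(out_path) > 0)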
These lines
https://github.com/Avnsx/fansly/blob/642259d03da67b52ab188a7532cb2ab0d2afa2c8/fansly_scraper.py#L180-L182
read the entire video into memory, which might be multiple GB. Files should instead be read in chunks and the hasher updated with each chunk.
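To make the contrast concrete, a rough sketch of both patterns (file_path is a placeholder, not the exact code at those lines):

import hashlib

# Problematic: the whole video is materialised in memory before hashing.
with open(file_path, 'rb') as f:
    digest = hashlib.md5(f.read()).hexdigest()

# Better: feed the hasher fixed-size chunks, so memory stays around 1 MiB per file.
h = hashlib.md5()
with open(file_path, 'rb') as f:
    while (chunk := f.read(1_048_576)):
        h.update(chunk)
digest = h.hexdigest()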
This area
https://github.com/Avnsx/fansly/blob/642259d03da67b52ab188a7532cb2ab0d2afa2c8/fansly_scraper.py#L208-L224
does this concurrently. On Python 3.8+ the default number of threads is min(32, os.cpu_count() + 4). I attempted to update a recent download and the Python process used >25 GB of RAM, causing my OS to fail and require a reboot because system processes ran out of memory.
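A quick sketch of how that default works out in practice (assuming Python 3.8+, where that formula is the documented ThreadPoolExecutor default):

import os

# On an 8-core machine this evaluates to 12 workers.
default_workers = min(32, (os.cpu_count() or 1) + 4)
print(default_workers)

# With each worker reading a whole video into memory, peak usage is roughly
# default_workers * size of the largest videos, which lines up with the
# >25 GB peak reported above.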
As a hotfix,
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
appears to work (probably only as long as none of the files to be hashed exceeds the total free RAM). Memory usage peaked at ~4 GB.

Multi-threaded I/O is generally only faster on SSDs; it's slower on hard drives.
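Spelled out as a full, hypothetical snippet, with video_paths standing in for the scraper's actual list of files:

import concurrent.futures

# One worker means at most one whole video resides in memory at a time,
# which is consistent with the peak dropping from >25 GB to ~4 GB.
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    futures = [executor.submit(process_vid, p) for p in video_paths]
    concurrent.futures.wait(futures)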