fyoorer / ShadowClone

Unleash the power of cloud

Issue: OSError: [Errno 24] Too many open files when running shadowclone.py #29

Open WarrDaddy opened 1 year ago

WarrDaddy commented 1 year ago

When running shadowclone.py with a large input file, the program fails with the following error message:

(env) ➜  ShadowClone git:(main) ✗ python shadowclone.py -i results_github.txt -s 1000 -o output_shadow -c "/go/bin/httpx -l {INPUT}"
2023-05-05 14:39:52,314 [INFO] Splitting input file into chunks of 1000 lines
2023-05-05 14:39:52,410 [INFO] Uploading chunks to storage
runtime-shadowclone/dbb8cfdc-92ed-4b8d-973e-d21edad30cd0
Traceback (most recent call last):
  File "/Users/hnguyen/arsenal/ShadowClone/shadowclone.py", line 174, in <module>
    filekeys = pool.map(upload_to_bucket, chunks)
  File "/Users/hnguyen/.pyenv/versions/3.9.5/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/hnguyen/.pyenv/versions/3.9.5/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/Users/hnguyen/.pyenv/versions/3.9.5/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/hnguyen/.pyenv/versions/3.9.5/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/hnguyen/arsenal/ShadowClone/shadowclone.py", line 44, in upload_to_bucket
    chunk_hash = perform_hashing(chunk)
  File "/Users/hnguyen/arsenal/ShadowClone/hasher.py", line 8, in perform_hashing
    return hashlib.md5(file_as_bytes(open(fname, 'rb'))).hexdigest()
OSError: [Errno 24] Too many open files: '/var/folders/q2/v0knyw_n45598tj6xm9clf4r0000gp/T/small_file_100000.txt'

This error occurs because the program opens more files than the operating system allows a single process to hold open at once. The traceback points at perform_hashing() in hasher.py, which opens each chunk file without explicitly closing it, so file descriptors accumulate until the per-process limit is exceeded.
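For context, the per-process descriptor limit can be inspected from Python (equivalent to ulimit -n in the shell). This snippet is only an illustration and not part of ShadowClone:

import resource

# Soft and hard limits on open file descriptors for the current process.
# On macOS the soft limit often defaults to 256, which a large scan can
# exhaust quickly if descriptors are leaked.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")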

Steps to reproduce:

1. Run shadowclone.py with a large input file.
2. Wait for the program to fail with the above error message.

Expected behavior: The program should be able to process large input files without exceeding the system limit on the number of open files.

Actual behavior: The program fails with an OSError due to too many open files.

Proposed solution: One possible solution is to modify the file-hashing code (perform_hashing() in hasher.py, called from upload_to_bucket()) to open each chunk with a with statement so the file is closed automatically after it has been read. This would keep the number of open files from exceeding the system limit; a sketch follows below.
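A minimal sketch of the change in hasher.py, assuming file_as_bytes() simply reads the whole file object into memory (if it does more than that, the with block would wrap that call instead):

import hashlib

def perform_hashing(fname):
    # Use a context manager so the descriptor is released as soon as the
    # chunk has been hashed, instead of waiting for garbage collection.
    with open(fname, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()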

Note: I have the fix and I could submit a pull request.