explosion / weasel

🦦 weasel: A small and easy workflow system

MemoryError on computing checksums for large files #79

Open oroszgy opened 6 months ago

oroszgy commented 6 months ago

When a command depends on a large file (one that does not fit into memory), weasel still tries to load the whole file, which results in a MemoryError.

The traceback for such a run:


  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/bin/weasel", line 8, in <module>
    sys.exit(app())
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 42, in project_run_cli
    project_run(
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 88, in project_run
    project_run(
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 113, in project_run
    update_lockfile(current_dir, cmd)
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 270, in update_lockfile
    data[command["name"]] = get_lock_entry(project_dir, command)
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 286, in get_lock_entry
    deps = get_fileinfo(project_dir, command.get("deps", []))
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 308, in get_fileinfo
    md5 = get_checksum(file_path) if file_path.exists() else None
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/util/hashing.py", line 33, in get_checksum
    return hashlib.md5(Path(path).read_bytes()).hexdigest()
  File "/home/gorosz/Applications/miniconda3/lib/python3.10/pathlib.py", line 1127, in read_bytes
    return f.read()
MemoryError

svlandeg commented 6 months ago

This happens because Weasel computes a checksum for each dependency of a command to determine whether it has changed. To prevent this, the large file should simply not be listed as an input or output of the command; then it won't be processed or validated.

oroszgy commented 6 months ago

Thanks. I just discovered this workaround for myself as well. Do you think it's feasible to use the last modification date instead of hashes? Alternatively, would it be a solution to compute hashes for file chunks to address this issue (refer to https://stackoverflow.com/questions/1131220/get-the-md5-hash-of-big-files-in-python)?
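
To make the two ideas concrete, here is a minimal sketch of what they could look like. It is not weasel's actual code: the function names `get_checksum_chunked` and `get_mtime_fingerprint`, the 1 MiB default chunk size, and the fingerprint format are assumptions made for illustration. The first function feeds the file to `hashlib.md5` piece by piece, so memory use is bounded by the chunk size and the digest matches the one produced by hashing the whole file at once. The second avoids reading the file at all and relies on size and modification time only.

```python
import hashlib
from pathlib import Path
from typing import Union


def get_checksum_chunked(path: Union[str, Path], chunk_size: int = 1024 * 1024) -> str:
    """Hypothetical helper: MD5 of a single file, read in fixed-size chunks.

    Produces the same hex digest as hashing the full file contents in one
    call, but never holds more than `chunk_size` bytes in memory.
    """
    md5 = hashlib.md5()
    with Path(path).open("rb") as f:
        # f.read() returns b"" at end of file, which stops the iterator.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def get_mtime_fingerprint(path: Union[str, Path]) -> str:
    """Hypothetical alternative: fingerprint from size and modification time.

    Cheap because the file is never read, but it cannot detect a change
    that preserves both the size and the timestamp.
    """
    stat = Path(path).stat()
    return f"{stat.st_size}-{stat.st_mtime_ns}"
```

The chunked variant keeps the lockfile semantics unchanged (same digest, content-based), while the modification-time variant trades accuracy for speed, since timestamps can change without the content changing, for example after a fresh checkout or copy.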