IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

Running out of disk space in ci/cd tests #516

Closed: daw3rd closed this issue 1 month ago

daw3rd commented 1 month ago

The new PII PR #471 is pushing our disk space requirements over a threshold, causing failures unrelated to PII. We need to find a way to reduce disk space usage during tests, most immediately in make test-src in the transforms tree. One approach might be to clear the venv after test-src is run. Alternatively, it might be the following:

This needs to fix the out-of-disk-space problem, likely with changes in PR #511.

@daw3rd @touma-I I hope the above PR will resolve this issue. Just an observation from the language package transforms: text_encoder and pii_redactor use SentenceTransformers and Flair models. Here is their approximate disk space usage:

  1. Base System and Dependencies
    • Ubuntu OS and Basic Tools: ~3-5 GB (already part of the runner)
    • Python Packages and Dependencies:
      • PyTorch: ~1-2 GB
      • Hugging Face Transformers: ~500 MB
      • Flair: ~100 MB
      • SentenceTransformers: ~200 MB
      • Other dependencies (NumPy, SciPy, etc.): ~500 MB
  2. Model Files
    • Flair Models:
      • Standard Flair models (e.g., POS tagging, NER): ~300-500 MB each
    • SentenceTransformers Models:
      • paraphrase-MiniLM-L6-v2: ~100 MB
      • distilbert-base-nli-stsb-mean-tokens: ~300 MB
      • Other models can range from ~100 MB to 1 GB
  3. Cache Files
    • PyTorch Cache:
      • PyTorch model weights: ~500 MB - 2 GB
    • Transformers Cache:
      • Tokenizers and additional model files: ~500 MB - 1 GB
  4. Temporary Files During Execution
    • Intermediate files for model loading and processing: ~500 MB - 1 GB
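The estimates above can be tallied as a rough sanity check. This is only a sketch: it assumes exactly one Flair model and one SentenceTransformers model are downloaded, leaves out the Ubuntu base image (already on the runner), and the dictionary keys and function name are illustrative.

```python
# Rough tally of the listed estimates, as (low, high) ranges in GB.
# Assumption: one Flair model and one SentenceTransformers model are used.
ESTIMATES_GB = {
    "PyTorch": (1.0, 2.0),
    "Hugging Face Transformers": (0.5, 0.5),
    "Flair": (0.1, 0.1),
    "SentenceTransformers": (0.2, 0.2),
    "Other dependencies": (0.5, 0.5),
    "One Flair model": (0.3, 0.5),
    "One SentenceTransformers model": (0.1, 0.3),
    "PyTorch cache": (0.5, 2.0),
    "Transformers cache": (0.5, 1.0),
    "Temporary files": (0.5, 1.0),
}


def total_range_gb(estimates):
    """Sum the low ends and the high ends separately, rounded to 0.1 GB."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return round(low, 1), round(high, 1)


if __name__ == "__main__":
    low, high = total_range_gb(ESTIMATES_GB)
    print(f"Estimated additional disk usage: {low}-{high} GB")
```

Under these assumptions the total comes to roughly 4.2-8.1 GB on top of the base image.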

Minimum: ~5-8 GB (using minimal models and dependencies); these figures are approximate. I could also see the following df -h stats:

Run df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   55G   19G  75% /
tmpfs           7.9G  172K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   21G   53G  29% /
tmpfs           7.9G  172K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001

I suggest moving the installation directory and any temporary files to the /mnt partition, which has significantly more space (around 66 GB). We can do this by setting environment variables such as TMPDIR to point to /mnt. We should first validate whether this has any downsides. Let me know your thoughts.
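A minimal sketch of the TMPDIR idea, assuming the tests are launched as child processes from a Python wrapper. The helper names and the dpk-tmp subdirectory are hypothetical, and whether /mnt is writable by the runner user still needs to be checked.

```python
import os
import shutil


def pick_roomiest(candidates):
    """Return the existing candidate directory with the most free bytes."""
    existing = [p for p in candidates if os.path.isdir(p)]
    if not existing:
        raise FileNotFoundError("none of the candidate directories exist")
    return max(existing, key=lambda p: shutil.disk_usage(p).free)


def env_with_tmpdir(base_dir, subdir="dpk-tmp"):
    """Create base_dir/subdir and return a copy of os.environ with TMPDIR set to it."""
    tmp = os.path.join(base_dir, subdir)
    os.makedirs(tmp, exist_ok=True)
    env = dict(os.environ)
    env["TMPDIR"] = tmp
    return env


# Example (not executed here; needs write access to the chosen directory):
#   target = pick_roomiest(["/mnt", "/tmp"])
#   subprocess.run(["make", "test-src"], env=env_with_tmpdir(target))
```

Passing the environment to the child process, rather than changing it in place, keeps the redirection limited to the test run. Note that TMPDIR only affects tools that honor it; the venv location itself would have to be moved separately.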

@daw3rd still the same issue ‘no space left’.

@SowmyaLR Many thanks for trying this. I will be starting a new issue based on your findings, and I will be making direct changes to your branch to disable the failed test. Please consider starting a new issue and a new PR for adding more transformers. For this one, I will take it from here and keep you informed of what we did prior to merging it with dev. Thanks again for your contributions.

Originally posted by @touma-I in https://github.com/IBM/data-prep-kit/issues/471#issuecomment-2296838509

daw3rd commented 1 month ago

A fix has been applied in PR #538 to clean each venv as part of a transform's test-src target.
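The thread does not show the actual change in PR #538, so the following is only an illustration of the general idea: run a transform's tests, then delete its venv regardless of the outcome so the next transform starts with the space reclaimed. The function names and the default venv directory name are assumptions.

```python
import shutil
import subprocess
from pathlib import Path


def dir_size_bytes(path):
    """Sum the sizes of regular files under path, skipping symlinks."""
    return sum(
        f.lstat().st_size
        for f in Path(path).rglob("*")
        if f.is_file() and not f.is_symlink()
    )


def run_tests_and_clean(transform_dir, test_cmd, venv_name="venv"):
    """Run test_cmd in transform_dir, then remove its venv even if tests fail.

    Returns (exit_code, bytes_freed).
    """
    venv = Path(transform_dir) / venv_name
    try:
        exit_code = subprocess.run(test_cmd, cwd=transform_dir).returncode
    finally:
        # Measure before deleting so the caller can log how much was reclaimed.
        freed = dir_size_bytes(venv) if venv.is_dir() else 0
        shutil.rmtree(venv, ignore_errors=True)
    return exit_code, freed
```

Cleaning in a finally block means a failing transform does not leave its venv behind to exhaust the disk for later ones.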