The new PII PR #471 is pushing our disk space requirements over a threshold, causing failures unrelated to PII. We need to find a way to reduce disk space usage during tests, most immediately for make test-src in the transforms tree.
One approach might be to clear the venv after test-src is run. Alternatively, the approach might be the one in the comment from @touma-I quoted at the end of this thread.
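For the first approach, clearing the venv after test-src has run, a minimal sketch might look like the following. It assumes each venv was created by `python -m venv` or virtualenv (both write a `pyvenv.cfg` file at the venv root); the function names and the command-line usage are hypothetical and not part of the repository's Makefiles.

```python
import shutil
import sys
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Sum the sizes of regular files under path, skipping symlinks."""
    return sum(
        p.lstat().st_size
        for p in path.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


def clear_venv(venv_dir: Path) -> int:
    """Delete venv_dir if it looks like a virtual environment; return bytes freed."""
    # Safety check: pyvenv.cfg marks a venv root, so refuse to delete anything else.
    if not (venv_dir / "pyvenv.cfg").is_file():
        return 0
    freed = dir_size_bytes(venv_dir)
    shutil.rmtree(venv_dir)
    return freed


if __name__ == "__main__":
    # Hypothetical usage: python clear_venv.py <venv-dir> [<venv-dir> ...]
    for arg in sys.argv[1:]:
        freed = clear_venv(Path(arg))
        print(f"{arg}: freed {freed / 1e6:.0f} MB")
```

Running this once per transform, right after its tests finish, would keep only one venv on disk at a time instead of letting them accumulate across the tree.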
This needs a fix for the out-of-disk-space problem, likely with the changes in PR #511.
@daw3rd @touma-I I hope the above PR will resolve this issue. Just an observation from the language package transforms: text_encoder and pii_redactor use SentenceTransformers and Flair models. Here is an approximate disk space usage breakdown:
Base System and Dependencies
  Ubuntu OS and Basic Tools: ~3-5 GB (already part of the runner)
  Python Packages and Dependencies:
    PyTorch: ~1-2 GB
    Hugging Face Transformers: ~500 MB
    Flair: ~100 MB
    SentenceTransformers: ~200 MB
    Other dependencies (NumPy, SciPy, etc.): ~500 MB
Model Files
  Flair Models:
    Standard Flair models (e.g., POS tagging, NER): ~300-500 MB each
  SentenceTransformers Models:
    paraphrase-MiniLM-L6-v2: ~100 MB
    distilbert-base-nli-stsb-mean-tokens: ~300 MB
    Other models can range from ~100 MB to 1 GB
Cache Files
  PyTorch Cache:
    PyTorch model weights: ~500 MB - 2 GB
  Transformers Cache:
    Tokenizers and additional model files: ~500 MB - 1 GB
Temporary Files During Execution
  Intermediate files for model loading and processing: ~500 MB - 1 GB
Minimum: ~5-8 GB (using minimal models and dependencies). All figures are approximate.
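As a rough cross-check of the figures above, the sketch below sums the low and high ends of each estimate. It assumes exactly one Flair model and one SentenceTransformers model are downloaded and counts 1 GB as 1000 MB; both are assumptions for illustration, not measurements.

```python
# Each entry: (component, low estimate in MB, high estimate in MB),
# taken from the approximate breakdown above.
ESTIMATES_MB = [
    ("Ubuntu OS and basic tools", 3000, 5000),
    ("PyTorch", 1000, 2000),
    ("Hugging Face Transformers", 500, 500),
    ("Flair", 100, 100),
    ("SentenceTransformers", 200, 200),
    ("Other dependencies", 500, 500),
    ("Flair model (assumed: one)", 300, 500),
    ("SentenceTransformers model (assumed: one)", 100, 1000),
    ("PyTorch cache", 500, 2000),
    ("Transformers cache", 500, 1000),
    ("Temporary files", 500, 1000),
]


def total_mb(estimates, column):
    """Sum one numeric column (1 = low, 2 = high) across all components."""
    return sum(row[column] for row in estimates)


low_mb = total_mb(ESTIMATES_MB, 1)
high_mb = total_mb(ESTIMATES_MB, 2)
print(f"low:  {low_mb / 1000:.1f} GB")   # low:  7.2 GB
print(f"high: {high_mb / 1000:.1f} GB")  # high: 13.8 GB
```

Under those assumptions the low end comes to about 7.2 GB, which sits inside the ~5-8 GB minimum quoted above, while the high end is closer to 13.8 GB.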
Also I could see df -H stats as
I suggest moving the installation directory and any temporary files to the /mnt partition, which has significantly more space (around 66 GB). We can do this by setting environment variables like TMPDIR to point to /mnt. Yes, we should validate whether this has any downsides. Let me know your thoughts.
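A minimal sketch of that redirect, assuming the test process may create directories under the chosen base path. Only TMPDIR is named above; PIP_CACHE_DIR, HF_HOME, and TORCH_HOME are additional environment variables (read by pip, the Hugging Face libraries, and torch.hub respectively) that are assumed to be relevant here, and the function name is hypothetical.

```python
import os
from pathlib import Path


def redirect_scratch_dirs(base: Path) -> dict[str, str]:
    """Create scratch directories under base and point environment variables at them."""
    targets = {
        "TMPDIR": base / "tmp",               # temporary files (Python tempfile, many CLI tools)
        "PIP_CACHE_DIR": base / "pip-cache",  # pip download and wheel cache
        "HF_HOME": base / "huggingface",      # Hugging Face model and tokenizer cache root
        "TORCH_HOME": base / "torch",         # torch.hub cache root
    }
    applied = {}
    for name, path in targets.items():
        path.mkdir(parents=True, exist_ok=True)
        # os.environ changes affect this process and any child processes it starts.
        os.environ[name] = str(path)
        applied[name] = str(path)
    return applied
```

Calling `redirect_scratch_dirs(Path("/mnt/scratch"))` (the directory name is only an example) before the tests are launched from the same process would move these caches off the root partition. In a CI workflow the same variables would need to be exported in the job environment before make test-src runs, since a change made inside one process does not carry over to later steps.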
@daw3rd still the same issue ‘no space left’.
@SowmyaLR Many thanks for trying this. I will be starting a new issue based on your findings, and I will be making direct changes to your branch to disable the failed test. Please consider starting a new Issue and a new PR for adding more transformers. For this one, I will take it from here and keep you informed of what we did prior to merging it with dev. Thanks again for your contributions.
Originally posted by @touma-I in https://github.com/IBM/data-prep-kit/issues/471#issuecomment-2296838509