connortann closed this issue 1 year ago
Comparison of various options I've tried for caching dependencies.
N.B. caches can be viewed and managed via the UI: https://github.com/dsgibbons/shap/actions/caches
Repository caches are limited to 10 GB.
Env | Baseline | 1: Cache pip | 2: Cache whole env | 3: Cache some libs |
---|---|---|---|---|
py3.7 | 4m 14s | 1m 34s | 3m 15s | |
py3.8 | 5m 6s | 1m 50s | 3m 4s | |
py3.9 | 4m 25s | 4m 34s | 2m 25s | 2m 56s |
py3.10 | 4m 30s | 4m 41s | 1m 44s | 2m 51s |
py3.11 | 4m 42s | 5m 17s | 2m 42s | 2m 51s |
Average | 4m 35s | 4m 50s | 2m 3s | 3m 35s |
- **Baseline**: the existing approach; just pip-install with no caching.
- **Option 1 (cache pip)**: caches the downloaded wheels, but not the installed environment, as per the action docs.
- **Option 2 (cache whole env)**: caches the entire installed environment, as per this blog.
- **Option 3 (cache some libs)**: cache only the libraries which need to be built, such as pyspark, and leave the other libs to be pip-installed as before.
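As a rough sketch of what options 1 and 3 might look like in a workflow file (the cache paths, key, and matrix variables below are illustrative assumptions, not the exact workflow):

```yaml
# Option 1: let setup-python cache the downloaded wheels.
- uses: actions/setup-python@v4
  with:
    python-version: ${{ matrix.python-version }}
    cache: 'pip'

# Option 3: cache only the slow-to-build libraries themselves.
# Paths and key are illustrative; adjust to the real env layout.
- uses: actions/cache@v3
  with:
    path: |
      ${{ env.pythonLocation }}/lib/python*/site-packages/pyspark*
      ${{ env.pythonLocation }}/lib/python*/site-packages/nvidia*
      ${{ env.pythonLocation }}/lib/python*/site-packages/torch*
    key: libs-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('setup.py') }}
```

Note that option 3 requires a cache key that invalidates when the pinned library versions change, otherwise stale builds would be restored.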
To decide which packages to cache: we want to save the most time whilst keeping under ~2 GB total cache size per env. Some measurements from experimentation, sorted by time saved per MB of cache used:
Package | Size (MB) | Build time (s) | s / MB |
---|---|---|---|
site-packages/pyspark* | 310 | 12s | 0.039 |
site-packages/nvidia* | 1521 | 40s | 0.026 |
site-packages/torch* | 619 | 13s | 0.021 |
site-packages/tensorflow* | 586 | 12s | 0.020 |
site-packages/xgboost* | 200 | 4s | 0.020 |
So, we decide to cache just the first three libraries (pyspark, nvidia, torch). In future, if we drop support for any Python versions, we can cache more libraries.
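The ranking above can be reproduced in a few lines (figures copied from the table; a sketch only):

```python
# Sizes (MB) and build times saved (s), copied from the table above.
packages = {
    "pyspark": (310, 12),
    "nvidia": (1521, 40),
    "torch": (619, 13),
    "tensorflow": (586, 12),
    "xgboost": (200, 4),
}

# Rank by seconds of build time saved per MB of cache consumed.
ranked = sorted(packages, key=lambda p: packages[p][1] / packages[p][0], reverse=True)
print(ranked)  # ['pyspark', 'nvidia', 'torch', 'tensorflow', 'xgboost']

# Caching the top three costs 310 + 1521 + 619 = 2450 MB per env,
# which is in the region of the ~2 GB budget.
top3_mb = sum(packages[p][0] for p in ranked[:3])
print(top3_mb)  # 2450
```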
These options are implemented in PR #84.
I think there are a few areas for improvement in the GitHub test suite that we could address to improve execution speed. Currently the unit tests take almost 20 minutes to run on CI. Reducing that would shorten the time it takes to validate PRs, improving our effectiveness as reviewers.
TODO
Slowest tests
[Updated] Here is the current set of slowest tests: