Docker builds can get out of sync between users, most likely because `pip` packages are cached during the build process. Pinning the dependencies should resolve this.
Which ones to pin? I think we should pin anything we use directly, even if the dependency is technically satisfied through another package. The biggest example is `pandas`, which is only installed implicitly as a dependency of `catalystcoop.pudl`. I believe such packages should be made explicit in requirements.txt.
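As a rough illustration of what pinning could look like, the sketch below builds `name==version` lines from whatever versions are installed in the environment where it runs. The helper name `pin_lines` is hypothetical, and using the currently installed versions as the pins is an assumption; the versions we actually want may differ.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_lines(packages, get_version=version):
    """Build 'name==version' requirement lines for the given distribution names.

    get_version defaults to importlib.metadata.version, which reports the
    version installed in the current environment (an assumption about which
    version should be pinned).
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={get_version(name)}")
        except PackageNotFoundError:
            # Not installed here: emit the bare name so it can be pinned by hand.
            lines.append(name)
    return lines
```

Calling `print("\n".join(pin_lines(["pandas", "catalystcoop.pudl"])))` inside the built image would print candidate lines to review before copying them into requirements.txt.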
Below is a list of the packages that are implicitly installed but directly used. I compiled it by running `find . -name '*.py' -exec grep "import" {} \; | sort | uniq` and manually looking for packages that are outside the standard library and not already in requirements.txt.
requests
coloredlogs
numpy
pandas
sqlalchemy
By this method, the final package list for requirements.txt would have: