Open wsavran opened 2 years ago
I believe it is important that each pyCSEP distribution provides a list of package versions that work. The release can leave the packages unspecified, but the user should be able to find a complete list of the combination of versions that work. Maybe this list of versions can be given in a Dockerfile, or with an environment summary from the CI stack.
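As a rough illustration of how a user could use such a list, the sketch below compares an environment against a known-good set of versions. It assumes the list is published in `name==version` form (one entry per line); the helper names `parse_pins`, `find_mismatches`, and `installed_versions` are hypothetical and not part of pyCSEP.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines; blank lines and '#' comments are ignored."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed):
    """Return {name: (expected, actual)} for each pin the environment does not match.

    'actual' is None when the package is not installed at all.
    """
    return {
        name: (expected, installed.get(name))
        for name, expected in pins.items()
        if installed.get(name) != expected
    }


def installed_versions():
    """Collect {name: version} for every distribution visible to this interpreter."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            versions[name.lower()] = dist.version
    return versions
```

Calling `find_mismatches(parse_pins(known_good_text), installed_versions())` would then list the packages that differ from the published combination; an empty result means the environment matches the list.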
Too bad that PEP 665 got rejected. Luckily, there will be a 'take 2'. This discussion links to another approach, which unfortunately is also not implemented yet. So we'll have to wait or do it ourselves.
I had exactly the same thought as Philip: for every release, we (you) could provide a `pip freeze` / `conda env export` only for the packages that pycsep requires (maybe within a `requirements_pinned.txt/yml`). Not elegant, but perhaps sufficient. It's a backup solution in case of dependency issues.
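A minimal sketch of what "a freeze only for the packages that pycsep requires" could look like, using the standard library's `importlib.metadata` rather than `pip freeze` itself. The package names in the usage block are placeholders chosen for illustration, not pyCSEP's actual requirement list, and the helper names are hypothetical.

```python
from importlib import metadata


def format_pins(versions):
    """Turn {name: version} into sorted 'name==version' lines."""
    return [f"{name}=={version}" for name, version in sorted(versions.items())]


def installed_versions(names):
    """Look up the installed version of each name; missing packages are skipped."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found


if __name__ == "__main__":
    # Placeholder subset of direct dependencies, for illustration only.
    direct = ["numpy", "scipy", "pandas", "matplotlib"]
    print("\n".join(format_pins(installed_versions(direct))))
```

The output could be redirected into a pinned requirements file during the release flow. Packages that are not installed are silently skipped here, which a real release script would probably want to treat as an error.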
I like that idea as well. This should be done by CI during the release flow, which should also create and register a Docker image of the build. What is the reason behind not pinning all dependencies and only the packages that pycsep requires?
Great.
> What is the reason behind not pinning all dependencies and only the packages that pycsep requires?

I thought to keep the `requirements` more compact. But it's possibly a bad idea: pinning only pycsep's direct dependencies (e.g., numpy) leaves their own dependencies unpinned, so it may not guarantee reproducibility as we intend it. So yes, we'll have to report the versions of all packages.
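To make the point about indirect dependencies concrete, here is a sketch that walks from one package to everything it pulls in, so that all of those versions can be reported. It reads requirement strings from `importlib.metadata.requires` and extracts names with a simple regular expression; that parsing is a heuristic assumption (optional extras are skipped, environment markers are otherwise ignored), and the function names are hypothetical.

```python
import re
from importlib import metadata

# Leading project name of a requirement string such as 'numpy>=1.20'.
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def requirement_name(requirement):
    """Extract a normalized (lowercase, dash-separated) project name."""
    match = _NAME.match(requirement)
    if match is None:
        return None
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def installed_requires(name):
    """Declared requirements of an installed distribution, without optional extras."""
    try:
        declared = metadata.requires(name) or []
    except metadata.PackageNotFoundError:
        return []
    return [req for req in declared if "extra ==" not in req]


def dependency_closure(root, requires=installed_requires):
    """Return the root plus every package reachable through its requirements."""
    seen = set()
    stack = [requirement_name(root)]
    while stack:
        name = stack.pop()
        if name is None or name in seen:
            continue
        seen.add(name)
        stack.extend(requirement_name(req) for req in requires(name))
    return sorted(seen)
```

The `requires` argument is injectable so the traversal can be tried on a made-up dependency graph without installing anything. In practice `pip freeze` or `conda env export` on a clean environment gives the same information more reliably; the sketch only shows why the direct requirements alone are not the full picture.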
I ask because that was my first thought as well, and it is how I implemented things for the first iteration of the global experiment. There are caveats, though. A pro of that solution is that it provides cross-platform support, because an exact environment specification cannot guarantee the same environment across different operating systems (different binaries, etc.). I think Docker provides a good solution for cross-platform support.
We could maybe provide both, a requirements.yml and an environment.yml where the former will provide pinned deps for pycsep and the latter will provide an exact environment that will run on Ubuntu.
A similar approach is also suggested in this article: Reproducible and upgradable Conda environments with conda-lock
Essentially:
1. Keep `environment.yml` clean, with 'versioned direct' dependencies, to be in control of upgrades;
2. `conda env export > environment.lock.yml`.
But, more interestingly, the article proposes a solution for the several technical difficulties with `conda env export` in step 2 (most importantly, the possible cross-platform inconsistencies, which we currently circumvent by using Docker): `conda-lock`. Basically, it defines a set of URLs to download (also speeding up installs). Nice: "you can specify which operating system you want to build the lock file for, so you can create a Linux lock file on other operating systems. By default it generates for Linux, macOS, and 64-bit Windows out of the box".
So we can create platform-specific stand-ins for `environment.lock.yml` (e.g., `conda-linux-64.lock`, `conda-osx-64.lock`, `conda-win-64.lock`), which may let us abandon Docker (or similar) completely for reproducibility packages. 🤞
Cool: the conda environment can be created directly from this lock file: `conda create --name fromlock --file conda-linux-64.lock`.
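A small sketch of how a reproducibility package might pick the right lock file for the machine it runs on and build the `conda create` command quoted above. Only the three lock files named in this thread are covered; the mapping from `platform.system()` / `platform.machine()` to those file names is an assumption, and `lock_file_for` and `create_command` are hypothetical helpers.

```python
import platform

# Platform tags limited to the three lock files named in the discussion.
_PLATFORM_TAGS = {
    ("linux", "x86_64"): "linux-64",
    ("darwin", "x86_64"): "osx-64",
    ("windows", "amd64"): "win-64",
    ("windows", "x86_64"): "win-64",
}


def lock_file_for(system=None, machine=None):
    """Return the lock file name for a platform (defaults to the current one)."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    try:
        tag = _PLATFORM_TAGS[(system, machine)]
    except KeyError:
        raise ValueError(f"no lock file known for {system}/{machine}") from None
    return f"conda-{tag}.lock"


def create_command(env_name, system=None, machine=None):
    """Build the argument list for creating an environment from the lock file."""
    return ["conda", "create", "--name", env_name,
            "--file", lock_file_for(system, machine)]
```

The returned list could be handed to `subprocess.run` by an install script. Unsupported platforms raise a `ValueError` rather than falling back to a different lock file, since installing the wrong platform's binaries would defeat the purpose.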
I like that approach of creating platform-specific lock files; we can just do that on release and then provide folks with a reproducible installation. I still think Docker is a solid tool for sharing environments, but this conda-lock is worth exploring. If we can provide some of the legwork in setting up Dockerfiles, or at least a template, I'm pretty happy with the tool. There is also a tool called `repo2docker`, provided by Jupyter, that we can explore as well.
I wonder if we could provide wheels for the required dependencies, which could also deal with cartopy/pygeos issues.
Working through the reproducibility packages and thinking about the testing experiments uncovered the need for a discussion on how we are managing the dependencies in pyCSEP.
Currently, we only pin dependencies when a conflict or issue is known. Once the issue or conflict has been resolved, we remove the pin.
Pros of this approach:
Cons of this approach:
Goals:
Possible ways to improve reproducibility of the computing environment: