Dealing with dependencies in pyCSEP

wsavran commented 2 years ago

Working through the reproducibility packages and thinking about the testing experiments uncovered the need for a discussion on how we are managing the dependencies in pyCSEP.

Currently we are only pinning dependencies when a conflict or issue is known. Once the issue or conflict has been resolved we remove the pin.

Pros of this approach:

Provides most up-to-date versions of packages and dependencies
Plays much more nicely when users are trying to install pyCSEP alongside their working environment (this is the case for most normal users)

Cons of this approach:

The environment can drift over time, so simply choosing a version of pyCSEP is not enough to reproduce results
We have to deal with inevitable errors in pyCSEP from third-party incompatibilities (e.g., a new version of numpy removes a function used by matplotlib)

Goals:

Enable reproducible research using the software
Provide users with the ability to easily integrate pycsep into their working environment (i.e., although users should create their own environments, we don't want to create dependency issues when they try to install pycsep).

Possible ways to improve reproducibility of the computing environment:

Users could be responsible for providing a reproducible environment themselves eg, with reproducibility packages
Pin versions of dependencies within pycsep (see above) or use a min/max dependency approach
We could provide Docker images associated with each build to freeze the high- and low-level dependencies (could be built into CI). This raises further questions, such as where to store the image and how to reproduce it exactly at any time.
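
To make the min/max dependency option above more concrete, here is a minimal sketch of checking installed versions against lower and upper bounds. Everything named here (`BOUNDS`, `parse_version`, `within_bounds`, `check_installed`) is hypothetical, the version numbers are invented for illustration and are not pyCSEP's actual requirements, and the version parsing is deliberately simplified (it is not a full PEP 440 implementation).

```python
import re
from importlib import metadata

# Hypothetical bounds for illustration only; NOT pyCSEP's real requirements.
# Each entry is (minimum inclusive, maximum exclusive); None means unbounded.
BOUNDS = {
    "numpy": ("1.20", "2.0"),
    "matplotlib": ("3.3", None),
}


def parse_version(text, width=4):
    """Turn '1.21.3' into (1, 21, 3, 0).

    Simplified: reads leading digits of each dot-separated piece and
    ignores suffixes such as 'rc1'. Pads with zeros so that '1.20'
    and '1.20.0' compare as equal.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    parts.extend([0] * (width - len(parts)))
    return tuple(parts)


def within_bounds(version, minimum, maximum):
    """Return True if minimum <= version < maximum."""
    v = parse_version(version)
    if minimum is not None and v < parse_version(minimum):
        return False
    if maximum is not None and v >= parse_version(maximum):
        return False
    return True


def check_installed(bounds):
    """Report, per package, whether the installed version is within bounds."""
    report = {}
    for name, (low, high) in bounds.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = within_bounds(installed, low, high)
        report[name] = "ok" if ok else f"{installed} outside bounds"
    return report
```

Calling `check_installed(BOUNDS)` returns a dictionary that could be printed at import time or in CI. In practice the bounds would more likely live in the package's install requirements, so this is only meant to show the idea.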

pjm-usc commented 2 years ago

I believe it is important that each pyCSEP distribution provides a list of package versions that work. The release can leave the packages unspecified, but the user should be able to find a complete list of the combination of versions that work. Maybe this list of versions can be given in a Dockerfile, or with an environment summary from the CI stack.

mherrmann3 commented 2 years ago

Too bad that PEP 665 got rejected. Luckily, there will be a 'take 2'. This discussion links to another approach, which unfortunately is also not implemented yet. So we'll have to wait or do it ourselves.

I had exactly the same thought as Philip: For every release, we (you) could provide a pip freeze / conda env export only for the packages that pycsep requires (maybe within a requirements_pinned.txt/yml). Not elegant, but perhaps sufficient - it's a backup solution in case of dependency issues.
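
As a rough illustration of the "freeze only what pycsep requires" idea, the following sketch writes `name==version` lines for a fixed list of packages using only the standard library. The list `DIRECT_DEPS` and the helper names are assumptions made for this example; the real list would come from pyCSEP's own packaging metadata.

```python
from importlib import metadata

# Hypothetical list of direct dependencies, for illustration only.
DIRECT_DEPS = ["numpy", "scipy", "pandas", "matplotlib"]


def pinned_lines(names):
    """Return 'name==version' lines for the installed packages.

    Packages that are not installed are recorded as a comment line
    instead of raising, so the output still documents the gap.
    """
    lines = []
    for name in names:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed in this environment")
    return lines


def write_pinned(path, names):
    """Write the pinned lines to a requirements-style text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(pinned_lines(names)) + "\n")
```

For example, `write_pinned("requirements_pinned.txt", DIRECT_DEPS)` could run as a release step. Note that this covers only pip-style pins of the listed packages, not their own dependencies.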

wsavran commented 2 years ago

I like that idea as well. This should be done by CI during the release flow and should create and register a docker image of the build. What is the reason behind not pinning all dependencies and only the packages that pycsep requires?

mherrmann3 commented 2 years ago

Great.

What is the reason behind not pinning all dependencies and only the packages that pycsep requires?

My thought was to keep the requirements more compact. But it's possibly a bad idea, since pinning only pycsep's direct dependencies (e.g., numpy) leaves their own dependencies unpinned, which may not guarantee reproducibility as we intend it. So yes, we'll have to report the versions of all packages.
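
To show what "all packages" means in practice, here is a hedged sketch that walks the installed metadata from a set of root packages and collects the versions of everything they pull in. The function names are made up for this example, the requirement parsing is a simple regex rather than a full parser, and environment markers are not evaluated, so some entries may be reported as not installed (`None`) even though they are not needed on the current platform.

```python
import re
from importlib import metadata


def requirement_name(requirement):
    """Extract the distribution name from a requirement string,
    e.g. 'numpy>=1.20; python_version>"3.8"' gives 'numpy'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return match.group(1) if match else None


def normalize(name):
    """Normalize a distribution name (lowercase, runs of -_. become '-')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def transitive_versions(roots):
    """Return {name: version} for the roots and everything they require.

    A value of None means the package is declared somewhere in the
    dependency tree but is not installed in this environment.
    """
    seen = {}
    stack = list(roots)
    while stack:
        name = normalize(stack.pop())
        if name in seen:
            continue
        try:
            dist = metadata.distribution(name)
        except metadata.PackageNotFoundError:
            seen[name] = None
            continue
        seen[name] = dist.version
        for req in dist.requires or []:
            # Skip requirements that only apply to optional extras.
            if re.search(r"extra\s*==", req):
                continue
            dep = requirement_name(req)
            if dep:
                stack.append(dep)
    return seen
```

A call such as `transitive_versions(["pycsep"])` would list every installed package reachable from pycsep. A plain `pip freeze` or `conda env export` gives a similar result for the whole environment, including packages unrelated to pycsep.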

wsavran commented 2 years ago

I ask because that was my first thought as well, and it is how I implemented things for the first iteration of the global experiment. There are caveats, though. A pro of that solution is that it provides cross-platform support, because an exact environment specification cannot guarantee the same environment across different operating systems (different binaries, etc.). I think Docker provides a good solution for cross-platform support.

We could maybe provide both a requirements.yml and an environment.yml, where the former provides pinned deps for pycsep and the latter provides an exact environment that will run on Ubuntu.

mherrmann3 commented 2 years ago

A similar approach is also suggested in this article: Reproducible and upgradable Conda environments with conda-lock

Essentially:

1. we keep the environment.yml clean, with 'versioned direct' dependencies, to be in control of upgrades;
2. to specify a reproducible environment ('transitively pinned'/'locked dependencies'), call conda env export > environment.lock.yml.

BUT, more interestingly, the article proposes a solution to the technical difficulties with conda env export in step 2 (most importantly, the possible cross-platform inconsistencies, which we currently circumvent by using Docker): conda-lock. Basically, it defines a set of URLs to download (which also speeds up installs). Nice: "you can specify which operating system you want to build the lock file for, so you can create a Linux lock file on other operating systems. By default it generates for Linux, macOS, and 64-bit Windows out of the box".

So we can create platform-specific equivalents of environment.lock.yml (e.g., conda-linux-64.lock, conda-osx-64.lock, conda-win-64.lock), which may let us completely abandon Docker (or similar) for reproducibility packages. 🤞

Cool: the conda environment can be created directly from this lock file: conda create --name fromlock --file conda-linux-64.lock.
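
As a small sketch of how a user-facing helper could pick the right lock file, the snippet below maps the current platform to one of the three file names mentioned above and builds the matching `conda create` command without running it. The helper names and the platform table are assumptions for illustration; only the three 64-bit targets named in this thread are covered.

```python
import platform

# File names follow the pattern quoted in this thread.
# Keys are (platform.system(), platform.machine()) pairs.
LOCK_FILES = {
    ("Linux", "x86_64"): "conda-linux-64.lock",
    ("Darwin", "x86_64"): "conda-osx-64.lock",
    ("Windows", "AMD64"): "conda-win-64.lock",
}


def lock_file_for(system=None, machine=None):
    """Return the lock file name for the given (or current) platform."""
    system = system or platform.system()
    machine = machine or platform.machine()
    try:
        return LOCK_FILES[(system, machine)]
    except KeyError:
        raise RuntimeError(f"no lock file for {system}/{machine}") from None


def create_command(env_name, system=None, machine=None):
    """Build the 'conda create' argument list; the caller decides
    whether to execute it (e.g. via subprocess.run)."""
    lock_file = lock_file_for(system, machine)
    return ["conda", "create", "--name", env_name, "--file", lock_file]
```

For instance, `create_command("fromlock", "Linux", "x86_64")` yields the same command shown above as a list of arguments. Returning the command rather than running it keeps the helper easy to test and leaves the decision to the user.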

wsavran commented 2 years ago

I like that approach of creating platform-specific lock files. We can just do that on release and then provide folks with a reproducible installation. I still think Docker is a solid tool for sharing environments, but conda-lock is worth exploring. If we can provide some of the leg work in setting up Dockerfiles, or at least a template, I'm pretty happy with the tool. There is also a tool called repo2docker, provided by Jupyter, that we can explore as well.

pabloitu commented 2 years ago

I wonder if we could provide wheels for the required dependencies, which could also deal with cartopy/pygeos issues.

SCECcode / pycsep

Dealing with dependencies in pyCSEP #192