MilesCranmer / PySR

High-Performance Symbolic Regression in Python and Julia
https://ai.damtp.cam.ac.uk/pysr
Apache License 2.0
2.44k stars 217 forks source link

Apptainer definition file for PySR #687

Closed wkharold closed 3 months ago

wkharold commented 3 months ago

fix: 686

This pull request adds an Apptainer definition file to PySR. Researchers can use it to build an Apptainer container that packages PySR and its dependencies (via a conda environment).

MilesCranmer commented 3 months ago

Cool, thanks!

Since we have a Dockerfile already, could the Apptainer file be the same as that one, but in Apptainer syntax? I would prefer to avoid having two different images. See https://github.com/MilesCranmer/PySR/blob/master/Dockerfile

Also, would be nice to have some continuous integration tests like we have for the docker file.

wkharold commented 3 months ago

Since we have a Dockerfile already, could the Apptainer file be the same as that one, but in Apptainer syntax?

I had some trouble with that approach initially so I went with conda. Happily, what I learned doing the conda version allowed me to switch back to pip3 and make it more like the Dockerfile. I'll work on adding an integration test that mimics the Docker one.

MilesCranmer commented 3 months ago

Thanks, could it also pull from the python-bullseye image rather than rockylinux image? Since the base image is hosted by docker I don’t see a compelling reason for the base images to be different

wkharold commented 3 months ago

Switched to bullseye for compatibility with the Dockerfile. I was using Rocky because that's what SchedMD uses as the base for the Slurm images built for Google's Cluster Toolkit (formerly HPC Toolkit). One my goals is to be able to bind mount the necessary Slurm commands and libs into a container so that you can use the Julia ClusterManagers interface to do distributed SR from within the container. Maybe your already know how to do that with Docker?

coveralls commented 3 months ago

Pull Request Test Coverage Report for Build 10345006057

Details


Totals Coverage Status
Change from base Build 10344709906: 0.0%
Covered Lines: 1135
Relevant Lines: 1211

đź’› - Coveralls
MilesCranmer commented 3 months ago

Switched to bullseye for compatibility with the Dockerfile. I was using Rocky because that's what SchedMD uses as the base for the Slurm images built for Google's Cluster Toolkit (formerly HPC Toolkit). One my goals is to be able to bind mount the necessary Slurm commands and libs into a container so that you can use the Julia ClusterManagers interface to do distributed SR from within the container. Maybe your already know how to do that with Docker?

I see, thanks. This is a good use-case. Since I have no particular need to use bullseye, so maybe we could switch the defaults of both the Dockerfile and the Apptainer to use Rocky? You can change the default here: https://github.com/MilesCranmer/PySR/blob/3aee19e38ceb3e0e1617d357a831400e01204658/Dockerfile#L6

MilesCranmer commented 3 months ago

One my goals is to be able to bind mount the necessary Slurm commands and libs into a container so that you can use the Julia ClusterManagers interface to do distributed SR from within the container.

I do Slurm stuff with PySR, but haven't done it from a container before. In principle it should be fine, I just haven't tried yet.

One other thing that is needed is a function to install all packages, even the Julia extensions like for ClusterManagers.jl. I just made a PR https://github.com/MilesCranmer/PySR/pull/688 to add this function. So if your apptainer file calls pysr.load_all_packages(), all the Julia packages ever needed by PySR (including ClusterManagers.jl) will be installed into the read-only environment.

MilesCranmer commented 3 months ago

Ok I merged https://github.com/MilesCranmer/PySR/pull/688 which will help for your stuff, do you want to merge those changes into your branch?

wkharold commented 3 months ago

I see, thanks. This is a good use-case. Since I have no particular need to use bullseye, so maybe we could switch the defaults of both the Dockerfile and the Apptainer to use Rocky? You can change the default here:

It doesn't look like there are Rocky based containers for Julia on Docker Hub. I think bullseye is OK for multicore work. Transitioning to Rocky would require significant changes and testing on Slurm to ensure that it worked properly.

MilesCranmer commented 3 months ago

Oh sorry I thought you meant you wanted to switch to Rocky. I'm happy with Bullseye too though. As long as both of the container types can have overall similar images.

wkharold commented 3 months ago

This is my first foray into Github CI testing. Would you prefer a modified CI_docker.yml file or a new CI_apptainer.yml file. My naive though is that the latter would be simpler but I could easily be missing something.

wkharold commented 3 months ago

Scanning the test failures it looks like most/all of them are the result of the tests being run in a read-only directory. To avoid a version module not found error I cd into /pysr (which is in the container's read-only file system) before doing anything involving pysr. Is there a python environment variable I could set to avoid having to do that?

MilesCranmer commented 3 months ago

Is there a way you can mount the directory it uses for testing to a writeable directory?

MilesCranmer commented 3 months ago

I tried this on my institutional cluster to see if I could use it locally, but it failed with:

> apptainer build pysr.sif Apptainer.def
INFO:    User not listed in /etc/subuid, trying root-mapped namespace
INFO:    fakeroot command not found
INFO:    Installing some packages may fail
FATAL:   Unable to build from Apptainer.def: while parsing definition: Apptainer.def: invalid section(s) specified: arguments

Is the arguments section new on 1.3.0` maybe? My institute looks to have 1.1.9. Maybe we could have compatibility for 1.1.0 (or even 1.0.0) to have broader support?

wkharold commented 3 months ago

Is the arguments section new on 1.3.0` maybe? My institute looks to have 1.1.9. Maybe we could have compatibility for 1.1.0 (or even 1.0.0) to have broader support?

It got introduced in 1.2 as part of the --build-arg / --build-arg-file feature. I don't have a problem with hardcoding the Julia/Python/distro bits to make it work with 1.x.

Lots of improvements have gone in since 1.1. You should encourage your institute sys admins to upgrade.

wkharold commented 3 months ago

I thought adding PYTHONPATH=/psyr had fixed the

ModuleNotFoundError: No module named 'pysr.version'

error. I no longer see it in my local builds or tests.

MilesCranmer commented 3 months ago

Lots of improvements have gone in since 1.1. You should encourage your institute sys admins to upgrade.

Good point. I asked them and they are planning on it in the few month timescale. I guess there will always be some clusters using older versions, so we might as well make it compatible (within reason). Maybe 1.1 is a good lower limit for now (until someone makes a feature request for even earlier)

MilesCranmer commented 3 months ago

The error about version.py missing indicates that it might not be getting installed to the directory we expect. Or maybe when we import pysr, it is importing from the local directory, rather than site-packages. Basically, version.py only gets created when one runs python setup.py install. (And it only gets put into the site-packages folder of PySR, rather than into the local directory)

wkharold commented 3 months ago

I looked in the container and version.py ends up in /pysr/pysr. I added that to PYTHONPATH but the test still fails. I've copied the command from the log and pasted it into my shell and it works. There must be something different about the github environment.

MilesCranmer commented 3 months ago

The GitHub actions logs show:

Installing collected packages: fastjsonschema, tornado, rpds-py, pyzmq, psutil, pluggy, platformdirs, nest-asyncio, iniconfig, debugpy, coverage, comm, attrs, referencing, pytest, jupyter-core, jupyter-client, jsonschema-specifications, jsonschema, ipykernel, nbformat, nbval
  WARNING: The script debugpy is installed in '/home/runner/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts coverage, coverage-3.12 and coverage3 are installed in '/home/runner/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts py.test and pytest are installed in '/home/runner/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts jupyter, jupyter-migrate and jupyter-troubleshoot are installed in '/home/runner/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts jupyter-kernel, jupyter-kernelspec and jupyter-run are installed in '/home/runner/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script jsonschema is installed in '/home/runner/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script jupyter-trust is installed in '/home/runner/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

which is odd, I thought it would install to paths within the image? Or are there other paths that need to be set before the pip installs?

e.g., maybe it just copies the definition of HOME from the host environment?

wkharold commented 3 months ago

Apptainer mounts the home directory of its initiator by default so those messages make sense but I didn't think about adding /home/runner/.local/bin to PATH. I'll add that and see if it changes things.

MilesCranmer commented 3 months ago

Oh I see. But shouldn’t it be installed inside the container rather than the user’s home directory (which thus wouldn’t be saved in the container)?

MilesCranmer commented 3 months ago

Maybe the Python pip install is trying to install to the home dir rather than an env?

MilesCranmer commented 3 months ago

Looks like we can find this with python -m site: https://stackoverflow.com/a/46071447

wkharold commented 3 months ago

I looked a little closer at my local environment and realized that I don't run the tests in the PySR directory. When I did that I reproduced the failure we're seeing, ugh. So I added steps to create a new directory specifically for testing in the CI action. I also experimented with the Apptainer %test section of the definition file and it seemed to simplify things so added that and changed CI_apptainer.yml to use it.