amusecode / amuse

Astrophysical Multipurpose Software Environment. This is the main repository for AMUSE
http://www.amusecode.org
Apache License 2.0

Build system needs replacing #1024

Open LourensVeen opened 7 months ago

LourensVeen commented 7 months ago

Issue

The current build system for AMUSE is based on setuptools' custom build commands feature, which is obsolete and in the process of being removed. To keep AMUSE from becoming unbuildable in the near future, we'll have to create a new build system. Additionally, I've been trying to build conda packages for AMUSE and keep running into issues with the build system there as well, so that should be addressed at the same time.

The current build system comprises a global GNU Autoconf setup for detecting dependencies and a top-level Makefile that mostly just calls setup.py, which in turn uses a large amount of custom Python code in support/. Community codes have an outer Makefile in src/amuse/community/<code>/ and usually another build system (the original one that came with the code) inside a src/ subdirectory. Some codes are downloaded at build time using one of several copies of a download.py script. Altogether, this builds the libraries in lib/ and the community code workers, and installs the Python code for the framework and for the individual community codes.

Most of the actual work is done by the code in support/, which is rather complex and difficult to understand. Setuptools' inheritance-based architecture definitely doesn't help here either, but its deprecation means that that's about to be solved :smile:.

For all the above reasons, we need a new system.

Usage scenarios

There are a few distinct usage scenarios for AMUSE, all of which need to be supported well by the build system.

Basic usage

In this scenario, the user installs AMUSE locally and uses it to run simulations without changing AMUSE itself. This typically occurs during organised courses, but also during student projects and even more advanced research projects that don't require large amounts of compute power. Often this is CPU-only, but if there is a GPU available then it should be usable. MPI may or may not be available.

For this scenario, AMUSE must be installable with a single command, preferably using Conda as it is the most widely used suitable package manager, or by building from source if necessary. GNU/Linux, macOS and WSL2 need to be supported and work out of the box (almost) every time.
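Concretely, the goal is that an install reduces to something like the following; whether AMUSE actually ends up available under these names is an open question, so the commands are aspirational rather than working today:

```shell
# Preferred: one command via Conda (package name assumed)
conda install -c conda-forge amuse
# Fallback: build from source via pip
pip install amuse
```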

HPC usage

For larger simulations, an HPC machine may be required. More and more of these machines (and the workstations in Leiden too) use EasyBuild to install software, so an EasyBuild configuration should probably be provided, and we need a build system that works correctly in an EasyBuild environment. Despite its name, EasyBuild is not very easy to use, and for an end user with no EasyBuild experience, building from source is probably easier, although there is the dependency issue to contend with.
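For illustration only, an easyconfig for AMUSE might look roughly like this (easyconfigs use Python syntax; the easyblock, toolchain and all versions below are assumptions, not a tested configuration):

```python
# Hypothetical AMUSE easyconfig sketch; everything here is illustrative.
easyblock = 'PythonPackage'

name = 'AMUSE'
version = '2024.6.0'

homepage = 'https://www.amusecode.org'
description = "Astrophysical Multipurpose Software Environment"

toolchain = {'name': 'foss', 'version': '2023a'}

dependencies = [('Python', '3.11.3')]

use_pip = True
sanity_pip_check = True
```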

Conda may be an option here, as it has some support for using the local MPI, but getting this to work may be trickier than doing a source install. These environments pretty much all run Linux, and increasingly have GPUs that should be usable when available. Some community codes can use non-CUDA GPUs, which are becoming more common, so those should work too.

Community code development

In this scenario, the user installs AMUSE with the intent of using most of the system unchanged, except for a single community code which needs to be added, extended, improved or fixed. To be able to work efficiently, the user should need to run at most one command to recompile their code and make the new version available in a Python environment (virtualenv or conda), so that the changes can be tested quickly.

Note that the community code may live in the AMUSE repository or somewhere else (e.g. in OMUSE or in the upstream community code repository), but the worker still needs to link against the libraries in the framework. This should also interoperate with package managers: you could e.g. create a conda environment, conda install amuse-framework into it, and with that environment active easily build and install the community code into the same environment.
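For example, the intended workflow could look something like this (the package names and Makefile targets here are illustrative, not final):

```shell
conda create -n amuse-dev python=3.11
conda activate amuse-dev
conda install -c conda-forge amuse-framework  # assumes such a package exists

cd src/amuse/community/mycode
make install   # one command: rebuild the worker against the installed
               # framework and make it available in the active environment
```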

Requirements

The requirements seem to be:

Design

The new build system should be driven by a top-level Makefile, which should reduce a basic install to a single command. Make will call pip where needed, rather than the other way around. The central Makefile can be used to build community codes: all of them, subsets suited to the available hardware, or individual ones. Individual community codes can also be built and installed using the Makefile in their own directory; the central Makefile calls these per-code Makefiles to build them, just as the Python build system currently does.

Targets for community code Makefiles are currently somewhat standardised; this should be cleaned up and tightened a bit. If different versions of a worker can be built (e.g. CPU and GPU, or CUDA and OpenCL), then detecting the environment and building the appropriate workers should be done inside the community code build system, not by the central Makefile.
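As an illustration, here is a minimal sketch of such a top-level Makefile. The directory layout is the current one, but the target names and pip invocations are illustrative, not the actual implementation:

```makefile
# Sketch of the proposed top-level Makefile (illustrative names only).

# One target per community code directory.
COMMUNITY_CODES := $(notdir $(wildcard src/amuse/community/*))

.PHONY: all framework $(COMMUNITY_CODES)

all: framework $(COMMUNITY_CODES)

# Make drives pip, not the other way around.
framework:
	pip install .

# Delegate to each code's own Makefile, which decides which worker
# variants (CPU/GPU, CUDA/OpenCL) the environment supports.
$(COMMUNITY_CODES): framework
	$(MAKE) -C src/amuse/community/$@ install
```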

Autoconf seems to be working okay at the moment, so it would be good to keep it, but we'd like the community codes to have their own autoconf setups that require only the dependencies of that particular code (https://github.com/amusecode/amuse/issues/577, https://github.com/amusecode/amuse/issues/583). Currently the resulting config.mk is stored in amuse and retrieved later by the build system. This causes no end of problems when building separate packages, so it needs to go, but I'm not yet sure whether that information is only used for building the community codes, or also for other things. To be investigated.
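For example, a per-code configure.ac could be as small as this sketch, checking only that code's own dependencies and writing a local config.mk (AX_MPI is the MPI macro from the Autoconf Archive; all names here are illustrative):

```autoconf
# Hypothetical configure.ac for a single community code.
AC_INIT([mycode-worker], [1.0])
AC_PROG_CXX
AX_MPI([], [AC_MSG_ERROR([an MPI compiler is required to build this worker])])
AC_CONFIG_FILES([config.mk])
AC_OUTPUT
```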

The worker binaries should not be built by the Python package build; instead, they are built first by make, which then calls pip to install the Python bits, the worker, and anything else needed (e.g. data). Community codes should have their own pyproject.toml in the src/amuse/community/<code>/ directory, so that a Python package for just that code can be built from that directory. This also gets rid of the separate packages/ directory. The framework will have its own separate pyproject.toml.
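With the worker already built by make, the per-code Python package only has to ship the interface and the prebuilt worker, so its pyproject.toml can stay small. A sketch (the setuptools backend and all names and versions are assumptions, not settled choices):

```toml
# Hypothetical pyproject.toml for a single community code.
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "amuse-mycode"
version = "2024.6.0"
dependencies = ["amuse-framework"]
```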

I think amusifier generates some files to start from when adding a new community code. I need to look into this, but those templates will probably have to be updated to match the new system, and some files may have to be added (configure.in, pyproject.toml). Needs investigation.

LourensVeen commented 6 months ago

Update: I'm working on a prototype for this, and I've done some more thinking, so I have two more things to report.

First, the existing build system tries to build everything, fails for some community codes, and then assumes that that's okay because you presumably lack a GPU or some other prerequisite. This makes it difficult to distinguish expected failures from actual bugs. It would be good if the central build system could inspect the environment and determine which community codes we expect to be able to build; then we can try to build those, and if they fail, actually report that as an error. My prototype now has a mechanism where each community code has a metadata file specifying what it needs, which the central build system picks up to decide what to build. I'm working on a central autoconf setup that detects the environment and integrates with this.
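For concreteness, such a metadata file and the corresponding selection logic in the central build system might look roughly like this; the actual format in my prototype may well differ:

```makefile
# Hypothetical src/amuse/community/<code>/depends.mk:
NEEDS_MPI  := yes
NEEDS_CUDA := yes

# The central Makefile could then filter on what autoconf detected:
ifneq ($(HAVE_CUDA),yes)
  EXPECTED_CODES := $(filter-out $(CUDA_CODES),$(ALL_CODES))
endif
```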

One open question is whether the inability to build a particular community code due to a missing library should count as an error. Or, if we're doing Conda, maybe we should just install the missing dependency automatically? That goes back to the installer idea. To be continued.

Second, for packaging: we have community codes with both a CPU worker and a GPU worker, and of course they also have a Python interface that you need regardless of which worker you use. The question is how to package this. We could put the interface and the CPU worker in a code_cpu package and the GPU worker in a code_gpu package that depends on code_cpu, so we get the Python interface too. But if a code requires a GPU, there would be only a code_gpu package, which would then have to include the Python interface itself, and the whole scheme becomes inconsistent.

So should there be three packages, code_interface, code_cpu and code_gpu, with appropriate dependencies? That's clean and consistent, but we'll end up with a huge pile of packages. Having a single code package that contains both workers is not an option, because it would have to depend on CUDA, and then you'd need CUDA installed even if you don't have a GPU. Maybe a galaxy's worth of packages is the least bad option...
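In the three-package variant, the GPU package contains only the GPU worker and pulls in the interface as a dependency, along the lines of this sketch (all names hypothetical):

```toml
# Hypothetical pyproject.toml for the GPU-only worker package.
[project]
name = "amuse-mycode-gpu"
version = "1.0.0"
dependencies = [
    "amuse-mycode-interface",  # the shared Python interface
    # plus an appropriate CUDA runtime dependency
]
```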

LourensVeen commented 6 months ago

I've been prototyping away on this for a bit, and I've put that in a temporary repository at https://github.com/LourensVeen/build-system-prototype for anyone who wants to look over my shoulder.

LourensVeen commented 5 months ago

There is now a branch for this that I'm pushing to.

Approximate to-do:

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 28 days if no further activity occurs. Thank you for your contributions.