easybuilders / easybuild-framework

EasyBuild is a software installation framework in Python that allows you to install software in a structured and robust way.
https://easybuild.io
GNU General Public License v2.0

`minimal-toolchains` looks at recipes and not at what is installed #2998

Open mboisson opened 5 years ago

mboisson commented 5 years ago

When using minimal-toolchains, EasyBuild looks for an EasyConfig, without validating that there is a corresponding installed module. For example, we have a recipe that depends on

    ('METIS', '5.1.0'),

EasyBuild finds that the git repository has a recipe

METIS-5.1.0-GCCcore-7.3.0.eb

This recipe is however not installed on our system. We have

METIS-5.1.0-GCC-7.3.0.eb

installed instead. That recipe is compatible with the one we are trying to install now (which uses gompi 2018.3.312, which includes GCC 7.3.0).

With --minimal-toolchains, it barfs on the following :

== Running parse hook for METIS-5.1.0-GCC-7.3.0.eb...
== Running parse hook for SCOTCH-6.0.6-gompi-2018.3.312.eb...
== processing EasyBuild easyconfig /cvmfs/soft.computecanada.ca/easybuild/easyconfigs/o/OpenFOAM/OpenFOAM-v1812-gompi-2018.3.312.eb
== building and installing avx2/MPI/gcc7.3/openmpi3.1/openfoam/v1812...
== fetching files [skipped]
== creating build dir, resetting environment...
== backup of existing module file stored at /cvmfs/soft.computecanada.ca/easybuild/modules/2017/avx2/MPI/gcc7.3/openmpi3.1/openfoam/v1812.bak_20190912155905_5432
== unpacking [skipped]
== patching [skipped]
== preparing...
== FAILED: Installation ended unsuccessfully (build directory: /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/MPI/gcc7.3/openmpi3.1/openfoam/v1812): build failed (first 300 chars): Missing modules for dependencies (use --robot?): Core/metis/5.1.0
== Results of the build can be found in the log file(s) /tmp/eb-9upFqW/easybuild-OpenFOAM-v1812-20190912.155905.JgRvr.log
ERROR: Build of /cvmfs/soft.computecanada.ca/easybuild/easyconfigs/o/OpenFOAM/OpenFOAM-v1812-gompi-2018.3.312.eb failed (err: 'build failed (first 300 chars): Missing modules for dependencies (use --robot?): Core/metis/5.1.0')

@bartoldeman

ocaisa commented 5 years ago

I think --use-existing-modules would solve this for you.

mboisson commented 5 years ago

That option indeed solves the issue. What other side effect does --use-existing-modules have though ?

ocaisa commented 5 years ago

None, it just means if you have an existing installation higher up the hierarchy it will prefer it.

It was created for zlib, which you would normally have at GCCcore but you also might want at iccifort level since it is highly tuned there.
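As a rough illustration of the behaviour described in this thread, here is a toy model in Python. It is not EasyBuild's actual resolution code: the function name `resolve_dep_toolchain`, the `HIERARCHY` list, and the set-based inputs are all hypothetical, and the "prefer the highest installed toolchain" rule simply follows the description given in the comments here.

```python
# Toy model only; none of these names come from EasyBuild itself.
# Toolchain hierarchy of the parent (gompi), ordered from lowest to highest.
HIERARCHY = ["GCCcore", "GCC", "gompi"]


def resolve_dep_toolchain(easyconfigs, installed, use_existing_modules=False):
    """Pick the toolchain used to resolve one dependency.

    easyconfigs: toolchain names for which a recipe exists in the repository
    installed:   toolchain names for which a module is already installed
    """
    if use_existing_modules:
        # Prefer an installed module, searching from the top of the hierarchy
        # down (assumption taken from the description in this thread).
        for tc in reversed(HIERARCHY):
            if tc in installed:
                return tc
    # Otherwise take the lowest toolchain that has a recipe, installed or not.
    for tc in HIERARCHY:
        if tc in easyconfigs:
            return tc
    return None


# Scenario from this issue: recipes exist for GCCcore and GCC, only GCC is installed.
recipes = {"GCCcore", "GCC"}
modules = {"GCC"}
default_choice = resolve_dep_toolchain(recipes, modules)
existing_choice = resolve_dep_toolchain(recipes, modules, use_existing_modules=True)
print(default_choice)   # GCCcore: not installed, so a build without --robot fails
print(existing_choice)  # GCC: the installed module is reused
```

In this model the default picks the most minimal recipe regardless of what is installed, which matches the failure reported above when --robot is not used.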

mboisson commented 5 years ago

Does it still respect minimal toolchains within the installed modules ? Say there is a version with MPI and one without (for whatever reason).

ocaisa commented 5 years ago

In a hierarchy that's the same difference as GCCcore/iccifort so no, it would choose the MPI version (as long as the target software is using the MPI toolchain or higher). If there is some strange corner case where that's not appropriate, you could always explicitly indicate the required toolchain in the easyconfig.
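For reference, explicitly indicating the required toolchain for a dependency might look like the fragment below. This is a sketch of a hypothetical easyconfig, not the actual OpenFOAM recipe; it assumes the four-element dependency form (name, version, versionsuffix, toolchain) and reuses the toolchain names mentioned in this issue.

```python
# Fragment of a hypothetical easyconfig (easyconfigs use Python syntax).
toolchain = {'name': 'gompi', 'version': '2018.3.312'}

dependencies = [
    # (name, version, versionsuffix, toolchain):
    # the fourth element pins METIS to the GCC-level build instead of
    # letting dependency resolution pick a toolchain.
    ('METIS', '5.1.0', '', ('GCC', '7.3.0')),
]
```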

mboisson commented 5 years ago

Ok. We can use use-existing-modules as a workaround, but I would argue this should be the default behavior, not an exception.

ocaisa commented 5 years ago

v4.0 is the time to make your case ;)

mboisson commented 5 years ago

@boegel, thoughts ?

mboisson commented 5 years ago

Without --use-existing-modules the --minimal-toolchains feature is broken as soon as a recipe that has a lower toolchain appears in the easyconfig repository, even if that recipe is not installed on the current system.

ocaisa commented 5 years ago

Actually, that doesn't sound right. It doesn't/shouldn't fail, it will just resolve to the GCCcore and then install it (if you use the robot). It's not true that it will fail once another recipe appears; in fact, that situation is needed for use-existing-modules to do anything useful.

I see in your error it's looking for Core/metis/5.1.0, so with the system toolchain. That seems odd, what does a dry run look like, can that resolve?

mboisson commented 5 years ago

Correction. It will fail if you don't use --robot (we never use --robot).

mboisson commented 5 years ago

I guess my point is that "minimal toolchain" should not depend on existing recipes (i.e. in the repo), but rather on installed recipes if a match is found. Otherwise, if a new recipe is created at the level of a lower toolchain, it will break existing recipes until that new recipe is installed.

akesandgren commented 5 years ago

Is that what you want locally? Because here we definitely want it to find the minimal-toolchain recipe and build it... (using --robot of course)

ocaisa commented 5 years ago

The problem with setting use-existing-modules on by default is that it makes the order of commands matter for the end result. If something depends on zlib, which has easyconfigs for GCCcore and iccifort, I will get a different result with the same command depending on whether or not the iccifort zlib is installed.

For that reason I think the default as is is correct, and that it really is an "expert mode" option.

mboisson commented 5 years ago

@akesandgren yes. Here, we don't want to build software that is not needed. If there is already a match installed, it should get used. It should not install a new recipe that is "more minimal" just because somebody created a recipe for it.

@ocaisa, the problem with not having use-existing-modules is that the result will vary over time, based on what new recipe is added to the git repository. Once somebody adds METIS-5.1.0-GCCcore-7.3.0.eb there is no going back to using the (in my opinion) correct version METIS-5.1.0-GCC-7.3.0.eb and METIS-5.1.0-iccifort-2018.3.eb. This is precisely what happened to me. A recipe that used to work now ended up not working because it suddenly found METIS-5.1.0-GCCcore-7.3.0.eb that did not exist in the past.

ocaisa commented 5 years ago

@mboisson True. That is also true without the use of minimal-toolchains: if Metis at GCCcore existed first and afterwards we add Metis at GCC level, then the default resolution mechanism would prefer the GCC version even if it is not installed. The only way to avoid this would be to restrict your robot-paths to the installed easyconfigs (we do this at JSC). You can still add new easyconfigs to your search-paths (introduced in #2255).

mboisson commented 5 years ago

I would argue that when --robot is not used, it should by default resolve to what is installed, it should not consider what is possible ?

ocaisa commented 5 years ago

You could potentially do that but you would have to assume that what lives in your eb_repo is an accurate reflection of what is installed (or indicate the "golden" repo in some other way, which we do via controlling robot-paths)

mboisson commented 5 years ago

True. In our case, eb_repo is intended to represent what is installed, and it is pretty much the case (I'd say with 99% accuracy).

boegel commented 5 years ago

I'm not convinced that we should change the default for --use-existing-modules, regardless of whether --minimal-toolchains is used or not, since you could argue for both having it enabled and disabled by default; it depends on your expectations.

Note that also when --use-existing-modules is enabled, the result could vary over time, since installing additional modules high/low in the hierarchy could affect which exact modules are used to resolve a dependency.

There simply isn't a good default for this, but we have to pick a default. We picked to not have it enabled by default, so available easyconfig files overrule available modules, and it's easy to flip that around if you want to. I don't think we can do much better...

mboisson commented 5 years ago

But don't you think that the current default is broken when you are not using --robot ? My expectation is that adding a new recipe in the repository should not break a recipe that once worked.

mboisson commented 5 years ago

Another of our staff ran into this issue today. An alternative solution is to say that METIS should not have been merged with GCCcore, and remove that recipe. GCCcore does not benefit from all optimizations (at least in our case, maybe it is not the case for others ?), hence it produces a METIS which is under-performing.