access-ci-org / ipf

The Information Publishing Framework for publishing HPC/HTC resource information to ACCESS-CI Information Services.
Apache License 2.0

Extmodules workflow misidentifies name/version in certain filesystem hierarchies #2

Open ericblau opened 1 year ago

ericblau commented 1 year ago

The Extmodules workflow assumes that modulefiles are stored in a directory structure where, at the leaf directories, the filename is the version, and the directory in which the file resides is the package name.

This is true in many cases, and irrelevant in many others, because IPF overrides the name/version taken from the filesystem whenever it can discover Name and Version key/value pairs inside the files. However, there are some files, at least on Expanse, such as the Bright Cluster Manager CUDA Toolkit, where the file is at: /cm/shared/modulefiles/cuda11.7/toolkit/11.7.1

so IPF identifies the "Name" as "toolkit".
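To make the failure concrete, here is a minimal sketch of the leaf-directory heuristic described above. It is not IPF's actual code; the function name `infer_name_version` and the `/opt/modulefiles/gcc/10.2.0` path are made up for illustration.

```python
from pathlib import PurePosixPath


def infer_name_version(modulefile_path):
    """Leaf-directory heuristic: the file name is the version,
    and the directory containing the file is the package name."""
    path = PurePosixPath(modulefile_path)
    return path.parent.name, path.name


# A conventional name/version layout (hypothetical path) works as expected.
print(infer_name_version("/opt/modulefiles/gcc/10.2.0"))
# -> ('gcc', '10.2.0')

# The Expanse path from this issue has an extra directory level,
# so the heuristic picks "toolkit" as the name.
print(infer_name_version("/cm/shared/modulefiles/cuda11.7/toolkit/11.7.1"))
# -> ('toolkit', '11.7.1')
```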

This could be ameliorated if comments could be added to the module file, including a Name key/value pair. But it would be good if IPF could figure this situation out itself.
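As a rough sketch of that workaround: the snippet below scans a module file's text for Name/Version key/value pairs in comments and lets them override the values inferred from the path. The comment syntax (`# Name: ...` or `-- Name: ...`), the function name `apply_overrides`, and the name `cuda-toolkit` are assumptions for illustration; the syntax IPF actually recognizes may differ.

```python
import re

# ASSUMED syntax: a Tcl ("#") or Lua ("--") comment holding "Key: value".
# This is illustrative only, not IPF's documented format.
KEY_VALUE = re.compile(r"^\s*(?:#|--)\s*(Name|Version)\s*:\s*(\S.*?)\s*$")


def apply_overrides(name, version, modulefile_text):
    """Return (name, version), preferring key/value pairs found in the file."""
    found = {"Name": name, "Version": version}
    for line in modulefile_text.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found["Name"], found["Version"]


# Hypothetical module file content with a Name override added by the site admin.
sample = "#%Module1.0\n# Name: cuda-toolkit\nprepend-path PATH /some/bin\n"
print(apply_overrides("toolkit", "11.7.1", sample))
# -> ('cuda-toolkit', '11.7.1')
```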

One possibility would be a rework of how the Extmodules flow traverses subdirectories of the directories in the MODULEPATH. This may require a better understanding of all the ways MODULEPATHs and module file hierarchies are set up in practice.

Another possibility is to try to integrate lmod's "spider" command into the workflow, though I don't think we would be able to rely solely on spider, as it doesn't understand various key/value pairs that XSEDE/ACCESS have defined as extensions for more specific, detailed information. But it might be possible to get a list of modules from spider and match them up to their module files for extra info extraction.

tcooper commented 1 year ago

Some additional details:

This means IPF needs to be able to collect information about modulefiles in a way that will support either Lmod or environment-modules usage on the resource. It should also support both Lua and TCL modulefiles and not necessarily assume either is being used exclusively regardless of which module environment is being used.
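One way to avoid assuming a single modulefile language is to classify each file individually. The sketch below relies on two common conventions: Lmod's Lua modulefiles carry a `.lua` extension, and TCL modulefiles begin with the `#%Module` magic cookie. The function name `classify_modulefile` is hypothetical, and this is not how IPF is known to do it.

```python
def classify_modulefile(filename, first_line):
    """Guess the modulefile language from its name and first line.

    Returns "lua", "tcl", or "unknown".
    """
    if filename.endswith(".lua"):
        return "lua"
    if first_line.startswith("#%Module"):
        return "tcl"
    return "unknown"


print(classify_modulefile("11.7.1", "#%Module1.0"))       # -> tcl
print(classify_modulefile("11.7.1.lua", "help([[...]])"))  # -> lua
print(classify_modulefile("README", "Notes for admins"))   # -> unknown
```

Because the check is per file, a directory tree that mixes both kinds of modulefile is handled without any global setting.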

ericblau commented 9 months ago

I've developed and am currently testing what I think may be a solution to this problem.

Fundamentally, the extmodules workflow of IPF was failing to parse the filesystem hierarchy of TCL and/or Lua module files in the same way that lmod does, leading to incorrect names/versions for modules.

The obvious way around this is to use lmod itself as the canonical source for name/version, instead of trying to infer them from the filesystem hierarchy. However, the various lmod commands (module avail, spider, etc.) didn't return all the needed information, and what they did return was not in a particularly useful format.

Then I stumbled upon lmod's Spider System Cache. One can use "update_lmod_system_cache_files" to create a lua cache file that contains all the info that spider knows about the modules on the system.

Thus, my path forward was to add a "lmod_cache_file" parameter to the ExtendedModApplicationsStep of IPF. IPF uses the lupa Python package to execute the Lua code from the lmod_cache_file, then converts the spiderT table to a Python dict. It then walks the (converted) spiderT dict looking for fileT or metaModuleT structures. For each module name in each fileT, it adds the "fn" field as a module file that IPF should look at (along with the module name and the "Version" field). IPF then opens each module file to check whether there are keyword overrides for any fields, and publishes the module.
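The traversal described above might look roughly like the sketch below. It is not the code in the lmod_cache branch: it starts from an already-converted Python dict rather than calling lupa, the function name `iter_module_files` is made up, and the sample `spider_t` is hand-written for illustration, not copied from a real cache file. The real spiderT layout has more fields and may differ between lmod versions.

```python
def iter_module_files(node):
    """Recursively yield (module_name, version, fn) for every fileT entry
    found anywhere inside a spiderT-like dict."""
    if not isinstance(node, dict):
        return
    file_table = node.get("fileT")
    if isinstance(file_table, dict):
        for module_name, entry in file_table.items():
            # Only entries that point at a module file are useful here.
            if isinstance(entry, dict) and "fn" in entry:
                yield module_name, entry.get("Version"), entry["fn"]
    # Keep descending; other keys may hold nested tables with their own fileT.
    for key, child in node.items():
        if key != "fileT":
            yield from iter_module_files(child)


# Hand-written stand-in for a converted spiderT (assumed shape).
spider_t = {
    "/cm/shared/modulefiles": {
        "cuda11.7/toolkit": {
            "fileT": {
                "cuda11.7/toolkit/11.7.1": {
                    "Version": "11.7.1",
                    "fn": "/cm/shared/modulefiles/cuda11.7/toolkit/11.7.1",
                },
            },
        },
    },
}

for module_name, version, fn in iter_module_files(spider_t):
    print(module_name, version, fn)
```

Each yielded "fn" is a module file that can then be opened to look for keyword overrides. metaModuleT handling is left out of this sketch.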

This all appears to work, and appears to solve the issue. Some care will have to be taken to ensure that the lmod_cache_file is created with the desired MODULEPATH, and is recreated with sufficient periodicity.

If anyone has feedback, specifically with regard to how I am interpreting the spiderT table (or whether it is likely to be substantively different for different versions of lmod), it is appreciated. So far, I have only run on Expanse, so there may well be variations I haven't encountered yet.

ericblau commented 9 months ago

The code that I am currently testing is in the lmod_cache branch of this repo.

tcooper commented 9 months ago

@ericblau Thanks for the update. I will share with our team.

ericblau commented 8 months ago

I believe that this issue is fixed, as of commit c31125e, and release 1.7.1. As a note, the lmod_cache code is not part of 1.7.1, because the default behavior (without using lmod_cache) now addresses the issue for both modules and lmod.

tcooper commented 8 months ago

Thanks for the update @ericblau. I'll let our team know.