EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

Bot-specific `SitePackage.lua` that solves `libfabric` issues #531

Open bedroge opened 7 months ago

bedroge commented 7 months ago

With help from @casparvl, I've added the following to `/project/def-users/bot/shared/host-injections/2023.06/.lmod/SitePackage.lua` on our AWS build cluster. The bot will pick it up for builds that rely on `libfabric`:

require("strict")
local hook = require("Hook")

-- LmodMessage("Load bot-specific SitePackage.lua")

local function eessi_bot_libfabric_set_psm3_devices_hook(t)
    local simpleName = string.match(t.modFullName, "(.-)/")
    -- we may want to be more specific in the future, and only do this for specific versions of libfabric
    if simpleName == 'libfabric' then
        -- set environment variable PSM3_DEVICES as a workaround for MPI applications hanging in libfabric's PSM3 provider
        -- cf. https://github.com/easybuilders/easybuild-easyconfigs/issues/18925
        setenv('PSM3_DEVICES', 'self,shm')
    end
end

-- combine all load hook functions into a single one
function site_specific_load_hook(t)
    eessi_bot_libfabric_set_psm3_devices_hook(t)
end

local function combined_load_hook(t)
    -- Assuming this was called from EESSI's SitePackage.lua, this should be defined and thus run
    if eessi_load_hook ~= nil then
        eessi_load_hook(t)
    end
    site_specific_load_hook(t)
end

hook.register("load", combined_load_hook)

This solves the Haswell OpenMPI issues that we observed in several PRs. I was going to make a PR for it, but I have some doubts about how this should be done:

boegel commented 7 months ago

The same approach could be used for other problems that are triggered via `libfabric`; see https://github.com/easybuilders/easybuild-easyconfigs/issues/20233
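
If more workarounds accumulate, one option is to make the hook table-driven. The following is only a rough sketch, not something that is deployed: the names `workarounds`, `workarounds_for` and `eessi_bot_workarounds_hook` are made up for illustration, and the only entry is the `PSM3_DEVICES` setting from the snippet above. The optional `set` argument is an assumption added so the function can be exercised outside Lmod; inside Lmod it falls back to `setenv`.

```lua
-- Sketch only: table-driven variant of the libfabric load hook.
-- Keys are module names; values list the environment variables to set on load.
-- Only the PSM3_DEVICES entry comes from this issue; further entries would be added here.
local workarounds = {
    libfabric = {
        { name = "PSM3_DEVICES", value = "self,shm" },
    },
}

-- Return the workaround list for a full module name like "libfabric/<version>".
-- Returns an empty table if the module has no workarounds or the name has no "/".
local function workarounds_for(modFullName)
    local simpleName = string.match(modFullName, "(.-)/")
    return workarounds[simpleName] or {}
end

-- Apply all workarounds for the module being loaded.
-- `set` is a hypothetical injection point; it defaults to Lmod's setenv.
local function eessi_bot_workarounds_hook(t, set)
    set = set or setenv
    for _, var in ipairs(workarounds_for(t.modFullName)) do
        set(var.name, var.value)
    end
end
```

In `SitePackage.lua` this would be called from `site_specific_load_hook(t)` in place of `eessi_bot_libfabric_set_psm3_devices_hook(t)`.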

ocaisa commented 6 months ago

@TopRichard also found an issue with our CUDA hook when trying to use it on NESSI: it currently forbids loading dependency modules that have GPU support, even for building purposes. Disabling that hook as part of the bot-specific `SitePackage.lua` seems like a good idea.
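
One way this could look, purely as a sketch: build the combined load hook from a list of named hooks and skip the ones listed as disabled. Everything here is hypothetical, including `make_load_hook` and the hook names in the usage comment; it does not reflect how the hooks in EESSI's `SitePackage.lua` are actually organised, and it assumes the individual hooks can be called separately.

```lua
-- Sketch only: compose a load hook from named hooks, skipping disabled ones.
-- hooks:    array of { name = <string>, run = <function(t)> }
-- disabled: set of hook names to skip, e.g. { cuda_check = true }
local function make_load_hook(hooks, disabled)
    disabled = disabled or {}
    return function(t)
        for _, h in ipairs(hooks) do
            if not disabled[h.name] then
                h.run(t)
            end
        end
    end
end

-- Hypothetical usage inside a bot-specific SitePackage.lua:
--   local combined = make_load_hook(all_hooks, { cuda_check = true })
--   hook.register("load", combined)
```

The bot-specific file would then only need to list which hooks to turn off, rather than duplicating the rest.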