desihub / desimodules

Maintain the default set of DESI product versions and environment modules loaded.
BSD 3-Clause "New" or "Revised" License
0 stars 1 forks source link

Remove multiply-defined environment variables #52

Closed weaverba137 closed 3 months ago

weaverba137 commented 3 months ago

This PR closes #49.

Important question for @sbailey, @marcelo-alvarez, @akremin, @tskisner: we've decided to consolidate the definition of various NERSC-performance environment variables, e.g. KMP_INIT_AT_FORK. Was there a specific reason that the environment variables were defined in several different places in the module file, i.e., the definitions have to be made before at least some DESI modules are loaded?

marcelo-alvarez commented 3 months ago

This PR closes #49.

Important question for @sbailey, @marcelo-alvarez, @akremin, @tskisner: we've decided to consolidate the definition of various NERSC-performance environment variables, e.g. KMP_INIT_AT_FORK. Was there a specific reason that the environment variables were defined in several different places in the module file, i.e., the definitions have to be made before at least some DESI modules are loaded?

I can't remember if there was a specific reason to define the variables in several different places. However, I don't think the definitions have to be made before any DESI module is loaded, and don't anticipate any problems in the environment caused by defining them in the same place as you've done in this branch.

As far as I can tell, the comment

# work around
# https://docs.nersc.gov/development/languages/python/using-python-perlmutter/#mpi-issues

applies to the following lines

setenv CXI_FORK_SAFE 1
setenv CXI_FORK_SAFE_HP 1
module load evp-patch

so you probably want to group those together just below the comment.

weaverba137 commented 3 months ago

Thank you @marcelo-alvarez, I've actually subdivided those two by the subsection of the #mpi-issues link.

I also notice that we do

module unload cudatoolkit module load cudatoolkit/12.2

But 12.2 is the default, so why specifically is this done?

weaverba137 commented 3 months ago

PS, it would also be nice to have specific documentation for KMP_INIT_AT_FORK (does not appear at all in docs.nersc.gov) and MPI4PY_RC_RECV_MPROBE, which supposedly is described at https://docs.nersc.gov/current/#ongoing-issues, but in fact does not appear at all.

marcelo-alvarez commented 3 months ago

I agree it doesn't make sense as is. The module unload cudatoolkit is a remnant of when desiconda had its own cudatoolkit installed as a dependency, in order to avoid conflicts between that version and the NERSC system-wide version in the module.

@sbailey would know for sure, but it looks like https://github.com/desihub/desiconda/pull/68 removed the desiconda cudatoolkit, and so all that needs to be done in desimodules is to load the NERSC system-wide version via module load cudatoolkit (with no unloading), with the caveat that a future change to the default cudatoolkit module may break the environment for unforeseen reasons. Such a caveat applies for other default modules as well, so that would argue in favor of just using the default. I think either the module load cudatoolkit or module load cudatoolkit/12.2 would be fine.

As for KMP_INIT_AT_FORK things might break if you remove it, see e.g. https://github.com/desihub/nightwatch/issues/335#issuecomment-1555089934. It would be great if you could get to the bottom of what cases this is actually needed for and remove it if there are better solutions, but otherwise you might just want to leave it in, as inelegant as it is.

With regard to MPI4PY_RC_RECV_MPROBE='False' there was an ongoing issue at https://docs.nersc.gov/current/ circa August 2022 which was described there as follows:

Our slingshot 11 libfabric (Perlmutter CPU nodes) is currently missing the functions MPI_Mprobe()/MPI_Improbe() and MPI_Mrecv()/MPI_Imrecv(). This may especially impact mpi4py users attempting to send/receive pickled objects. One workaround may be to set export MPI4PY_RC_RECV_MPROBE='False'. The current estimated timeframe for a fix is January 2023. If these missing functions are impacting your workload, please open a ticket to let us know.

Given it is now July 2024, and there is no mention of this at the same page, you might assume it's fixed and remove it, though you may want to test it with a test-prod just to make sure. @sbailey should probably make this call.

sbailey commented 3 months ago

Thanks for digging up the history, @marcelo-alvarez . For the purposes of this cleanup update, let's aim for achieving the exact same environment as we currently have, since we know that it works with the current pipeline. i.e. this is about making the modules more maintainable in the future, not about changing the environment now (don't fix what isn't broken).

We probably don't need MPI4PY_RC_RECV_MPROBE='False' anymore, but it works as-is, it was put in place to solve an intermittent instability, and it would be difficult to prove that removing it doesn't re-introduce a rare crash. Let's not open that can of worms now.

weaverba137 commented 3 months ago

I'm OK with leaving the environment variables in place, but I'm still not sure about:

module unload cudatoolkit
module load cudatoolkit/12.2

I verified that cudatoolkit/12.2 is loaded by default when logging in to NERSC, without invoking any DESI environment.

weaverba137 commented 3 months ago

Another question:

This version is set as the "default" desiconda version before being overridden on perlmutter. However, this version does not exist at all. Should the default be a version that actually exists?

set desiconda_version 20211217-2.0.0
weaverba137 commented 3 months ago

Marking this as ready for review. I've added a few new questions above already.

weaverba137 commented 3 months ago

@sbailey, there's still the question about the default value of desiconda_version, but otherwise, I've applied all requested changes.

sbailey commented 3 months ago

From Slack #survey-ops discussions, online surveyops has been using $FIBER_ASSIGN_DIR for what we called $DESI_TILES here. I suggest that we switch to using $FIBER_ASSIGN_DIR here (and desihub/desispec#2294) to match that prior usage. @weaverba137 if you feel strongly that we should stick with $DESI_TILES, please document your thoughts here. I could be ok with that, but that doesn't necessarily imply that we would update surveyops to switch to $DESI_TILES instead. i.e. if we go with $DESI_TILES here, it might be better to live with the duplicate names rather than risk breaking operations even briefly for a cosmetic change.

weaverba137 commented 3 months ago

I don't have any issue with using FIBER_ASSIGN_DIR. I've already committed the change on this PR.

And oops, I forgot to push my previous commit!

weaverba137 commented 3 months ago

I've also changed to FIBER_ASSIGN_DIR in desihub/desispec#2294.

sbailey commented 3 months ago

I did some digging on desiconda_version: the current structure is mostly a leftover from supporting both cori and perlmutter simultaneously, i.e. doesn't apply now but could apply again in the future. At KPNO, we set DESI_PRODUCT_ROOT and "module load desiconda/some-version" ahead of time, so desiconda_version is never used. i.e. there is opportunity for conceptual cleanup here as well, but for now I've made the minimal change of making the default a version that exists (even if that default is never used), and adding some comments.

I re-verified that the environment is the same as current main plus $FIBER_ASSIGN_DIR, i.e. looks good. I will merge this now but will do separate testing of this plus desihub/desispec#2294 (and similar other repos) before merging those.

weaverba137 commented 3 months ago

@sbailey, thank you, I agree with your documentation on desiconda_version.