conda-forge / openmpi-feedstock

A conda-smithy repository for openmpi.

Failure on systems without SSH #152

Closed · folmos-at-orange closed this 1 month ago

folmos-at-orange commented 7 months ago

Comment:

I tried to run an MPI application on a basic Rocky Linux 9 container with the conda-forge openmpi package. The application failed with the error message below; the failure seems to occur because no ssh client was found.

Error message

--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:

  plm_rsh_agent: ssh : rsh

Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------
[ba7c70732e0d:00625] [[INVALID],INVALID] FORCE-TERMINATE AT Not found:-13 - error plm_rsh_component.c(335)
[ba7c70732e0d:00624] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 716
[ba7c70732e0d:00624] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 172
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[ba7c70732e0d:00624] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
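
For reference, a minimal sketch of the failing setup (the image, environment setup, and program name here are illustrative; the original report did not include its exact commands):

# Bare Rocky Linux 9 container: no ssh client installed
docker run --rm -it rockylinux:9 bash

# Inside the container, with a conda environment providing the
# conda-forge openmpi package (setup elided); ./app is any MPI program:
which ssh           # not found
./app               # even a singleton run aborts: the rsh launcher probes for ssh
mpirun -n 2 ./app   # same failure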
dalcinl commented 7 months ago

I'm not really sure. The dependency is actually just a runtime one. I've raised this issue with the Open MPI folks, but I don't remember the status.

The thing is, Open MPI does not strictly need ssh when running on a single compute node, workstation, or laptop. Yet it eagerly looks for ssh and fails if it is not found, even if ssh is never going to be used.

For example, in my CI tests, when running in a small Docker image that ships without ssh by default, I just do the following:

export OMPI_MCA_plm_ssh_agent=false

Here, ssh_agent means the path to the ssh command (it has nothing to do with the usual SSH agent related to `ssh -A`). By setting the ssh command to `false`, Open MPI can safely continue running on a single node; if an attempt is ever made to actually use ssh, since it is the `false` command, it will simply fail, and you should somehow notice the issue.
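
To make the workaround concrete, a sketch of it in context (`./app` is a placeholder for any MPI program; the `plm_rsh_agent` spelling comes from the error message above, which is what Open MPI 4.x uses):

# Single-node runs in a container without an ssh client
export OMPI_MCA_plm_ssh_agent=false   # parameter name quoted above (Open MPI 5.x)
export OMPI_MCA_plm_rsh_agent=false   # 4.x name, per the error message
mpirun -n 4 ./app                     # local launch proceeds; any real ssh
                                      # attempt runs /bin/false and fails loudly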

In my own particular and biased opinion, I hate dependencies that are not strictly needed, and I'm generally not in favor of forcibly adding such dependencies as defaults. I'm totally fine with deps marked as optional or recommended, or with mechanisms such as `pip install package[feature1,feature2]`, but I don't think conda supports any such mechanism. Or am I wrong?

Additionally, I'm a bit afraid of the consequences for users who inadvertently install the conda-forge openssh package in their conda environment: from then on, any ssh invocation will use the conda binary instead of the system one.

All that being said, if the rest of the Open MPI conda-forge user community feels that in this particular case it makes sense to add openssh as a dependency of openmpi, you will not hear additional words from me objecting to the decision. Of course, if any issue ever occurs because of such a change, I'll simply roll my eyes with a sad smile, immediately unsubscribe from any issue/PR related to the problem, and happily keep going with my short and inconsequential human life.

folmos-at-orange commented 7 months ago

@dalcinl thanks for your quick answer.

I agree that adding openssh as a dependency is too much, since it is strictly not needed. I wonder whether the Open MPI folks could fall back to single-node mode when ssh is not found; MPICH doesn't have this problem.

I'll use your workaround for my CI problems, and also for the execution of the packaged program I'm working on. I was thinking of adding openssh as a requirement for my package, but setting the env var is a better solution.

dalcinl commented 7 months ago

I wonder whether the Open MPI folks could fall back to single-node mode when ssh is not found

That would be the definitive solution, indeed. However, look at https://github.com/open-mpi/ompi/issues/12386, although I'm a bit confused about this comment: https://github.com/open-mpi/ompi/issues/12386#issuecomment-1978797613. After reading all of that issue, at this point I'm not really sure what the actual intended behavior is. Maybe you should ask for a clarification: is the absence of ssh supposed to be fatal, even when running on a single node (or without an allocation)?

njzjz commented 6 months ago

export OMPI_MCA_plm_ssh_agent=false

We encountered the same error in conda-forge/ambertools-feedstock#133 when testing the conda package with Open MPI. Is it recommended to set this environment variable during the test? The conda-forge documentation (https://conda-forge.org/docs/maintainer/knowledge_base/#message-passing-interface-mpi) may need to mention this.

dalcinl commented 6 months ago

@njzjz Whether to set the variable or not depends on whether you have the ssh command. If you are running in a minimal container environment without it, then you either install openssh with the container's package manager, or you set the variable to work around the issue.
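
For a Rocky Linux 9 container, those two options might look like this (openssh-clients is the Rocky/RHEL package name; adjust for other distros):

# Option 1: provide an ssh client via the system package manager
dnf install -y openssh-clients

# Option 2: skip ssh entirely for single-node runs
export OMPI_MCA_plm_ssh_agent=false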

PS: Regarding conda-forge documentation, I personally have no involvement with any of that. I guess you can raise an issue or submit a pull request with whatever clarification you consider appropriate.

leofang commented 6 months ago

PS: Regarding conda-forge documentation, I personally have no involvement with any of that. I guess you can raise an issue or submit a pull request with whatever clarification you consider appropriate.

Feel free to open an issue in https://github.com/conda-forge/conda-forge.github.io.

minrk commented 5 months ago

The openmpi package now sets appropriate environment variables when run under `$CONDA_BUILD`, so it works without ssh, and recipes should not need to set any `OMPI` variables in their build/test scripts anymore.
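
Illustratively, activation logic of this kind could look like the following sketch (an assumption about the mechanism, not the feedstock's actual script):

# Only relax launcher settings inside conda-build environments
if [ -n "${CONDA_BUILD:-}" ]; then
    export OMPI_MCA_plm_ssh_agent=false            # don't require an ssh client
    export OMPI_MCA_rmaps_base_oversubscribe=true  # CI runners have few cores
fi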

I don't think we should do anything at runtime that deviates from openmpi's default behavior, so I'm not sure there's anything for us to do here. I also wish openmpi's default behavior were different, but I'm sure they have their reasons. It may be worth documenting that openmpi requires ssh by default and that we don't depend on it, since it should usually come from the system. The Debian openmpi-bin package depends on ssh-client, for example.

leofang commented 1 month ago

Sounds like we can close this issue now?