conda-forge / openmpi-feedstock

A conda-smithy repository for openmpi.
BSD 3-Clause "New" or "Revised" License

openmpi v5.0.1 #132

Closed: regro-cf-autotick-bot closed this 6 months ago

regro-cf-autotick-bot commented 6 months ago

It is very likely that the current package version for this feedstock is out of date.

Checklist before merging this PR:

Information about this PR:

  1. Feel free to push to the bot's branch to update this PR if needed.
  2. The bot will almost always only open one PR per version.
  3. The bot will stop issuing PRs if more than 3 version bump PRs generated by the bot are open. If you don't want to package a particular version please close the PR.
  4. If you want these PRs to be merged automatically, make an issue with @conda-forge-admin, please add bot automerge in the title and merge the resulting PR. This command will add our bot automerge feature to your feedstock.
  5. If this PR was opened in error or needs to be updated, please add the bot-rerun label to this PR. The bot will close this PR and schedule another one. If you do not have permissions to add this label, you can use the phrase @conda-forge-admin, please rerun bot in a PR comment to have the conda-forge-admin add it for you.

Closes: #131

Pending Dependency Version Updates

Here is a list of all the pending dependency version updates for this repo. Please double check all dependencies before merging.

Name     Upstream Version  Current Version
openmpi  5.0.1             (Anaconda-Server badge)

Dependency Analysis

Please note that this analysis is highly experimental. The aim is to make maintenance easier by inspecting the package's dependencies. Importantly, this analysis does not support optional dependencies; please double-check those before making changes. If you never want hints of this kind, add bot: inspection: false to your conda-forge.yml. If you encounter issues with this feature, please ping the bot team (conda-forge/bot).

Analysis by source code inspection shows a discrepancy between it and the package's stated requirements in the meta.yaml.

Packages found by source code inspection but not in the meta.yaml:

This PR was created by the regro-cf-autotick-bot. The regro-cf-autotick-bot is a service to automatically track the dependency graph, migrate packages, and propose package version updates for conda-forge. Feel free to drop us a line if there are any issues! This PR was generated by https://github.com/regro/cf-scripts/actions/runs/7294847426, please use this URL for debugging.

conda-forge-webservices[bot] commented 6 months ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

dalcinl commented 6 months ago

@leofang The libnl issue is gone. But now I'm seeing the missing libcuda warnings:

[fv-az1542-577:03743] mca_base_component_repository_open: unable to open mca_accelerator_cuda: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
[fv-az1542-577:03743] mca_base_component_repository_open: unable to open mca_rcache_gpusm: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
[fv-az1542-577:03743] mca_base_component_repository_open: unable to open mca_rcache_rgpusm: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)

@jsquyres We need your assistance again. I hoped the warnings above would have been silenced by setting

opal_warn_on_missing_libcuda = 0
opal_cuda_support = 0

in $PREFIX/etc/openmpi-mca-params.conf. Has the configuration changed? Or is it somehow being ignored?

dalcinl commented 6 months ago

Additionally, the previous v5.0.0 builds are unusable in CircleCI:

+ mpiexec -n 1 python -m coverage run -m mpi4py.bench --help
--------------------------------------------------------------------------
It looks like prte_init failed for some reason. There are many reasons
that can cause PRRTE to fail during prte_init, some of which are due to
configuration or environment problems.  This failure appears to be an
internal failure - here's some additional information (which may only
be relevant to a PRRTE developer):

  prte_plm_base_select failed
  --> Returned value  (-46) instead of PRTE_SUCCESS
--------------------------------------------------------------------------

I'm clueless about how to move forward.

jsquyres commented 6 months ago

@dalcinl Are these issues about Open MPI v5.0.0 or v5.0.1? Also, note that a bunch of people have disappeared for the holidays; we might not get some replies here until people return to the office in January.

First issue: warnings about libcuda

You'll need to set mca_base_component_show_load_errors to 0 to suppress this warning. I'm not sure what warnings you were suppressing with opal_warn_on_missing_libcuda=0.
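As a minimal sketch of that suggestion, assuming a POSIX shell: the parameter can be supplied either through Open MPI's OMPI_MCA_<name> environment-variable convention or through the same $PREFIX/etc/openmpi-mca-params.conf file mentioned earlier in this thread. The fallback directory used when PREFIX is unset is a hypothetical placeholder for illustration, not the feedstock's real layout.

```shell
# Option 1: set the MCA parameter for the current shell session only.
export OMPI_MCA_mca_base_component_show_load_errors=0

# Option 2: persist it in the installation-wide MCA parameter file.
# PREFIX normally points at the conda environment; the fallback is a
# throwaway directory used purely for illustration (assumption).
PREFIX="${PREFIX:-$(mktemp -d)}"
conf="$PREFIX/etc/openmpi-mca-params.conf"
mkdir -p "$(dirname "$conf")"

# Append the setting only if the file does not already define it.
grep -qs '^mca_base_component_show_load_errors' "$conf" \
  || echo 'mca_base_component_show_load_errors = 0' >> "$conf"
```

Either option alone should be enough. Note that this hides load errors for every component, not only the CUDA ones.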

@janjust @wenduwan Is this how you expected the CUDA-library-is-not-available-everywhere functionality to work? FWIW, I see the MCA var opal_warn_on_missing_libcuda registered, but I do not actually see it used anywhere in the code base.

Second issue: plm / CircleCI

"prte_plm_base_select failed" means that it couldn't find a launcher component to actually invoke your command. Can you add --mca plm_base_verbose 100 to the command line? This will show a bunch of output about the plm selection logic when you run.
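For illustration, the failing command from the CircleCI log with that option added might look like the sketch below. It assumes mpiexec accepts --mca in this position, as suggested above. The run_verbose helper and the plm-verbose.log file name are hypothetical names introduced here, not part of the feedstock.

```shell
# Hypothetical helper: wrap mpiexec so every launch carries the extra
# plm verbosity option suggested above.
run_verbose() {
  mpiexec --mca plm_base_verbose 100 "$@"
}

# Re-run the failing command from the log, but only if mpiexec is installed.
if command -v mpiexec >/dev/null 2>&1; then
  # The launch may still fail; the goal is to capture the selection output.
  # plm-verbose.log is an arbitrary file name chosen for this sketch.
  run_verbose -n 1 python -m coverage run -m mpi4py.bench --help 2>&1 \
    | tee plm-verbose.log || true
fi
```

The captured log can then be pasted into the thread so the plm selection output is visible to the PRRTE developers.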

wenduwan commented 6 months ago

The mca_base_component_repository_open warning is expected if Open MPI 5.0.0/1 is built --with-cuda but CUDA is not available at runtime. This can be suppressed with mca_base_component_show_load_errors=0.

I'm actually not sure what opal_warn_on_missing_libcuda does...

dalcinl commented 6 months ago

> Are these issues about Open MPI v5.0.0 or v5.0.1?

They come from v5.0.0. We are trying to get a clean working build of v5.0.0 before moving to v5.0.1.

> I see the MCA var opal_warn_on_missing_libcuda registered, but I do not actually see it used anywhere in the code base.

That explains it all. This setting should do what it is supposed to do, or be removed for good, or at least emit a deprecation warning or something.

> This can be suppressed with mca_base_component_show_load_errors=0

OK. This is really too generic, but we have to work with what we were given.

> I'm actually not sure what opal_warn_on_missing_libcuda does...

Maybe in v4.1.x it was doing something, but then it got refactored out?

wenduwan commented 6 months ago

The opal warning is new in 5.0.0 wrt how Open MPI uses CUDA:

> That explains it all. This setting should do what it is supposed to do, or be removed for good, or at least emit a deprecation warning or something.

I agree. We should do something about this.

dalcinl commented 6 months ago

@conda-forge-admin rerun bot

regro-cf-autotick-bot commented 6 months ago

Due to the bot-rerun label I'm closing this PR. I will make another one as appropriate. This was generated by https://github.com/regro/cf-scripts/actions/runs/7317361884