conda-forge / msmpi-feedstock

A conda-smithy repository for msmpi.
BSD 3-Clause "New" or "Revised" License

Note for package maintainers: Don't test running msmpi in your recipe #2

Closed leofang closed 3 years ago

leofang commented 3 years ago

As title. Below I am summarizing and leaving a note for Conda-Forge downstream recipe maintainers on my struggle starting in https://github.com/conda-forge/mpi4py-feedstock/pull/33#issuecomment-755181656.

Do not test this in your Win64 builds:

mpiexec -n N your_executable ...

what you'll get is the following error:

(%PREFIX%) %SRC_DIR%>mpiexec -n 2 python mpi4py_simple_test.py 
ERROR: Failed to post close command error 1726
ERROR: unable to tear down the job tree. exiting...

If you run mpiexec with the full debug log on (mpiexec -d 3 ...) you'd see that a wrong PMP version is picked up:

...
[00:5320] successfully launched process 1
[00:5320] successfully launched process 0
[00:5320] root process launched, starting stdin redirection.
[01:2068] ERROR: Process Management Protocol version mismatch in connection
request: received version 3, expected 4.
[01:2068] ERROR: Process Management Protocol version mismatch in connection
request: received version 3, expected 4.
[01:2068] process_id=1 rank=0 refcount=2, waiting for the process to finish exiting.
[01:2068] creating an exit command for process id=1  rank=0, pid=9128, exit code=-1.
...

But the reason is currently unknown. The PMP version from MS MPI v10 (built in this feedstock) is 4.
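For anyone triaging a similar failure, here is a minimal sketch of how the `mpiexec -d 3` output could be scanned for this mismatch. It assumes the message has exactly the wording shown in the log above (possibly wrapped across lines). The function name `find_pmp_mismatches` is hypothetical and not part of MS-MPI or any conda-forge tooling.

```python
import re

# Wording copied from the debug log above; this is an assumption about the
# format and may not hold for other MS-MPI versions.
_PMP_MISMATCH = re.compile(
    r"Process Management Protocol version mismatch in connection request: "
    r"received version (\d+), expected (\d+)"
)


def find_pmp_mismatches(log_text):
    """Return the unique (received, expected) PMP version pairs in log_text."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    pairs = []
    for match in _PMP_MISMATCH.finditer(flat):
        pair = (int(match.group(1)), int(match.group(2)))
        if pair not in pairs:
            pairs.append(pair)
    return pairs
```

An empty result means no mismatch message was found. A pair such as `(3, 4)` means a version 3 peer tried to connect to a component expecting version 4.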

My best theory is that there is a hidden conflict between Conda-Forge's CI setup and Azure's infrastructure. It could be that an smpd-like orchestrating daemon running in the background gets picked up to talk to mpiexec, so the MPI processes cannot pass the version check. (Ideally, both the smpd manager and all MPI processes would agree upon the PMP version.)

Note that, although not listed on the CI image (vs2017-win2016)'s software list, the image contains an outdated MS MPI (v7.x) in C:\Program Files\Microsoft MPI\. I tried force-removing it before building & testing, but it didn't fix the error. So for those who are interested in hunting down where the PMP version 3 comes from, you should first investigate if it's due to C:\Program Files\Microsoft MPI\Bin\smpd.exe being launched in the background when the CI starts. I am not familiar with Windows, so after confirming the build is correct I have 0 interest in digging deeper...
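As a starting point for that investigation, the sketch below lists every copy of a given executable found in the `PATH` directories, in search order. That shows whether a stale `smpd.exe` or `mpiexec.exe` shadows the one from the conda environment. The helper name `find_on_path` is hypothetical. It only inspects `PATH`: it does not detect a process that is already running in the background, and it does not cover other lookup locations Windows may use.

```python
import os


def find_on_path(names, path=None):
    """List files on PATH whose name matches one of `names`, in search order.

    Matching is case-insensitive, as file names are on Windows.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    wanted = {name.lower() for name in names}
    hits = []
    for directory in path.split(os.pathsep):
        # Skip empty entries and directories that do not exist.
        if not directory or not os.path.isdir(directory):
            continue
        for entry in sorted(os.listdir(directory)):
            full = os.path.join(directory, entry)
            if entry.lower() in wanted and os.path.isfile(full):
                hits.append(full)
    return hits
```

For example, calling `find_on_path(["smpd.exe", "mpiexec.exe"])` in the test phase and printing the result would show whether `C:\Program Files\Microsoft MPI\Bin` appears ahead of the environment's own binaries.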

cc: @conda-forge/core for the record.

leofang commented 3 years ago

FYI: the closest (but still not directly relevant) discussion I can find is this: https://developercommunity.visualstudio.com/content/problem/446886/cannot-uninstall-microsoft-mpi-in-azure-pipeline-o.html

leofang commented 3 years ago

An alternative strategy to test the package before releasing it is to enable the build artifact to be stored by Azure via adding this config

azure:
   store_build_artifacts: true

to conda-forge.yml so that it can be downloaded and tested locally.

leofang commented 3 years ago

Hi @mariusvniekerk, Matt told me you might be able to share some insight with us on Azure's infrastructure 😅 Any chance you have a clue, off the top of your head, as to why we can't get a matching PMP version?

leofang commented 3 years ago

In #6 I added a hacky solution to this problem: Force-installing the MS-MPI Runtime to overwrite the existing one in the vm image: https://github.com/conda-forge/msmpi-feedstock/blob/4a424b89b0a79c928d9902c2082a6548ef86e1ca/recipe/bld.bat#L38-L44 All MPI tests can run afterwards, but this is just a very bad band-aid fix that only applies to the current repo and not to all downstream repos. However, this is the best I can do.

dalcinl commented 3 years ago

@leofang Just a clarification. I believe this issue does not affect Azure's vmImage: 'windows-2019'. I'm using that image for mpi4py's CI, I'm installing latest MS MPI from official installers, and I have no issues.

leofang commented 3 years ago

Trust me, @dalcinl. See all the recent failing attempts in https://github.com/conda-forge/msmpi-feedstock/pull/7. In mpi4py's CI, it's precisely what we do now in the msmpi and mpi4py conda packages: We force reinstall a fresh, new MS-MPI Runtime to overwrite the existing MS-MPI v7, so it works.

beckermr commented 3 years ago

@conda-forge/core do you all have any ideas here on how to help fix this?

leofang commented 3 years ago

Gitter discussion starts at https://gitter.im/conda-forge/conda-forge.github.io?at=608837fad24b29049788a9e1

isuruf commented 3 years ago

Can you show the full debug message?

leofang commented 3 years ago

@isuruf check the last failing CI output in #7.

isuruf commented 3 years ago

Do you have a link for an msmpi version that's built from source? The debug output is different from what I see in the source posted on GitHub.

leofang commented 3 years ago

Unfortunately Azure has removed all failing tests (note that this issue was created back in January). But I can assure you the PMP version mismatch (request: received version 3, expected 4) happens with both the source build and the binary installer.

beckermr commented 3 years ago

@leofang if you can open a new PR with the failure, that'd help others debug more easily.

leofang commented 3 years ago

I reopened #7 as a playground for you.

leofang commented 3 years ago

So #11 fixed multiple issues. It includes a patch (the missing key that I hadn't realized when I asked for help on Gitter) that allows downstream packages to test msmpi out of the box: https://github.com/conda-forge/msmpi-feedstock/blob/7fc2c26641837eb233879473ca3ee0dd27242edb/recipe/activate.bat#L5-L12 I just verified with mpi4py that it's working: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=310547&view=results This patch is only applied when building in the CI.

isuruf commented 3 years ago

The deletion of those DLLs should be moved to https://github.com/conda-forge/conda-forge-ci-setup-feedstock/blob/master/recipe/run_conda_forge_build_setup_win.bat#L53.

To avoid this issue on user systems, we should rename the msmpi.dll to msmpi-10.dll and fix msmpi.lib to use that name.