conda-forge / openmpi-feedstock

A conda-smithy repository for openmpi.
BSD 3-Clause "New" or "Revised" License

Add UCC support #175

Closed: j34ni closed this 1 week ago

j34ni commented 1 week ago

Summary

Checklist

conda-forge-webservices[bot] commented 1 week ago

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found some lint.

Here's what I've got...

For recipe/meta.yaml:

github-actions[bot] commented 1 week ago

Hi! This is the friendly automated conda-forge-linting service.

I failed to even lint the recipe, probably because of a conda-smithy bug :cry:. This likely indicates a problem in your meta.yaml, though. To get a traceback to help figure out what's going on, install conda-smithy and run conda smithy recipe-lint --conda-forge . from the recipe directory.

conda-forge-webservices[bot] commented 1 week ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

j34ni commented 1 week ago

The build for ppc64le fails and I cannot understand why. Here is the error message; can you tell what went wrong?

2024-09-23T07:45:44.9748755Z configure: exit 1
2024-09-23T07:45:44.9748969Z + false
2024-09-23T07:45:44.9749221Z Traceback (most recent call last):
2024-09-23T07:45:44.9749819Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 1788, in bundle_conda
2024-09-23T07:45:44.9750151Z     utils.check_call_env(
2024-09-23T07:45:44.9750603Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/utils.py", line 404, in check_call_env
2024-09-23T07:45:44.9751061Z     return _func_defaulting_env_to_os_environ("call", *popenargs, **kwargs)
2024-09-23T07:45:44.9752097Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/utils.py", line 380, in _func_defaulting_env_to_os_environ
2024-09-23T07:45:44.9752714Z     raise subprocess.CalledProcessError(proc.returncode, _args)
2024-09-23T07:45:44.9753500Z subprocess.CalledProcessError: Command '['/usr/bin/bash', '-e', '/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/work/build-mpi.sh']' returned non-zero exit status 1.
2024-09-23T07:45:44.9753895Z 
2024-09-23T07:45:44.9754263Z The above exception was the direct cause of the following exception:
2024-09-23T07:45:44.9754508Z 
2024-09-23T07:45:44.9754834Z Traceback (most recent call last):
2024-09-23T07:45:44.9755303Z   File "/opt/conda/bin/conda-build", line 11, in <module>
2024-09-23T07:45:44.9755685Z     sys.exit(execute())
2024-09-23T07:45:44.9756229Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/cli/main_build.py", line 589, in execute
2024-09-23T07:45:44.9756639Z     api.build(
2024-09-23T07:45:44.9757134Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/api.py", line 209, in build
2024-09-23T07:45:44.9757537Z     return build_tree(
2024-09-23T07:45:44.9758072Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 3712, in build_tree
2024-09-23T07:45:44.9758497Z     packages_from_this = build(
2024-09-23T07:45:44.9759044Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 2765, in build
2024-09-23T07:45:44.9759492Z     newly_built_packages = bundlers[pkg_type](
2024-09-23T07:45:44.9760052Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 1795, in bundle_conda
2024-09-23T07:45:44.9760515Z     raise BuildScriptException(str(exc), caused_by=exc) from exc
2024-09-23T07:45:44.9761301Z conda_build.exceptions.BuildScriptException: Command '['/usr/bin/bash', '-e', '/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/work/build-mpi.sh']' returned non-zero exit status 1.
2024-09-23T07:45:52.0075268Z 
2024-09-23T07:45:52.0477672Z ##[error]Bash exited with code '1'.
2024-09-23T07:45:52.0742477Z ##[section]Finishing: Run docker build

j34ni commented 1 week ago

I spotted this in the log for the build with ppc64le:

configure:125428: checking for sn/mmtimer.h
configure:125428: powerpc64le-conda-linux-gnu-cc -c -DNDEBUG -mcpu=power8 -mtune=power8 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem /home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/work=/usr/local/src/conda/openmpi-5.0.5 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold=/usr/local/src/conda-prefix -isystem /usr/local/cuda/targets/ppc64le-linux/include -finline-functions -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold/include -isystem /usr/local/cuda/targets/ppc64le-linux/include conftest.c >&5
conftest.c:693:10: fatal error: sn/mmtimer.h: No such file or directory
  693 | #include <sn/mmtimer.h>
      |          ^~~~~~~~~~~~~~
compilation terminated.
configure:125428: $? = 1
configure: failed program was:
| /* confdefs.h */

Could that be what makes the build fail?

minrk commented 1 week ago

Great work! This does seem to change ucx/ucc from optional to required dependencies. That means they shouldn't be present explicitly in run or test dependencies; they should either:

I think #102 is the motivation for making them optional, if we can keep going with that. Maybe with all of it working now, you can treat ucc the same as ucx and it will still work?

@leofang can you look at this?

j34ni commented 1 week ago

Thanks

I checked, and it is not necessary to load UCX to use Open MPI when no communication over an interconnect is involved (i.e., on a laptop or on a single node): it works without errors or warnings.

However, for use without CUDA support one has to set:

export OMPI_MCA_opal_warn_on_missing_libcuda=0
export OMPI_MCA_opal_cuda_support=0
export OMPI_MCA_mca_base_component_show_load_errors=0
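
For a quick check, the same settings can also be passed per invocation instead of exported; a minimal illustration (the test program name below is only a placeholder):

OMPI_MCA_opal_warn_on_missing_libcuda=0 \
OMPI_MCA_opal_cuda_support=0 \
mpiexec -n 2 ./my_mpi_test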

github-actions[bot] commented 1 week ago

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found some lint.

Here's what I've got...

For recipe/meta.yaml:

leofang commented 1 week ago

@j34ni the last commit might have a problem. Rerendering failed and all the CI configs are gone.

j34ni commented 1 week ago

What happened to the .ci_support files?

j34ni commented 1 week ago

Can you restore them?

leofang commented 1 week ago

Btw, before my review, I see the following in the CI log. I'd like to point out two items:

@j34ni it seems adding the additional packages to host (e.g. pmix, libnuma, ...) is not doing what you want. Could you check whether we should pass the build-time options differently, so that Open MPI's build system can detect the new features that you need?

Open MPI configuration:
-----------------------
Version: 5.0.5
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: installing packaged docs
hwloc: external
libevent: external
Open UCC: yes
pmix: internal
PRRTE: internal
Threading Package: pthreads

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no (not found)
Open UCX: yes
OpenFabrics OFI Libfabric: no (not found)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Accelerators
-----------------------
CUDA support: yes
ROCm support: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)
PVFS2/OrangeFS: no

leofang commented 1 week ago

Can you restore them?

You can just revert the commit and retry.

leofang commented 1 week ago

@conda-forge-admin, please rerender

github-actions[bot] commented 1 week ago

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you but ran into some issues. Please check the output logs of the latest webservices GitHub actions workflow run for errors. You can also ping conda-forge/core for further assistance or you can try rerendering locally.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/11035718350.

leofang commented 1 week ago

  File "/tmp/tmpnjif5hwu/openmpi-feedstock/recipe/meta.yaml", line 83, in top-level template code
    # Ensure a consistent CUDA environment
^^^^^^^^^^^^^^^^^^^^^
jinja2.exceptions.UndefinedError: 'CONDA_BUILD_SYSROOT' is undefined

conda-{smithy,build} are being naughty lately... has anyone seen this error elsewhere?

leofang commented 1 week ago

@conda-forge-admin, please rerender

github-actions[bot] commented 1 week ago

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you but ran into some issues. Please check the output logs of the latest webservices GitHub actions workflow run for errors. You can also ping conda-forge/core for further assistance or you can try rerendering locally.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/11037264469.

github-actions[bot] commented 1 week ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

j34ni commented 1 week ago

@leofang I think I messed up, sorry about that. I have reverted a couple of commits for now and will go through the comments again.

j34ni commented 1 week ago

I am trying to use the current build (with all the changes you suggested) on an HPC system to run the OSU Micro-Benchmarks 7.4, and there must be something I do not understand because it fails. The exact same thing worked with my previous build. How do you suggest activating UCX with the current build?

leofang commented 1 week ago

conda-{smithy,build} are being naughty lately... has anyone seen this error elsewhere?

@leofang I think I messed up, sorry about that.

No, not your fault at all. The rerendering functionality seems to be down at least today, which affects both offline rerendering and CI services. The core team is investigating: https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/350#issuecomment-2375500712.

We need the rerendering capability to be back before we can rebuild the package for local testing, because rerendering updates the .ci_support configs to include the newly added packages, e.g. https://github.com/conda-forge/openmpi-feedstock/blob/954a28450bf11f01853995a3fd64f3b16e18b6ff/.ci_support/linux_64_c_compiler_version11cuda_compilernvcccuda_compiler_version11.8cxx_compiler_version11fortran_compiler_version11mpi_typeconda.yaml#L37-L38, and those configs are then fed into conda-build.
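
For reference, once the service is back up the rerender can also be run locally; a minimal sketch, assuming conda-smithy is installed from conda-forge into the base environment:

conda install -n base -c conda-forge conda-smithy
cd openmpi-feedstock
conda smithy rerender -c auto    # -c auto commits the regenerated .ci_support configs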

The exact same thing worked with my previous build. How do you suggest to activate UCX with the current build?

The logic is reverted, so now you need a way to activate the UCX/UCC support if you need it. For UCX we've figured it out in https://github.com/conda-forge/openmpi-feedstock/blob/main/recipe/post-link-ucx.sh. Could you kindly take a look and see how to generalize it to UCC, please?
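
For illustration, a minimal sketch of what such a gate could look like for UCC, assuming the same pattern of writing an MCA default into $PREFIX/etc/openmpi-mca-params.conf (hypothetical; the actual post-link-ucx.sh may do this differently):

# hypothetical post-link sketch: keep the coll/ucc component disabled unless the user opts in
PARAMS_FILE="${PREFIX}/etc/openmpi-mca-params.conf"
if ! grep -q '^coll_ucc_enable' "${PARAMS_FILE}" 2>/dev/null; then
  echo "coll_ucc_enable = 0" >> "${PARAMS_FILE}"
fi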

leofang commented 1 week ago

there must be something I do not understand because it fails.

Failed in what way?

j34ni commented 1 week ago

I removed the unnecessary dependencies from host and run.

As far as I understand, if Open MPI is configured/built with --with-ucx=... --with-ucc=..., then it will use UCX/UCC at run time if they are available, and will not complain if they are not available or not used.

leofang commented 1 week ago

As far as I understand, if Open MPI is configured/built with --with-ucx=... --with-ucc=..., then it will use UCX/UCC at run time if they are available, and will not complain if they are not available or not used.

We want advanced users who know exactly what they're doing to explicitly opt in to CUDA/UCX/UCC support. By removing the config file in commit 5f86ed3, we are saying that as long as ucx/ucc are installed they will be used by default, which is not the default we want. Yes, we're still guarded by Open MPI's excellent dynamic loading mechanism, but we need an extra layer of protection and the opposite default.

The ask was how to generalize the MCA-based protection that we added for UCX to also support UCC, not to remove it 🙂

j34ni commented 1 week ago

@leofang

UCC added support for the MPI_Scatter() and MPI_Iscatter() collectives, and things like --mca coll_tuned_use_dynamic_rules can be used to tune the collective component, but I can see nothing that turns it on/off.

I guess that what you had for UCX could work for both then.

leofang commented 1 week ago

(...) but I can see nothing that turns it on/off.

I guess that what you had for UCX could work for both then.

@j34ni could you try whether --mca coll ucc (enable) and --mca coll ^ucc (disable) make a difference? If so, we should reuse the same MCA-based gating that we already have for UCX.

A few more words: we rely on Open MPI's MCA mechanism to enable/disable a component. The ucc component is an available option of the coll framework. It is important that we understand these inner workings even from the packaging perspective, because they affect how we build the package, set its defaults, and enable advanced use cases.
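
For anyone following along, the available components and their parameters can be inspected with ompi_info, which ships with Open MPI (output details vary by build):

# list the MCA parameters exposed by the ucc component of the coll framework
ompi_info --param coll ucc --level 9

# list which coll components were built into this package
ompi_info | grep "MCA coll"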

j34ni commented 1 week ago

I think I got it now and we can enable or disable UCC easily:

To disable it: export OMPI_MCA_coll_ucc_enable=0

To enable it: export OMPI_MCA_coll_ucc_enable=1
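
The same toggle can also be passed on the mpiexec command line for a single run (the benchmark binary below is only illustrative):

# enable UCC collectives for this run only
mpiexec --mca coll_ucc_enable 1 -n 2 ./osu_allreduce

# or equivalently via the environment, as above
OMPI_MCA_coll_ucc_enable=1 mpiexec -n 2 ./osu_allreduce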

leofang commented 1 week ago

@conda-forge-admin, please rerender

github-actions[bot] commented 1 week ago

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you but ran into some issues. Please check the output logs of the latest webservices GitHub actions workflow run for errors. You can also ping conda-forge/core for further assistance or you can try rerendering locally.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/11061278004.

minrk commented 1 week ago

The main reason ucx was excluded as a hard dependency was to save space, because it pulled in hundreds of megabytes of extra packages due to CUDA stuff. That's not true anymore since #119, so would it simplify things to make ucc/ucx plain conda dependencies? I don't entirely follow the reasoning behind ucx still being optional when it now only seems to save ~20 MB. The functionality is still optional, right?

leofang commented 1 week ago

@conda-forge-admin, please rerender

leofang commented 1 week ago

No, not your fault at all. The rerendering functionality seems to be down at least today, which affects both offline rerendering and CI services. The core team is investigating: conda-forge/conda-forge-ci-setup-feedstock#350 (comment).

To close the loop, it turns out the rerendering failed for another reason: https://github.com/conda-forge/openmpi-feedstock/pull/177#discussion_r1777665637 which is now fixed.

leofang commented 1 week ago

I don't entirely follow the reasoning behind ucx still being optional when it seems to only save ~20MB. The functionality is still optional, right?

My preference is that if something is optional, we do not list it in run. In addition to extra bandwidth and storage usage, adding more dependencies could also run into dependency-resolution issues (e.g. https://github.com/conda-forge/openmpi-feedstock/issues/153). Right now this PR presents a minimal change to the recipe, and I like it that way. Maybe we can move this discussion to a separate issue and try to get this merged once the last question on FCFLAG is resolved and the tests pass?

minrk commented 1 week ago

Makes sense!

minrk commented 1 week ago

Thanks, this LGTM now.

minrk commented 1 week ago

Doesn't need to happen here, but I think we can add actual tests that the optional ucx/ucc support works by including conda install ucc ucx in run_test.sh, rather than purely inspecting the configure output:

# first, test without ucx
# ...
conda install ucc ucx
# enable ucx, ucc
mpiexec -n 2 some_test
# verify ucx/ucc was used
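
For illustration, one way the idea could be fleshed out (a sketch only; the test program, the install step, and the verification are placeholders, not the feedstock's actual run_test.sh):

set -ex

# baseline: run without ucx/ucc installed
mpiexec -n 2 ./mpi_hello             # placeholder test program

# install the optional transports and opt in
conda install -y -c conda-forge ucx ucc
export OMPI_MCA_coll_ucc_enable=1

# run again with UCC collectives enabled
mpiexec -n 2 ./mpi_hello

# confirm the components are visible to Open MPI
ompi_info | grep -i -E 'ucx|ucc'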

minrk commented 1 week ago

Thanks @j34ni and thanks @leofang for reviewing!

j34ni commented 1 week ago

Thanks to you guys


leofang commented 1 week ago

Thanks @j34ni @minrk @pentschev!

leofang commented 5 days ago

FYI, as a follow-up I'm adding UCC to the global pinning: https://github.com/conda-forge/conda-forge-pinning-feedstock/pull/6484

leofang commented 4 days ago

Maybe we can move this discussion to a separate issue

See #181.