conda-forge / openmpi-feedstock

A conda-smithy repository for openmpi.
BSD 3-Clause "New" or "Revised" License

Add UCC support #175

Closed: j34ni closed this 2 months ago

j34ni commented 2 months ago

Summary

Checklist

conda-forge-webservices[bot] commented 2 months ago

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found some lint.

Here's what I've got...

For recipe/meta.yaml:

github-actions[bot] commented 2 months ago

Hi! This is the friendly automated conda-forge-linting service.

I failed to even lint the recipe, probably because of a conda-smithy bug :cry:. This likely indicates a problem in your meta.yaml, though. To get a traceback to help figure out what's going on, install conda-smithy and run conda smithy recipe-lint --conda-forge . from the recipe directory.

conda-forge-webservices[bot] commented 2 months ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

j34ni commented 2 months ago

The build for ppc64le fails and I cannot understand why. Here is the error message; can you tell what went wrong?

2024-09-23T07:45:44.9748755Z configure: exit 1
2024-09-23T07:45:44.9748969Z + false
2024-09-23T07:45:44.9749221Z Traceback (most recent call last):
2024-09-23T07:45:44.9749819Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 1788, in bundle_conda
2024-09-23T07:45:44.9750151Z     utils.check_call_env(
2024-09-23T07:45:44.9750603Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/utils.py", line 404, in check_call_env
2024-09-23T07:45:44.9751061Z     return _func_defaulting_env_to_os_environ("call", *popenargs, **kwargs)
2024-09-23T07:45:44.9752097Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/utils.py", line 380, in _func_defaulting_env_to_os_environ
2024-09-23T07:45:44.9752714Z     raise subprocess.CalledProcessError(proc.returncode, _args)
2024-09-23T07:45:44.9753500Z subprocess.CalledProcessError: Command '['/usr/bin/bash', '-e', '/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/work/build-mpi.sh']' returned non-zero exit status 1.
2024-09-23T07:45:44.9753895Z 
2024-09-23T07:45:44.9754263Z The above exception was the direct cause of the following exception:
2024-09-23T07:45:44.9754508Z 
2024-09-23T07:45:44.9754834Z Traceback (most recent call last):
2024-09-23T07:45:44.9755303Z   File "/opt/conda/bin/conda-build", line 11, in <module>
2024-09-23T07:45:44.9755685Z     sys.exit(execute())
2024-09-23T07:45:44.9756229Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/cli/main_build.py", line 589, in execute
2024-09-23T07:45:44.9756639Z     api.build(
2024-09-23T07:45:44.9757134Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/api.py", line 209, in build
2024-09-23T07:45:44.9757537Z     return build_tree(
2024-09-23T07:45:44.9758072Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 3712, in build_tree
2024-09-23T07:45:44.9758497Z     packages_from_this = build(
2024-09-23T07:45:44.9759044Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 2765, in build
2024-09-23T07:45:44.9759492Z     newly_built_packages = bundlers[pkg_type](
2024-09-23T07:45:44.9760052Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 1795, in bundle_conda
2024-09-23T07:45:44.9760515Z     raise BuildScriptException(str(exc), caused_by=exc) from exc
2024-09-23T07:45:44.9761301Z conda_build.exceptions.BuildScriptException: Command '['/usr/bin/bash', '-e', '/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/work/build-mpi.sh']' returned non-zero exit status 1.
2024-09-23T07:45:52.0075268Z 
2024-09-23T07:45:52.0477672Z ##[error]Bash exited with code '1'.
2024-09-23T07:45:52.0742477Z ##[section]Finishing: Run docker build

j34ni commented 2 months ago

I spotted this in the log for the build with ppc64le:

configure:125428: checking for sn/mmtimer.h
configure:125428: powerpc64le-conda-linux-gnu-cc -c -DNDEBUG -mcpu=power8 -mtune=power8 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem /home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/work=/usr/local/src/conda/openmpi-5.0.5 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold=/usr/local/src/conda-prefix -isystem /usr/local/cuda/targets/ppc64le-linux/include -finline-functions -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold/include -isystem /usr/local/cuda/targets/ppc64le-linux/include conftest.c >&5
conftest.c:693:10: fatal error: sn/mmtimer.h: No such file or directory
  693 | #include <sn/mmtimer.h>
      |          ^~~~~~~~~~~~~~
compilation terminated.
configure:125428: $? = 1
configure: failed program was:
| /* confdefs.h */

Could that be what makes the build fail?

minrk commented 2 months ago

Great work! This does seem to change ucx/ucc from optional to required dependencies. That means they shouldn't be present explicitly in run or test dependencies; they should either:

I think #102 is the motivation for making them optional, if we can keep going with that. Maybe with all of it working now, you can treat ucc the same as ucx and it will still work?

@leofang can you look at this?

j34ni commented 2 months ago

Thanks

I checked, and it is not necessary to load UCX to use OpenMPI if no communication over an interconnect is involved (i.e., on a laptop or on a single node): it works without errors or warnings.

However, for use without CUDA support one has to set:

export OMPI_MCA_opal_warn_on_missing_libcuda=0
export OMPI_MCA_opal_cuda_support=0
export OMPI_MCA_mca_base_component_show_load_errors=0
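
These OMPI_MCA_* environment variables map one-to-one onto command-line MCA parameters, so the same settings can be passed per run instead, e.g. (the application name is illustrative):

mpiexec --mca opal_warn_on_missing_libcuda 0 --mca opal_cuda_support 0 -n 2 ./my_app
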
github-actions[bot] commented 2 months ago

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found some lint.

Here's what I've got...

For recipe/meta.yaml:

leofang commented 2 months ago

@j34ni the last commit might have a problem. Rerendering failed and all the CI configs were gone.

j34ni commented 2 months ago

What happened to the .ci_support files?

j34ni commented 2 months ago

Can you restore them?

leofang commented 2 months ago

btw, before my review I saw the following in the CI log. I'd like to point out two items:

@j34ni it seems that adding the additional packages to host (e.g., pmix, libnuma, ...) is not doing what you want. Could you check whether we should pass the build-time options differently, so that Open MPI's build system can detect the new features you need?

Open MPI configuration:
-----------------------
Version: 5.0.5
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: installing packaged docs
hwloc: external
libevent: external
Open UCC: yes
pmix: internal
PRRTE: internal
Threading Package: pthreads

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no (not found)
Open UCX: yes
OpenFabrics OFI Libfabric: no (not found)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Accelerators
-----------------------
CUDA support: yes
ROCm support: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)
PVFS2/OrangeFS: no

leofang commented 2 months ago

Can you restore them?

You can just revert the commit and retry.

leofang commented 2 months ago

@conda-forge-admin, please rerender

github-actions[bot] commented 2 months ago

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you but ran into some issues. Please check the output logs of the latest webservices GitHub actions workflow run for errors. You can also ping conda-forge/core for further assistance or you can try rerendering locally.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/11035718350.

leofang commented 2 months ago

  File "/tmp/tmpnjif5hwu/openmpi-feedstock/recipe/meta.yaml", line 83, in top-level template code
    # Ensure a consistent CUDA environment
^^^^^^^^^^^^^^^^^^^^^
jinja2.exceptions.UndefinedError: 'CONDA_BUILD_SYSROOT' is undefined

conda-{smithy,build} are being naughty lately... has anyone seen this error elsewhere?

leofang commented 2 months ago

@conda-forge-admin, please rerender

github-actions[bot] commented 2 months ago

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you but ran into some issues. Please check the output logs of the latest webservices GitHub actions workflow run for errors. You can also ping conda-forge/core for further assistance or you can try rerendering locally.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/11037264469.

github-actions[bot] commented 2 months ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

j34ni commented 2 months ago

@leofang I think I messed up, sorry about that. I have reverted for now a couple of commits and will go again through the comments

j34ni commented 2 months ago

I am trying to use the current build (with all the changes you suggested) on an HPC system to run the OSU Micro-Benchmarks 7.4, and there must be something I do not understand, because it fails. The exact same thing worked with my previous build. How do you suggest activating UCX with the current build?

leofang commented 2 months ago

conda-{smithy,build} are being naughty lately... have anyone seen this error elsewhere?

@leofang I think I messed up, sorry about that.

No, not your fault at all. The rerendering functionality seems to be down at least today, which affects both offline rerendering and CI services. The core team is investigating: https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/350#issuecomment-2375500712.

We need the rerendering capability to be back before we can rebuild the package for local testing, because rerendering updates the .ci_support configs to include the newly added packages, e.g. https://github.com/conda-forge/openmpi-feedstock/blob/954a28450bf11f01853995a3fd64f3b16e18b6ff/.ci_support/linux_64_c_compiler_version11cuda_compilernvcccuda_compiler_version11.8cxx_compiler_version11fortran_compiler_version11mpi_typeconda.yaml#L37-L38, and those configs are then fed into conda-build.

The exact same thing worked with my previous build. How do you suggest to activate UCX with the current build?

The logic is reverted, so now you need a way to activate the UCX/UCC support if you need it. For UCX we've figured it out here: https://github.com/conda-forge/openmpi-feedstock/blob/main/recipe/post-link-ucx.sh. Could you kindly take a look and see how to generalize it to UCC, please?
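
For reference, a minimal sketch of the kind of generalization I have in mind; the file name and mechanism here are illustrative, not the actual script (Open MPI reads default MCA parameters from etc/openmpi-mca-params.conf under its install prefix):

# post-link-ucc.sh (hypothetical): make the ucc component opt-in
# by writing a default MCA parameter into the installed prefix
echo "coll_ucc_enable = 0" >> "${PREFIX}/etc/openmpi-mca-params.conf"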

leofang commented 2 months ago

there must be something I do not understand because it fails.

Failed in what way?

j34ni commented 2 months ago

I removed the unnecessary dependencies from host and run.

As far as I understand, if OpenMPI is configured/built with --with-ucx=... --with-ucc=..., then it will use UCX/UCC at run time if they are available, and will not complain if they are not.
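
For concreteness, the kind of configure invocation I mean looks something like this (paths illustrative; the actual flags live in build-mpi.sh):

./configure --with-ucx="${PREFIX}" --with-ucc="${PREFIX}"  # plus the rest of the usual flags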

leofang commented 2 months ago

As far as I understand, if OpenMPI is configured/built with --with-ucx=... --with-ucc=..., then it will use UCX/UCC at run time if they are available, and will not complain if they are not.

We want advanced users who know exactly what they're doing to explicitly opt in to CUDA/UCX/UCC support. By removing the config file in commit 5f86ed3, we are saying: as long as ucx/ucc are installed, they will be used by default, which is not what we want. Yes, we're still guarded by Open MPI's excellent dynamic loading mechanism, but we need an extra layer of protection and the opposite default.

The ask was how to generalize the MCA-based protection that we added for UCX to also support UCC, not to remove it 🙂

j34ni commented 2 months ago

@leofang

UCC added support for the MPI_Scatter() and MPI_Iscatter() collectives, and things like --mca coll_tuned_use_dynamic_rules can be used to tune the collective component, but I can see nothing to turn it on/off.

I guess that what you had for UCX could work for both, then.

leofang commented 2 months ago

(...) but I can see nothing to turn it on/off

I guess that what you had for UCX could work for both, then.

@j34ni could you try whether --mca coll ucc (enable) and --mca coll ^ucc (disable) make a difference? If so we should

A few more words: we rely on Open MPI's MCA mechanism to enable or disable a component. The ucc component is an available option of the coll framework. It is important that we understand the inner workings even from the packaging perspective, because it affects how we build a package, set its default expectations, and enable advanced use cases.
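
For example, selection at the framework level looks like this (the benchmark binary is illustrative):

mpiexec -n 2 --mca coll ^ucc ./osu_allreduce    # run with the ucc component excluded from the coll framework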

j34ni commented 2 months ago

I think I got it now and we can enable or disable UCC easily:

To disable it: export OMPI_MCA_coll_ucc_enable=0

To enable it: export OMPI_MCA_coll_ucc_enable=1
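
The same parameter can also be set for a single run on the command line instead of via the environment (application name illustrative):

mpiexec --mca coll_ucc_enable 1 -n 2 ./my_app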

leofang commented 2 months ago

@conda-forge-admin, please rerender

github-actions[bot] commented 2 months ago

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you but ran into some issues. Please check the output logs of the latest webservices GitHub actions workflow run for errors. You can also ping conda-forge/core for further assistance or you can try rerendering locally.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/11061278004.

minrk commented 2 months ago

The main reason ucx was excluded as a hard dependency was to save space because it resulted in hundreds of megabytes of extra packages due to cuda stuff. That's not true anymore since #119, so would it simplify things to make ucc/ucx a plain conda dependency? I don't entirely follow the reasoning behind ucx still being optional when it seems to only save ~20MB. The functionality is still optional, right?

leofang commented 2 months ago

@conda-forge-admin, please rerender

leofang commented 2 months ago

No, not your fault at all. The rerendering functionality seems to be down at least today, which affects both offline rerendering and CI services. The core team is investigating: conda-forge/conda-forge-ci-setup-feedstock#350 (comment).

To close the loop, it turns out the rerendering failed for another reason: https://github.com/conda-forge/openmpi-feedstock/pull/177#discussion_r1777665637 which is now fixed.

leofang commented 2 months ago

I don't entirely follow the reasoning behind ucx still being optional when it seems to only save ~20MB. The functionality is still optional, right?

My preference is that if something is optional, we do not list it in run. In addition to extra bandwidth and storage usage, adding more dependencies could also run into dependency-resolution issues (e.g. https://github.com/conda-forge/openmpi-feedstock/issues/153). Right now this PR presents a minimal change to the recipe, and I like it that way. Maybe we can move this discussion to a separate issue and try to get this merged once the last question on FCFLAGS is resolved and the tests pass?
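
(For context, a hypothetical sketch of how optional packages can stay out of run while still being version-bounded when a user installs them, via run_constrained; the bounds are purely illustrative:)

requirements:
  run_constrained:
    - ucx >=1.15    # only constrains ucx if the user installs it
    - ucc >=1.3     # likewise for ucc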

minrk commented 2 months ago

Makes sense!

minrk commented 2 months ago

Thanks, this LGTM now.

minrk commented 2 months ago

Doesn't need to happen here, but I think we can add actual tests that the optional ucx/ucc support works, by including conda install ucc ucx in run_test.sh, rather than purely inspecting the configure output:

# first, test without ucx
# ...
conda install ucc ucx
# enable ucx, ucc
mpiexec -n 2 some_test
# verify ucx/ucc was used
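
One possible way to do that last verification step is to check the component listing with ompi_info (output format only sketched):

# after installing ucx/ucc, the components should appear in the MCA listing
ompi_info | grep ucx   # e.g. "MCA pml: ucx ..."
ompi_info | grep ucc   # e.g. "MCA coll: ucc ..."
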
minrk commented 2 months ago

Thanks @j34ni and thanks @leofang for reviewing!

j34ni commented 2 months ago

Thanks to you guys

leofang commented 2 months ago

Thanks @j34ni @minrk @pentschev!

leofang commented 1 month ago

FYI as a follow-up I'm adding UCC to the global pinning https://github.com/conda-forge/conda-forge-pinning-feedstock/pull/6484

leofang commented 1 month ago

Maybe we can move this discussion to a separate issue

See #181.