Closed: j34ni closed this 2 months ago
Hi! This is the friendly automated conda-forge-linting service.
I wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found some lint.
Here's what I've got...
For recipe/meta.yaml:
- Selectors are suggested to take a <two spaces>#<one space>[<expression>] form. See lines [66, 84].
Hi! This is the friendly automated conda-forge-linting service.
I failed to even lint the recipe, probably because of a conda-smithy bug :cry:. This likely indicates a problem in your meta.yaml, though. To get a traceback to help figure out what's going on, install conda-smithy and run conda smithy recipe-lint --conda-forge . from the recipe directory.
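For reference, a minimal sketch of what that local lint run could look like (assuming conda-smithy is installed from the conda-forge channel; the install command is illustrative):

conda install -c conda-forge conda-smithy
cd recipe/
conda smithy recipe-lint --conda-forge .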
Hi! This is the friendly automated conda-forge-linting service.
I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.
The build for ppc64le fails and I cannot understand why. Here is the error message; can you tell what went wrong?
2024-09-23T07:45:44.9748755Z configure: exit 1
2024-09-23T07:45:44.9748969Z + false
2024-09-23T07:45:44.9749221Z Traceback (most recent call last):
2024-09-23T07:45:44.9749819Z File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 1788, in bundle_conda
2024-09-23T07:45:44.9750151Z utils.check_call_env(
2024-09-23T07:45:44.9750603Z File "/opt/conda/lib/python3.10/site-packages/conda_build/utils.py", line 404, in check_call_env
2024-09-23T07:45:44.9751061Z return _func_defaulting_env_to_os_environ("call", *popenargs, **kwargs)
2024-09-23T07:45:44.9752097Z File "/opt/conda/lib/python3.10/site-packages/conda_build/utils.py", line 380, in _func_defaulting_env_to_os_environ
2024-09-23T07:45:44.9752714Z raise subprocess.CalledProcessError(proc.returncode, _args)
2024-09-23T07:45:44.9753500Z subprocess.CalledProcessError: Command '['/usr/bin/bash', '-e', '/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/work/build-mpi.sh']' returned non-zero exit status 1.
2024-09-23T07:45:44.9753895Z
2024-09-23T07:45:44.9754263Z The above exception was the direct cause of the following exception:
2024-09-23T07:45:44.9754508Z
2024-09-23T07:45:44.9754834Z Traceback (most recent call last):
2024-09-23T07:45:44.9755303Z File "/opt/conda/bin/conda-build", line 11, in <module>
2024-09-23T07:45:44.9755685Z sys.exit(execute())
2024-09-23T07:45:44.9756229Z File "/opt/conda/lib/python3.10/site-packages/conda_build/cli/main_build.py", line 589, in execute
2024-09-23T07:45:44.9756639Z api.build(
2024-09-23T07:45:44.9757134Z File "/opt/conda/lib/python3.10/site-packages/conda_build/api.py", line 209, in build
2024-09-23T07:45:44.9757537Z return build_tree(
2024-09-23T07:45:44.9758072Z File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 3712, in build_tree
2024-09-23T07:45:44.9758497Z packages_from_this = build(
2024-09-23T07:45:44.9759044Z File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 2765, in build
2024-09-23T07:45:44.9759492Z newly_built_packages = bundlers[pkg_type](
2024-09-23T07:45:44.9760052Z File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 1795, in bundle_conda
2024-09-23T07:45:44.9760515Z raise BuildScriptException(str(exc), caused_by=exc) from exc
2024-09-23T07:45:44.9761301Z conda_build.exceptions.BuildScriptException: Command '['/usr/bin/bash', '-e', '/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/work/build-mpi.sh']' returned non-zero exit status 1.
2024-09-23T07:45:52.0075268Z
2024-09-23T07:45:52.0477672Z ##[error]Bash exited with code '1'.
2024-09-23T07:45:52.0742477Z ##[section]Finishing: Run docker build
I spotted this in the log for the build with ppc64le:
configure:125428: checking for sn/mmtimer.h
configure:125428: powerpc64le-conda-linux-gnu-cc -c -DNDEBUG -mcpu=power8 -mtune=power8 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem /home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/work=/usr/local/src/conda/openmpi-5.0.5 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold=/usr/local/src/conda-prefix -isystem /usr/local/cuda/targets/ppc64le-linux/include -finline-functions -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/conda/feedstock_root/build_artifacts/openmpi-mpi_1727077155962/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold/include -isystem /usr/local/cuda/targets/ppc64le-linux/include conftest.c >&5
conftest.c:693:10: fatal error: sn/mmtimer.h: No such file or directory
693 | #include <sn/mmtimer.h>
| ^~~~~~~~~~~~~~
compilation terminated.
configure:125428: $? = 1
configure: failed program was:
| /* confdefs.h */
Could that be what makes the build fail?
Great work! This does seem to change ucx/ucc from optional to required dependencies. That means they shouldn't be present explicitly in run or test dependencies; they should either:
- be listed in ignore_run_exports, or
- be left out of ignore_run_exports (which puts the right pinning in run).

I think #102 is the motivation for making them optional, if we can keep going with that. Maybe with all of it working now, you can treat ucc the same as ucx and it will still work?
@leofang can you look at this?
Thanks
I checked and it is not necessary to load UCX to use OpenMPI if no communication via an interconnect is involved (i.e., on a laptop or on a single node): it works without errors or warnings.
However, for use without CUDA support one has to do:
export OMPI_MCA_opal_warn_on_missing_libcuda=0
export OMPI_MCA_opal_cuda_support=0
export OMPI_MCA_mca_base_component_show_load_errors=0
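For completeness, the same settings can also be passed per run instead of via the environment, since OMPI_MCA_<param> environment variables map to --mca <param> <value> on the command line (a sketch; the application name is hypothetical):

mpiexec --mca opal_warn_on_missing_libcuda 0 --mca opal_cuda_support 0 -n 2 ./my_app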
Hi! This is the friendly automated conda-forge-linting service.
I wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found some lint.
Here's what I've got...
For recipe/meta.yaml:
- The feedstock has no .ci_support files and thus will not build any packages.

@j34ni the last commit might have a problem. Rerendering failed and all CI configs were gone.
What happened to the .ci_support files?
Can you restore them?
btw, before my review I see the following CI log. I'd like to point out two items:
pmix: internal
OpenFabrics OFI Libfabric: no (not found)
@j34ni it seems adding the additional packages to host (ex: pmix, libnuma, ...) is not doing what you want. Could you check if we should pass the build-time options differently so that Open MPI's build system can detect the new features that you need?
Open MPI configuration:
-----------------------
Version: 5.0.5
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)
Miscellaneous
-----------------------
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: installing packaged docs
hwloc: external
libevent: external
Open UCC: yes
pmix: internal
PRRTE: internal
Threading Package: pthreads
Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no (not found)
Open UCX: yes
OpenFabrics OFI Libfabric: no (not found)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
Accelerators
-----------------------
CUDA support: yes
ROCm support: no
OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)
PVFS2/OrangeFS: no
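Regarding the question above about passing the build-time options differently, here is a hedged sketch of handing explicit prefixes to configure so it can detect the host packages (the --with-pmix/--with-ofi flag names are assumptions to be checked against ./configure --help; whether build-mpi.sh should use them, and for which packages, is exactly the open question):

./configure --prefix="$PREFIX" \
    --with-pmix="$PREFIX" \
    --with-ofi="$PREFIX"    # OFI/Libfabric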
Can you restore them?
You can just revert the commit and retry.
@conda-forge-admin, please rerender
Hi! This is the friendly automated conda-forge-webservice.
I tried to rerender for you but ran into some issues. Please check the output logs of the latest webservices GitHub actions workflow run for errors. You can also ping conda-forge/core for further assistance or you can try rerendering locally.
This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/11035718350.
File "/tmp/tmpnjif5hwu/openmpi-feedstock/recipe/meta.yaml", line 83, in top-level template code
# Ensure a consistent CUDA environment
^^^^^^^^^^^^^^^^^^^^^
jinja2.exceptions.UndefinedError: 'CONDA_BUILD_SYSROOT' is undefined
conda-{smithy,build} are being naughty lately... has anyone seen this error elsewhere?
@conda-forge-admin, please rerender
Hi! This is the friendly automated conda-forge-webservice.
I tried to rerender for you but ran into some issues. Please check the output logs of the latest webservices GitHub actions workflow run for errors. You can also ping conda-forge/core for further assistance or you can try rerendering locally.
This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/11037264469.
Hi! This is the friendly automated conda-forge-linting service.
I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.
@leofang I think I messed up, sorry about that. I have reverted a couple of commits for now and will go through the comments again.
I am trying to use the current build (with all the changes you suggested) on an HPC to run the OSU Micro Benchmarks 7.4 and there must be something I do not understand because it fails. The exact same thing worked with my previous build. How do you suggest activating UCX with the current build?
conda-{smithy,build} are being naughty lately... has anyone seen this error elsewhere?
@leofang I think I messed up, sorry about that.
No, not your fault at all. The rerendering functionality seems to be down at least today, which affects both offline rerendering and CI services. The core team is investigating: https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/350#issuecomment-2375500712.
We need the rerendering capability to be back before we can rebuild the package for local testing, because rerendering would update the .ci_support config to include the newly added packages, e.g.
https://github.com/conda-forge/openmpi-feedstock/blob/954a28450bf11f01853995a3fd64f3b16e18b6ff/.ci_support/linux_64_c_compiler_version11cuda_compilernvcccuda_compiler_version11.8cxx_compiler_version11fortran_compiler_version11mpi_typeconda.yaml#L37-L38
and the config is then fed into conda-build.
The exact same thing worked with my previous build. How do you suggest activating UCX with the current build?
The logic is reverted, so now you need a way to activate the UCX/UCC support if you need it. For UCX we've figured it out here: https://github.com/conda-forge/openmpi-feedstock/blob/main/recipe/post-link-ucx.sh. Could you kindly take a look and see how to generalize it to UCC, please?
there must be something I do not understand because it fails.
Failed in what way?
I removed the unnecessary dependencies from host and run.
As far as I understand, if OpenMPI is configured/built with --with-ucx=... --with-ucc=...
then it will use UCX/UCC at run time if they are available, and will not complain if they are not used or not available.
As far as I understand, if OpenMPI is configured/built with
--with-ucx=... --with-ucc=...
then it will use UCX/UCC at run time if they are available, and will not complain if they are not used or not available.
We want advanced users who know exactly what they're doing to explicitly opt in to CUDA/UCX/UCC support. By removing the config file in commit 5f86ed3, we are saying: as long as ucx/ucc are installed they will be used by default, which is not what we want. Yes, we're still guarded by Open MPI's excellent dynamic loading mechanism, but we need an extra layer of protection and the opposite default.
The ask was how to generalize the MCA-based protection that we added for UCX to also support UCC, not to remove it 🙂
@leofang
UCC added support for the MPI_Scatter() and MPI_Iscatter() collectives, and things like --mca coll_tuned_use_dynamic_rules can be used to tune the collective component, but I can see nothing to turn it on/off.
I guess that what you had for UCX could do for both then
(...) but I can see nothing to turn it on/off
I guess that what you had for UCX could do for both then
@j34ni could you try if --mca coll ucc (enable) and --mca coll ^ucc (disable) would make a difference? If so we should update the conf file to add ^ucc (disable by default).

A few more words: We rely on Open MPI's MCA mechanism to disable/enable a component. The ucc component is an available option of the coll framework. It is important that we understand the inner workings even from the packaging perspective, because it affects how we build a package, set its default expectation, and enable advanced use cases.
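As a concrete illustration of the conf-file idea (a sketch only; it assumes Open MPI reads $PREFIX/etc/openmpi-mca-params.conf and that "coll = ^ucc" excludes the ucc component from the coll framework by default, analogous to the UCX handling being discussed):

# in a build or post-link step: disable the ucc coll component by default
echo "coll = ^ucc" >> "$PREFIX/etc/openmpi-mca-params.conf"
# a later opt-in step (e.g. a ucc post-link script) could remove or override this line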
I think I got it now and we can enable or disable UCC easily:
To disable it: export OMPI_MCA_coll_ucc_enable=0
To enable it: export OMPI_MCA_coll_ucc_enable=1
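A brief usage sketch of those switches (the benchmark binary name is hypothetical; the --mca form is the per-run equivalent of the environment variable):

export OMPI_MCA_coll_ucc_enable=1                       # opt in to the UCC collectives
mpiexec -n 2 ./osu_allreduce
mpiexec --mca coll_ucc_enable 1 -n 2 ./osu_allreduce    # same effect, per run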
@conda-forge-admin, please rerender
Hi! This is the friendly automated conda-forge-webservice.
I tried to rerender for you but ran into some issues. Please check the output logs of the latest webservices GitHub actions workflow run for errors. You can also ping conda-forge/core for further assistance or you can try rerendering locally.
This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/11061278004.
The main reason ucx was excluded as a hard dependency was to save space because it resulted in hundreds of megabytes of extra packages due to cuda stuff. That's not true anymore since #119, so would it simplify things to make ucc/ucx a plain conda dependency? I don't entirely follow the reasoning behind ucx still being optional when it seems to only save ~20MB. The functionality is still optional, right?
@conda-forge-admin, please rerender
No, not your fault at all. The rerendering functionality seems to be down at least today, which affects both offline rerendering and CI services. The core team is investigating: conda-forge/conda-forge-ci-setup-feedstock#350 (comment).
To close the loop, it turns out the rerendering failed for another reason: https://github.com/conda-forge/openmpi-feedstock/pull/177#discussion_r1777665637 which is now fixed.
I don't entirely follow the reasoning behind ucx still being optional when it seems to only save ~20MB. The functionality is still optional, right?
My preference is that if something is optional, we do not list it in run. In addition to extra bandwidth and storage usage, adding more dependencies could also run into dependency resolution issues (ex: https://github.com/conda-forge/openmpi-feedstock/issues/153). Right now this PR presents a minimal change to the recipe and I like it that way. Maybe we can move this discussion to a separate issue and try to get this merged once the last question on FCFLAGS is resolved and the tests pass?
Makes sense!
Thanks, this LGTM now.
Doesn't need to happen here, but I think we can add actual tests that the optional ucx support works by including conda install ucc in run_test.sh, rather than purely inspecting the config output:
# first, test without ucx
# ...
conda install ucc ucx
# enable ucx, ucc
mpiexec -n 2 some_test
# verify ucx/ucc was used
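One possible way to fill in the "verify" step might be to query ompi_info for the components (a sketch; the exact strings to grep for are assumptions and should be checked against real ompi_info output):

ompi_info | grep -i "MCA coll" | grep -iq ucc && echo "ucc coll component found"
ompi_info | grep -i "MCA pml" | grep -iq ucx && echo "ucx pml component found"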
Thanks @j34ni and thanks @leofang for reviewing!
Thanks to you guys
Thanks @j34ni @minrk @pentschev!
FYI as a follow-up I'm adding UCC to the global pinning https://github.com/conda-forge/conda-forge-pinning-feedstock/pull/6484
Maybe we can move this discussion to a separate issue
See #181.
Summary
- UCC support (build_with_ucc="--with-ucc=$PREFIX")
- ompi_info based tests

Checklist
- Reset the build number to 0 (if the version changed)
- Re-rendered with the latest conda-smithy (use the phrase @<space/>conda-forge-admin, please rerender in a comment in this PR for automated rerendering)