conda-forge / openmpi-feedstock

A conda-smithy repository for openmpi.
BSD 3-Clause "New" or "Revised" License

Rebuild #142

Closed dalcinl closed 3 months ago

dalcinl commented 5 months ago

Checklist

conda-forge-webservices[bot] commented 5 months ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

leofang commented 5 months ago

@conda-forge-admin, please rerender

github-actions[bot] commented 5 months ago

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you, but it looks like there was nothing to do.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/7786526756.

leofang commented 5 months ago

@conda-forge/core We need some help here. We keep hitting the unicode error after merging #141 (the CI was green there, but the error started happening on main). Now we can reproduce the error even in the CI of this PR, so I can only assume it is caused by a change in one of the build tools that landed in the meantime. I asked in the Gitter channel last week but haven't gotten any response so far...

2024-02-05T10:57:22.0104022Z Warning: rpath /home/conda/feedstock_root/build_artifacts/openmpi-mpi_1707129643574/_build_env/lib is outside prefix /home/conda/feedstock_root/build_artifacts/openmpi-mpi_1707129643574/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold (removing it)
2024-02-05T10:57:22.0576062Z Warning: rpath /home/conda/feedstock_root/build_artifacts/openmpi-mpi_1707129643574/_build_env/lib is outside prefix /home/conda/feedstock_root/build_artifacts/openmpi-mpi_1707129643574/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold (removing it)
2024-02-05T10:57:22.1059741Z Traceback (most recent call last):
2024-02-05T10:57:22.1060663Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/os_utils/liefldd.py", line 54, in ensure_binary
2024-02-05T10:57:22.1067614Z     return lief.parse(str(file))
2024-02-05T10:57:22.1068672Z TypeError: '/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1707129643574/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold/lib/\x01\udce4\x05'
2024-02-05T10:57:22.1069304Z 
2024-02-05T10:57:22.1069548Z During handling of the above exception, another exception occurred:
2024-02-05T10:57:22.1069815Z 
2024-02-05T10:57:22.1070014Z Traceback (most recent call last):
2024-02-05T10:57:22.1070354Z   File "/opt/conda/bin/conda-mambabuild", line 10, in <module>
2024-02-05T10:57:22.1070660Z     sys.exit(main())
2024-02-05T10:57:22.1071014Z   File "/opt/conda/lib/python3.10/site-packages/boa/cli/mambabuild.py", line 256, in main
2024-02-05T10:57:22.1077424Z     call_conda_build(action, config)
2024-02-05T10:57:22.1078241Z   File "/opt/conda/lib/python3.10/site-packages/boa/cli/mambabuild.py", line 228, in call_conda_build
2024-02-05T10:57:22.1078616Z     result = api.build(
2024-02-05T10:57:22.1079028Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/api.py", line 254, in build
2024-02-05T10:57:22.1084360Z     return build_tree(
2024-02-05T10:57:22.1084989Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 3789, in build_tree
2024-02-05T10:57:22.1097473Z     packages_from_this = build(
2024-02-05T10:57:22.1098008Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 2877, in build
2024-02-05T10:57:22.1104652Z     newly_built_packages = bundlers[pkg_type](output_d, m, env, stats)
2024-02-05T10:57:22.1105254Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 2004, in bundle_conda
2024-02-05T10:57:22.1109746Z     files = post_process_files(metadata, initial_files)
2024-02-05T10:57:22.1110373Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 1815, in post_process_files
2024-02-05T10:57:22.1114364Z     post_build(m, new_files, build_python=python)
2024-02-05T10:57:22.1114887Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/post.py", line 1818, in post_build
2024-02-05T10:57:22.1125728Z     post_process_shared_lib(m, f, prefix_files, host_prefix)
2024-02-05T10:57:22.1126377Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/post.py", line 1680, in post_process_shared_lib
2024-02-05T10:57:22.1130273Z     mk_relative_linux(
2024-02-05T10:57:22.1135257Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/post.py", line 612, in mk_relative_linux
2024-02-05T10:57:22.1135716Z     existing2, _, _ = get_rpaths_raw(elf)
2024-02-05T10:57:22.1136164Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/os_utils/liefldd.py", line 206, in get_rpathy_thing_raw_partial
2024-02-05T10:57:22.1136517Z     binary = ensure_binary(file)
2024-02-05T10:57:22.1136924Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/os_utils/liefldd.py", line 56, in ensure_binary
2024-02-05T10:57:22.1137298Z     print(f"WARNING: liefldd: failed to ensure_binary({file})")
2024-02-05T10:57:22.1137708Z UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in position 303: surrogates not allowed
2024-02-05T10:57:29.2571498Z 
2024-02-05T10:57:29.2638583Z ##[error]Bash exited with code '1'.
2024-02-05T10:57:29.2801376Z ##[section]Finishing: Run docker build
leofang commented 5 months ago

@dalcinl do you think any of the changes that we made could lead to this file being generated?

/home/conda/feedstock_root/build_artifacts/openmpi-mpi_1707129643574/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold/lib/\x01\udce4\x05

which seems an odd one to me
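For context, a name like `\x01\udce4\x05` is what Python shows for a filename whose bytes are not valid UTF-8: the undecodable byte `0xe4` is mapped to the lone surrogate `\udce4` by the `surrogateescape` error handler, and such a string cannot be encoded as strict UTF-8 again, which is presumably why the `print` call in `ensure_binary` fails. Below is a minimal sketch, assuming the on-disk name is the three bytes `01 e4 05` (inferred from the traceback, not verified); the helper `safe_repr` is a made-up name for illustration only.

```python
# Raw filename bytes as inferred from the traceback (an assumption).
raw_name = b"\x01\xe4\x05"

# Filesystem names are decoded with the "surrogateescape" handler,
# so the invalid byte 0xe4 becomes the lone surrogate U+DCE4.
name = raw_name.decode("utf-8", "surrogateescape")


def safe_repr(text):
    """Return a printable form of text that may contain lone surrogates."""
    return text.encode("utf-8", "backslashreplace").decode("utf-8")


# A strict UTF-8 encode (what printing to a UTF-8 stream needs) fails.
try:
    name.encode("utf-8")
    strict_ok = True
    reason = None
except UnicodeEncodeError as exc:
    strict_ok = False
    reason = exc.reason

# Encoding with the same handler restores the original bytes.
round_trip = name.encode("utf-8", "surrogateescape")

# Printing the escaped form does not raise.
print(safe_repr(name))
```

Escaping the name before printing only avoids the secondary `UnicodeEncodeError`; it does not explain where the oddly named file comes from.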

dalcinl commented 5 months ago

@dalcinl do you think any of the changes that we made could lead to this file being generated?

I don't think so, but I'll double check our PR diff.

hmaarrfk commented 5 months ago

it might be a difference in the lief version... maybe compare and try to pin?

leofang commented 5 months ago

liblief/py-lief are both staying on 0.12.3, nothing has changed 🤔

hmaarrfk commented 5 months ago

sometimes i download both logs, delete everything up to the Z that ends each timestamp, and just run vimdiff to visually inspect the differences
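The same idea can be scripted. The following is a rough sketch, not an established tool: it assumes every log line starts with an Azure-style timestamp like the ones in the traceback above, and the function names `strip_timestamps` and `diff_logs` as well as the sample timestamps of the "good" log are invented for illustration.

```python
import difflib
import re

# Matches a leading prefix such as "2024-02-05T10:57:22.0104022Z ".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z ?")


def strip_timestamps(lines):
    """Drop the leading timestamp (and trailing newline) from each line."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_logs(good_lines, bad_lines):
    """Return a unified diff of two logs, ignoring their timestamps."""
    return list(
        difflib.unified_diff(
            strip_timestamps(good_lines),
            strip_timestamps(bad_lines),
            fromfile="good",
            tofile="bad",
            lineterm="",
            n=0,  # no context lines, only the differences
        )
    )


# Tiny illustrative logs; the timestamps of the "good" log are made up.
good = [
    "2024-01-30T08:00:00.0000001Z rdma-core: 49.0-hd3aeb46_2",
    "2024-01-30T08:00:01.0000001Z done",
]
bad = [
    "2024-02-05T10:57:20.0000001Z rdma-core: 50.0-hd3aeb46_0",
    "2024-02-05T10:57:21.0000001Z done",
]

for line in diff_logs(good, bad):
    print(line)
```

With real logs, the two lists would come from reading the downloaded files with `open(path).readlines()`.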

jakirkham commented 4 months ago

Looks like some builds are running out of space. Should we try adding to conda-forge.yml (and re-render)?

azure:
  free_disk_space: true
leofang commented 4 months ago

@conda-forge-admin, please rerender

leofang commented 4 months ago

No luck; running out of disk space doesn't seem to be what causes the filename with weird unicode

jakirkham commented 4 months ago

Previously saw this error in this CI log

conda.CondaMultiError: Error with archive /home/conda/feedstock_root/build_artifacts/pkg_cache/libstdcxx-devel_linux-64-11.4.0-h922705a_105.conda.  You probably need to delete and re-download or re-create this file.  Message was:

failed with error: [Errno 28] No space left on device
Error with archive /home/conda/feedstock_root/build_artifacts/pkg_cache/gfortran_impl_linux-aarch64-11.4.0-hfbda5c0_5.conda.  You probably need to delete and re-download or re-create this file.  Message was:

failed with error: [Errno 28] No space left on device
Error with archive /home/conda/feedstock_root/build_artifacts/pkg_cache/gxx_impl_linux-aarch64-11.4.0-he533754_5.conda.  You probably need to delete and re-download or re-create this file.  Message was:

failed with error: [Errno 28] No space left on device
Error with archive /home/conda/feedstock_root/build_artifacts/pkg_cache/libgfortran5-13.2.0-ha4646dd_5.conda.  You probably need to delete and re-download or re-create this file.  Message was:

failed with error: [Errno 28] No space left on device
Error with archive /home/conda/feedstock_root/build_artifacts/pkg_cache/binutils_impl_linux-64-2.40-hf600244_0.conda.  You probably need to delete and re-download or re-create this file.  Message was:

failed with error: [Errno 28] No space left on device
Error with archive /home/conda/feedstock_root/build_artifacts/pkg_cache/libsanitizer-11.4.0-h4dcbe23_5.conda.  You probably need to delete and re-download or re-create this file.  Message was:

failed with error: [Errno 28] No space left on device
Error with archive /home/conda/feedstock_root/build_artifacts/pkg_cache/gcc_impl_linux-aarch64-11.4.0-he533754_5.conda.  You probably need to delete and re-download or re-create this file.  Message was:

failed with error: [Errno 28] No space left on device
leofang commented 4 months ago

Yeah we were not concerned with that, we've been focusing on fixing this issue: https://github.com/conda-forge/openmpi-feedstock/pull/142#issuecomment-1927240287, starting here: https://github.com/conda-forge/openmpi-feedstock/pull/141#issuecomment-1915451557.

leofang commented 4 months ago

@conda-forge-admin, please rerender

jakirkham commented 4 months ago

If that doesn't work, would suggest creating a diff of the dependencies installed when it was working and after it broke. That may shed some light on other relevant changes

Edit: For example, wouldn't be surprised if there was a buggy version of LIEF and we need to downgrade

Edit 2: It looks similar to issue ( https://github.com/lief-project/LIEF/issues/653 )

jakirkham commented 4 months ago

Maybe we could try switching to patchelf instead of LIEF (for example):

build:
  ...
  rpaths_patcher: patchelf
minrk commented 4 months ago

here's a diff of the build logs.

the only package difference between bad/good is in the host environment:

-    openssl:         3.2.1-hd590300_0            conda-forge
-    rdma-core:       50.0-hd3aeb46_0             conda-forge
+    openssl:         3.2.0-hd590300_1            conda-forge
+    rdma-core:       49.0-hd3aeb46_2             conda-forge

So first thing to try is probably to pin rdma-core to 49 (trying in this PR now)

while the root environment has small differences unlikely to affect things:

@@ -31,3 +31,3 @@
 conda-index               0.3.0              pyhd8ed1ab_1    conda-forge
-conda-libmamba-solver     24.1.0             pyhd8ed1ab_0    conda-forge
+conda-libmamba-solver     23.12.0            pyhd8ed1ab_0    conda-forge
 conda-oci-mirror          0.1.0              pyhd8ed1ab_0    conda-forge
@@ -103,3 +103,3 @@
 openjpeg                  2.5.0                h488ebb8_3    conda-forge
-openssl                   3.2.1                hd590300_0    conda-forge
+openssl                   3.2.0                hd590300_1    conda-forge
 oras-py                   0.1.14             pyhd8ed1ab_0    conda-forge
@@ -132,3 +132,3 @@
 python_abi                3.10                    4_cp310    conda-forge
-pytz                      2023.4             pyhd8ed1ab_0    conda-forge
+pytz                      2023.3.post1       pyhd8ed1ab_0    conda-forge
 pyyaml                    6.0.1           py310h2372a71_1    conda-forge

Other differences:

I don't really understand what could be causing that
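A comparison like the one above can also be done programmatically. This is a hedged sketch that assumes listing lines of the form `name: version-build channel` as shown in the host environment diff; `parse_packages` and `changed_packages` are hypothetical helper names, and the `unchanged-pkg` entry is invented purely to show that identical packages are filtered out.

```python
def parse_packages(listing):
    """Map package name to its 'version-build' string for each listing line."""
    packages = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        packages[fields[0].rstrip(":")] = fields[1]
    return packages


def changed_packages(good, bad):
    """Return {name: (good_version, bad_version)} for packages that differ."""
    names = sorted(set(good) | set(bad))
    return {
        name: (good.get(name), bad.get(name))
        for name in names
        if good.get(name) != bad.get(name)
    }


# Host environment entries taken from the diff above, plus one
# hypothetical unchanged package.
good_host = """\
openssl:         3.2.0-hd590300_1            conda-forge
rdma-core:       49.0-hd3aeb46_2             conda-forge
unchanged-pkg:   1.0-0                       conda-forge
"""
bad_host = """\
openssl:         3.2.1-hd590300_0            conda-forge
rdma-core:       50.0-hd3aeb46_0             conda-forge
unchanged-pkg:   1.0-0                       conda-forge
"""

changes = changed_packages(parse_packages(good_host), parse_packages(bad_host))
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```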

minrk commented 4 months ago

demoting rdma-core didn't fix it. The LIEF check is unconditional if LIEF is available, so selecting patchelf doesn't prevent the use of LIEF. The only way to prevent the failing call is to remove LIEF.

Is there a way to make LIEF unavailable in the conda-build environment on the conda-forge builds? That would allow us to actually just use patchelf.

These two PRs would (each, separately) fix the issue, I think:

mbargull commented 4 months ago

The rdma-core=50 and rdma-core=49.1 builds are faulty in that ldconfig -vn "${PREFIX}/lib" creates symlinks for libmana/libmlx5 with those broken filenames. The SONAMEs of the libraries look alright, though. IDK, why it happens but I'm currently trying to replicate it.
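To check a directory for such broken names, one can list it with byte paths and test each name for valid UTF-8. This is only a sketch under the assumption that the broken symlinks end up directly in the scanned directory; the helper names are made up, and nothing here is part of conda-build.

```python
import os


def is_valid_utf8(raw_name):
    """Return True if the raw byte string decodes as strict UTF-8."""
    try:
        raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8_names(directory):
    """Return the raw byte names in directory that are not valid UTF-8."""
    bad = []
    # Passing a bytes path makes os.scandir yield bytes names, so no
    # surrogate escaping happens before we inspect them.
    with os.scandir(os.fsencode(directory)) as entries:
        for entry in entries:
            if not is_valid_utf8(entry.name):
                bad.append(entry.name)
    return sorted(bad)
```

Inside a build script this could be called as `find_non_utf8_names(os.path.join(os.environ["PREFIX"], "lib"))` right after the step that runs `ldconfig`, and the build could be failed early if the result is non-empty.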

mbargull commented 4 months ago

Seems to be an issue with conda-build>=3.28; still investigating...

mbargull commented 4 months ago

Seems to be an issue with conda-build>=3.28; still investigating...

There is a small-ish bug in conda-build>=3.28 which makes it run patchelf not only on the actual binary but also on its symlinks. In this case we have symlinks like libibverbs/libmana-rdmav34.so -> ../libmana.so.1.0.50.0 in rdma-core which, when used as patchelf's input, set the rpath for libmana.so.1.0.50.0 to $ORIGIN/.:$ORIGIN/.. when it should be $ORIGIN/. alone. This is of course wrong, but shouldn't cause too many problems. Unfortunately, we then run into a patchelf bug which leads to ldconfig (called in downstream build/install scripts like here) creating additional symlinks with non-UTF-8 filenames. I'll write an issue, test and fix for conda-build and link it here later.
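The rpath mix-up can be modelled with plain path arithmetic. The sketch below is a simplified illustration and not conda-build's actual code: it assumes the rpath entry is computed relative to the directory of whatever path is handed to the patcher, and `origin_rpath` and `collect_rpaths` are hypothetical names.

```python
import posixpath


def origin_rpath(binary_path, lib_dir):
    """Rpath entry pointing at lib_dir, relative to binary_path's directory."""
    rel = posixpath.relpath(lib_dir, posixpath.dirname(binary_path))
    return "$ORIGIN/" + rel


def collect_rpaths(files, lib_dir, skip_symlinks):
    """Model the rpath each real file ends up with.

    files maps a path to its symlink target, or to None for a regular file.
    """
    rpaths = {}
    for path, target in files.items():
        if target is not None and skip_symlinks:
            continue  # modelled fix: never patch through a symlink
        if target is None:
            real = path
        else:
            # Patching a symlink modifies the file it points to.
            real = posixpath.normpath(
                posixpath.join(posixpath.dirname(path), target)
            )
        entry = origin_rpath(path, lib_dir)
        entries = rpaths.setdefault(real, [])
        if entry not in entries:
            entries.append(entry)
    return {real: ":".join(entries) for real, entries in rpaths.items()}


files = {
    "lib/libmana.so.1.0.50.0": None,
    "lib/libibverbs/libmana-rdmav34.so": "../libmana.so.1.0.50.0",
}

buggy = collect_rpaths(files, "lib", skip_symlinks=False)
fixed = collect_rpaths(files, "lib", skip_symlinks=True)
print(buggy)
print(fixed)
```

In this model the symlink in `lib/libibverbs` contributes the extra `$ORIGIN/..` entry, while skipping symlinks leaves the library with the single entry `$ORIGIN/.`.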

leofang commented 4 months ago

@conda-forge-admin, please rerender

github-actions[bot] commented 4 months ago

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you, but it looks like there was nothing to do.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/7966217037.

leofang commented 4 months ago

@conda-forge-admin, please restart ci

leofang commented 4 months ago

It seems @mbargull's PR https://github.com/conda/conda-build/pull/5181 was merged and available in conda-build 24.1.2, but we need to bump the pinned version in conda-forge-ci-setup to use it.

h-vetinari commented 4 months ago

but we need to bump the pinned version in conda-forge-ci-setup to use it.

seems we need to remove this line now that conda-build uses CalVer - care to send a PR?

leofang commented 4 months ago

seems we need to remove this line now that conda-build uses CalVer - care to send a PR?

Sure, see https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/304.

leofang commented 4 months ago

@conda-forge-admin, please restart ci

leofang commented 4 months ago

seems we need to remove this line now that conda-build uses CalVer - care to send a PR?

Sure, see conda-forge/conda-forge-ci-setup-feedstock#304.

It seems conda-build is still pinned at 3.28.4. @h-vetinari @beckermr any idea what else I missed that needs to be modified?

leofang commented 4 months ago

I guess the pinned conda-build might come from the docker images. Can any of you rebuild the images with the latest conda-build? It seems possible after @jakirkham's PR: https://github.com/conda-forge/docker-images/pull/230

h-vetinari commented 4 months ago

Have a look at this: https://github.com/conda-forge/openmpi-feedstock/blob/efdc204c50e34b01b95170d54debd6d97def2d1f/.scripts/build_steps.sh#L36-L39

Modifying this to require newest conda-build gives a long resolution error:

>mamba create -n test pip mamba conda-build=24 boa conda-forge-ci-setup=4

Paring this down a bit by pinning to the last boa version (that should by all appearances not be pinned) yields:

>mamba create -n test pip mamba conda-build=24 boa=0.16 conda-forge-ci-setup=4

Looking for: ['pip', 'mamba', 'conda-build=24', 'boa=0.16', 'conda-forge-ci-setup=4']

conda-forge/win-64                                          Using cache
conda-forge/noarch                                          Using cache
Could not solve for environment specs
The following packages are incompatible
├─ boa 0.16**  is installable and it requires
│  └─ conda-build >=3.24,<24.1.0a0 , which can be installed;
└─ conda-build 24**  is not installable because it conflicts with any installable versions previously reported.

Thus we find https://github.com/conda-forge/conda-forge-repodata-patches-feedstock/pull/657 due to https://github.com/mamba-org/boa/issues/392. There's discussion about unblocking conda-build 24 in https://github.com/conda-forge/conda-smithy/pull/1844 already, and I just opened https://github.com/conda-forge/boa-feedstock/issues/79 for an alternative approach that's slightly less intrusive (IMO).
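The conflict reported by the solver boils down to a version range check. Here is a deliberately simplified sketch: it only handles purely numeric versions, and it approximates the upper bound `<24.1.0a0` from the solver output as `<24.1.0`, which is an assumption that ignores pre-release ordering. The function names are made up and this is not how mamba evaluates constraints.

```python
def parse_version(text):
    """Parse a purely numeric dotted version such as '24.1.2' into a tuple."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, lower, upper):
    """Return True if lower <= version < upper, comparing numeric tuples."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# boa 0.16 requires conda-build >=3.24,<24.1.0a0 (upper bound simplified).
BOA_LOWER, BOA_UPPER = "3.24", "24.1.0"

for candidate in ("3.28.4", "24.1.2"):
    ok = satisfies(candidate, BOA_LOWER, BOA_UPPER)
    print(f"conda-build {candidate} compatible with boa 0.16: {ok}")
```

Under these assumptions the pinned 3.28.4 fits inside boa's range, while 24.1.2 (the release carrying the conda-build fix) falls outside it, which matches the solver message above.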

jakirkham commented 4 months ago

Wouldn't we also need to rebuild rdma-core first to produce packages with the fix for openmpi to use?

Admittedly I've not been following this as closely as others, so that could be totally off base

xhochy commented 4 months ago

@conda-forge-admin please rerender

leofang commented 4 months ago

@mbargull @h-vetinari any other thoughts? 😛 Now we are using the latest conda-build, but we still hit the same error...

jakirkham commented 4 months ago

Am curious what others think of my question above: https://github.com/conda-forge/openmpi-feedstock/pull/142#issuecomment-1955964849

Suspect that is an important next step

leofang commented 4 months ago

Am curious what others think of my question above: https://github.com/conda-forge/openmpi-feedstock/pull/142#issuecomment-1955964849

Based on @mbargull's earlier analysis, the CI failed because we had two unrelated bugs acting jointly:

  1. conda-build
  2. patchelf (calling ldconfig that created the buggy filenames, see also here)

I was hoping that with all the fixes we now eliminate Bug 1, thereby stopping the joint action of both bugs, as we are supposed to only have Bug 2 left and it should be relatively harmless. If my understanding is correct, then it means we still have a gap in this analysis.

h-vetinari commented 4 months ago

Wouldn't we also need to rebuild rdma-core first to produce packages with the fix for openmpi to use?

AFAIU, as long as that malformed (non-UTF-8) path is there - which is still the case - the rdma rebuild shouldn't make a difference. Unless we completely get rid of the symlinks in that package, but that's not a reasonable ask; we should IMO fix (or work around) the bug in patchelf/conda-build.

leofang commented 4 months ago

@mbargull any chance you have further insight on this issue? 🙂

mbargull commented 3 months ago

@mbargull any chance you have further insight on this issue? 🙂

Sorry, didn't follow this issue; looks like rdma-core has not been rebuilt yet. The libraries therein are broken for the versions built with conda-build >=3.28,<24.1.2, see my previous comment https://github.com/conda-forge/openmpi-feedstock/pull/142#issuecomment-1936503202 :

The rdma-core=50 and rdma-core=49.1 builds are faulty in that ldconfig -vn "${PREFIX}/lib" creates symlinks for libmana/libmlx5 with those broken filenames.

leofang commented 3 months ago

So was my understanding in https://github.com/conda-forge/openmpi-feedstock/pull/142#issuecomment-1965662764 incorrect, and there are still issues to fix, regardless of whether we fixed conda-build (1)?

mbargull commented 3 months ago

So was my understanding in #142 (comment) incorrect, that there are still issues to fix, regardless whether we fixed conda-build (1)?

Yes, it is partially incorrect. patchelf does not call ldconfig. There's no "gap in this analysis"; it's simply that the combination of bugs in conda-build and patchelf leads to faulty artifacts (in rdma-core) which, when processed downstream (this feedstock) with ldconfig, cause the creation of symlinks with broken names.

xhochy commented 3 months ago

I have put in a PR to rebuild rdma-core: https://github.com/conda-forge/rdma-core-feedstock/pull/17

mbargull commented 3 months ago

@conda-forge-admin, please rerender

leofang commented 3 months ago

@conda-forge-admin, please rerender

github-actions[bot] commented 3 months ago

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you, but it looks like there was nothing to do.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/8234009795.

leofang commented 3 months ago

Yes, it is partially incorrect. patchelf does not call ldconfig. (...) which when processed downstream (this feedstock) with ldconfig causes the creation of symlinks with broken names.

@mbargull Forgive my ignorance, so when exactly is ldconfig called and by whom?

mbargull commented 3 months ago

@mbargull Forgive my ignorance, so when exactly is ldconfig called and by whom?

https://www.gnu.org/software/libtool/manual/libtool.html#Finish-mode (You should find the exact invocations in the configure scripts, but the build's source archive might also carry a libtool.m4, which would be slightly more readable.)

leofang commented 3 months ago

linux-64 CI is finally happy!

leofang commented 3 months ago

@conda-forge-admin, please rerender