Closed — germa89 closed this 9 months ago
Related PR: https://github.com/ansys/pymapdl/pull/2514
I have been told that the `libgomp` dependency is "a redistributable of the GCC compiler, so it is not an OS dependency but an executable dependency".
Checking the latest GitHub runner Ubuntu 22.04 OS image: https://github.com/actions/runner-images/blob/releases/ubuntu22/20231115/images/ubuntu/Ubuntu2204-Readme.md
and the one published at the beginning of October: https://github.com/actions/runner-images/blob/releases/ubuntu22/20231001/images/linux/Ubuntu2204-Readme.md
I see no difference in `gcc` nor `g++`; they both use:
Name | Version |
---|---|
g++ | 4:11.2.0-1ubuntu1 |
gcc | 4:11.2.0-1ubuntu1 |
However, I have seen some differences:
Name | 20231115 | 20231001 |
---|---|---|
curl | 7.81.0-1ubuntu1.14 | 7.81.0-1ubuntu1.13 |
dnsutils | 1:9.18.18-0ubuntu0.22.04.1 | 1:9.18.12-0ubuntu0.22.04.3 |
libc6-dev | 2.35-0ubuntu3.4 | 2.35-0ubuntu3.3 |
libcurl4 | 7.81.0-1ubuntu1.14 | 7.81.0-1ubuntu1.13 |
libssl-dev | 3.0.2-0ubuntu1.12 | 3.0.2-0ubuntu1.10 |
locales | 2.35-0ubuntu3.4 | 2.35-0ubuntu3.3 |
xvfb | 2:21.1.4-2ubuntu1.7~22.04.2 | 2:21.1.4-2ubuntu1.7~22.04.1 |
But I can't really relate any of those differences to the current issue.
I have been pointed to check the lib requirements using `ldd`:
```bash
[root@451ff7711f64 linx64]# ldd ansys.e
./ansys.e: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./ansys.e)
./ansys.e: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./ansys.e)
./ansys.e: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./ansys.e)
        libansBLAS.so => not found
        libmkl_core.so => not found
        libmkl_intel_lp64.so => not found
        libmkl_intel_thread.so => not found
        libifport.so.5 => not found
        libifcoremt.so.5 => not found
        libimf.so => not found
        libsvml.so => not found
        libirc.so => not found
        libiomp5.so => not found
        libhdf5.so.103 => not found
        libhdf5_cpp.so.103 => not found
        libhdf5_hl.so.100 => not found
        libhdf5_hl_cpp.so.100 => not found
        libACE.so.7.0.2 => not found
        libACEXML.so.7.0.2 => not found
        libACEXML_Parser.so.7.0.2 => not found
        libMapdlExceptionClient.so => not found
        libTAO.so.3.0.2 => not found
        libTAO_AnyTypeCode.so.3.0.2 => not found
        libTAO_BiDirGIOP.so.3.0.2 => not found
        libTAO_CodecFactory.so.3.0.2 => not found
        libTAO_PortableServer.so.3.0.2 => not found
        libz.so => not found
        libpng.so => not found
        libtiff.so => not found
        libjpeg.so => not found
        libboost_filesystem.so.1.71.0 => not found
        libboost_system.so.1.71.0 => not found
        libgmp.so.10 => /lib64/libgmp.so.10 (0x00007fffffb54000)
        libansGPU.so => not found
        libansuser.so => not found
        libansys.so => not found
        libansScaLAPACK.so => not found
        libansHDF.so => not found
        libansMemManager.so => not found
        libansMPI.so => not found
        libansysb.so => not found
        libansysx.so => not found
        libmnf.so => not found
        libansOpenMP.so => not found
        libansMETIS.so => not found
        libansParMETIS.so => not found
        libcadoe_algorithms.so => not found
        libCadoeInterpolation.so => not found
        libCadoeKernel.so => not found
        libCadoeLegacy.so => not found
        libCadoeMath.so => not found
        libCadoeReaders.so => not found
        libCadoeReadersExt.so => not found
        libcgns.so => not found
        libchap.so => not found
        libcif.so => not found
        libdsp.so => not found
        libansgil.so => not found
        libqhull.so => not found
        libansexb.so => not found
        libApipWrapper.so => not found
        liboctree-mesh.so => not found
        libansResourcePredict.so => not found
        libtg.so => not found
        libPrimeMesh.so => not found
        libansOpenSSL.so => not found
        libvtk.so => not found
        libspooles.so => not found
        libdmumps.so => not found
        libzmumps.so => not found
        libGL.so.1 => /lib64/libGL.so.1 (0x00007fffff8bc000)
        libGLU.so.1 => /lib64/libGLU.so.1 (0x00007fffff63b000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fffff41f000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fffff11d000)
        libXp.so.6 => /lib64/libXp.so.6 (0x00007ffffef13000)
        libXm.so.4 => /lib64/libXm.so.4 (0x00007ffffea40000)
        libXext.so.6 => /lib64/libXext.so.6 (0x00007ffffe82e000)
        libXi.so.6 => /lib64/libXi.so.6 (0x00007ffffe61d000)
        libXt.so.6 => /lib64/libXt.so.6 (0x00007ffffe3b6000)
        libX11.so.6 => /lib64/libX11.so.6 (0x00007ffffe078000)
        libSM.so.6 => /lib64/libSM.so.6 (0x00007ffffde6f000)
        libICE.so.6 => /lib64/libICE.so.6 (0x00007ffffdc53000)
        libXmu.so.6 => /lib64/libXmu.so.6 (0x00007ffffda38000)
        librt.so.1 => /lib64/librt.so.1 (0x00007ffffd82e000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007ffffd526000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ffffd310000)
        libintlc.so.5 => not found
        libc.so.6 => /lib64/libc.so.6 (0x00007ffffcf41000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007ffffcd3d000)
        libGLX.so.0 => /lib64/libGLX.so.0 (0x00007ffffcb0a000)
        libGLdispatch.so.0 => /lib64/libGLdispatch.so.0 (0x00007ffffc854000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fffffddc000)
        libXau.so.6 => /lib64/libXau.so.6 (0x00007ffffc650000)
        libXft.so.2 => /lib64/libXft.so.2 (0x00007ffffc439000)
        libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007ffffc1e4000)
        libpng15.so.15 => /lib64/libpng15.so.15 (0x00007ffffbfb9000)
        libxcb.so.1 => /lib64/libxcb.so.1 (0x00007ffffbd90000)
        libuuid.so.1 => /lib64/libuuid.so.1 (0x00007ffffbb8b000)
        libfontconfig.so.1 => /lib64/libfontconfig.so.1 (0x00007ffffb948000)
        libfreetype.so.6 => /lib64/libfreetype.so.6 (0x00007ffffb689000)
        libXrender.so.1 => /lib64/libXrender.so.1 (0x00007ffffb47e000)
        libz.so.1 => /lib64/libz.so.1 (0x00007ffffb267000)
        libexpat.so.1 => /lib64/libexpat.so.1 (0x00007ffffb03d000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007ffffae2d000)
```
It seems many libs are not found... Then I did:
```bash
[root@451ff7711f64 linx64]# ldd ansys.e | grep libgomp
./ansys.e: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./ansys.e)
./ansys.e: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./ansys.e)
./ansys.e: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./ansys.e)
```
I don't understand how grep shows that... I guess it is just an echo printed by ldd?
Anyway, I don't really know yet what's going on... The docker image's libstdc++ seems to stop at an older GLIBCXX_3.4.x version... Maybe I should install a newer one so those symbols can be found?
How are you starting the container? Is there any change in the docker runtime?
I start the container with this:
https://github.com/ansys/pymapdl/blob/main/.ci/start_mapdl.sh
There was a new version of the docker runtime around the end of October...
https://docs.docker.com/engine/release-notes/24.0/#2407
But I can't get anything out of the changelog...
> I don't understand how grep shows that...

Maybe this goes to stderr, not stdout?
> I don't understand how grep shows that...
>
> Maybe this goes to stderr, not stdout?

I guess 🤷‍♂️
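The stderr hypothesis is easy to test. If `ldd` prints those `version ... not found` diagnostics to stderr, they bypass the pipe to `grep` entirely and land on the terminal unfiltered. A minimal sketch simulating that behaviour (the `emit` function is a stand-in for illustration, not `ldd` itself):

```shell
# Stand-in for a command that, like ldd, may write diagnostics to stderr
# while the dependency list goes to stdout:
emit() {
    echo "libgomp.so.1 => not found"                  # stdout
    echo "version 'GLIBCXX_3.4.20' not found" >&2     # stderr
}

# grep only filters stdout; the stderr line still reaches the terminal:
emit | grep libgomp

# Silencing stderr leaves only the genuine matches:
emit 2>/dev/null | grep libgomp
```

So running `ldd ansys.e 2>/dev/null | grep libgomp` would show only real `libgomp` entries, if the diagnostics indeed go to stderr.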
@germa89
This error usually means that the executable has been built with a newer version of GCC than the one available on the machine (or near the executable).
That's why the versions are not found: the executable searches for those ABI versions in the library but doesn't find them.
This tends to confirm one of the following:

- ansys.e has been recompiled with a newer version of GCC recently
- The redistributables of the GCC compiler (libstdc++.so.6, libgomp, ...) are not delivered anymore with ansys.e
- The GCC installed on the docker machine has regressed since the last time it worked (version 4 is a bit old anyway)

From the versions searched, it seems that ansys.e has been built with GCC 8 or 10, and so can't use libstdc++ from GCC 4.*
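One way to check which side is at fault is to list the symbol versions the system libstdc++ actually provides and compare them with what the binary requests. A sketch, assuming a glibc-based Linux where `ldconfig` can locate the library (`grep -a` is used instead of `strings` in case binutils is not installed):

```shell
# Locate the system libstdc++ and list the GLIBCXX symbol versions it exports.
lib=$( (ldconfig -p 2>/dev/null || /sbin/ldconfig -p) \
        | awk '/libstdc\+\+\.so\.6/ {print $NF; exit}')
echo "Inspecting: $lib"

# grep -a treats the binary as text; each GLIBCXX_3.4.x marks a GCC release.
grep -aoE 'GLIBCXX_[0-9]+\.[0-9]+(\.[0-9]+)?' "$lib" | sort -uV | tail -n 5
```

If the binary requests `GLIBCXX_3.4.20` or `GLIBCXX_3.4.21` (symbols that first appeared in GCC 4.9 and GCC 5, respectively) and this list stops earlier, the system libstdc++ is simply too old for the executable.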
From internal investigations made by @dts12263:
- the v21.2.0 container does NOT contain libansBLAS.so. Not sure how the test was previously passing
- the native v212 install does have ansblas, so it seems it was just not packaged for whatever reason
@germa89 : If v21.2.0 has never had libansBLAS.so, why was it able to launch before, but not now? The same goes for v212.
Hi @germa89, about the libansBlas.a file missing in the container, I'm a bit surprised, but if that's the case we should discuss with the MAPDL DevOps team to fix it. I've checked in my local dev distrib and it's part of the repo. In the same way, I was thinking the MAPDL distrib does not rely on gcc libs already existing on the machine, but provides its own gcc libs. That's also a question for DevOps. Perhaps we have done too many optimizations making the smallest docker container for MAPDL..
Hi @FredAns
The failing images were created in October 2021... and they had been working properly until the beginning of this November. No changes on our side.
Can it be the GitHub runners??
> The failing images were created in October 2021... and they have been working properly until beginning of this November. No changes in our side.
Could it be loaded as part of a branch that only executes when specific hardware is present?
Ok, I had a better look. In MAPDL, this libansBlas.a is just a wrapper to the Math library we need to use on a specific hardware. If you ldd this libansBlas.so library, it relies on the MKL (Intel processors) or BLIS (AMD processors) Math Kernel libraries. In my repo, I can see a blas/ -> amd/libansBlas.so -> intel/libansBlas.so
At runtime we are supposed to pick the right one, depending on the machine we run on. Here are the dependencies of the Intel one:
On my machine these libansBlas.so are located here: /ansys_inc/v242/ansys/lib/linx64/blas/ Not sure if we have the same organization in the container
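The runtime selection described above could look roughly like this (assumed logic for illustration only, not MAPDL's actual code; the `blas/` layout is the one quoted in this thread):

```shell
# Pick the BLAS wrapper directory based on the CPU vendor (illustrative only).
BLAS_ROOT=/ansys_inc/v221/ansys/lib/linx64/blas

vendor=$(awk -F': ' '/^vendor_id/ {print $2; exit}' /proc/cpuinfo)
case "$vendor" in
    AuthenticAMD) BLAS_DIR="$BLAS_ROOT/amd" ;;     # BLIS-backed wrapper
    *)            BLAS_DIR="$BLAS_ROOT/intel" ;;   # MKL-backed wrapper (assumed fallback)
esac
echo "would load libansBLAS.so from: $BLAS_DIR"
```

On the failing runners this would resolve to `blas/amd`, which (as shown later in the thread) the v221 container simply does not ship.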
On the v221 container (I installed `tree` manually):
```bash
[root@41421ddb9d79 /]# tree /ansys_inc/v221/ansys/lib/linx64/blas/
/ansys_inc/v221/ansys/lib/linx64/blas/
`-- intel
    `-- libansBLAS.so

1 directory, 1 file
```
are we just missing the AMD variant?
@germa89 can you run a `cat /proc/cpuinfo` on the runners?
@greschd good point. @germa89 are we running on an AMD platform?
Done!
```output
##[debug]bash --noprofile --norc -e -o pipefail /__w/_temp/8f4c2a72-f352-446d-b539-8434950b88a0.sh
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7763 64-Core Processor
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 3243.425
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 4890.85
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7763 64-Core Processor
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 3243.611
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 4890.85
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7763 64-Core Processor
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 3242.860
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 4890.85
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7763 64-Core Processor
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 3243.990
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 4890.85
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:
```
They are AMD!!!
So I guess the GitHub runners have moved from Intel to AMD? ... I could not find anything on the internet regarding that change.
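For future runs, a compact check avoids pasting the whole of `/proc/cpuinfo` into the job log:

```shell
# One line per fact: vendor, model, and logical CPU count.
grep -m1 '^vendor_id'  /proc/cpuinfo   # AuthenticAMD vs GenuineIntel
grep -m1 '^model name' /proc/cpuinfo
nproc
```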
Probably the missing `libgomp` dependency is also related to that?
To confirm this, you could try spinning up some cloud instance of each type (Intel / AMD), and try running the MAPDL docker image on both.
I would guess Github doesn't generally communicate which hardware Actions runs on, to avoid creating specific assumptions / expectations based on that.
> To confirm this, you could try spinning up some cloud instance of each type (Intel / AMD), and try running the MAPDL docker image on both.
I haven't seen any option in GitHub to choose AMD/Intel... 🤷🏻‍♂️
It seems from this article that the machines used to be Intel. It is not confirmation, but it is something:

> More organizations are applying a DevOps thought-process and methodology to optimize software development. One of the main tools used in this process is a continuous integration (CI) tool, which automates the integration of code changes from multiple developers working on the same project.
> @germa89 this error usually means that the executable has been built with a newer version of GCC than the one available on the machine (or near the executable). That's why the versions are not found: the executable searches for those ABI versions in the library but doesn't find them. This tends to confirm one of the following:
>
> - ansys.e has been recompiled with a newer version of GCC recently
> - The redistributables of the GCC compiler (libstdc++.so.6, libgomp, ...) are not delivered anymore with ansys.e
> - The GCC installed on the docker machine has regressed since the last time it worked (version 4 is a bit old anyway)
>
> From the versions searched, it seems that ansys.e has been built with GCC 8 or 10, and so can't use libstdc++ from GCC 4.*
@jomadec Not exactly, MAPDL does ship gcc 8, but not in the same executable location. The mapdl executable always runs under a wrapper script that sets LD_LIBRARY_PATH to the location of gcc runtime. This is what the landing zone concept by @jhdub23 is meant to solve.
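A sketch of that wrapper pattern (all paths and variable names here are hypothetical, not the actual MAPDL layout): prepend the bundled GCC runtime directory to `LD_LIBRARY_PATH` before handing over to the real binary.

```shell
#!/bin/sh
# Hypothetical launcher: make the bundled libstdc++/libgomp visible first.
INSTALL_DIR="${AWP_ROOT:-/ansys_inc/v242/ansys}"   # hypothetical install root
GCC_RUNTIME="$INSTALL_DIR/syslib/gcc"              # hypothetical bundled-runtime dir

# Prepend the runtime dir, keeping any pre-existing value after it.
LD_LIBRARY_PATH="$GCC_RUNTIME${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

# The real wrapper would now exec the solver binary, e.g.:
# exec "$INSTALL_DIR/bin/ansys.e" "$@"
```

This explains why `ldd ansys.e` run directly inside the container reports so many `not found` entries: without the wrapper's environment, the loader has no way to locate the bundled libraries.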
> I haven't seen any option in github to choose AMD/Intel
What I meant is to launch an AMD / Intel VM on <cloud provider of choice>, not through GitHub Actions. If the same error occurs when launching the MAPDL container there, we can be fairly confident this is the underlying change that triggered these failures.
Of course you can also use a local machine, if you have an AMD one.
@dts12263 has been able to replicate the issue:
> confirmed the 212 image runs on an intel machine but crashed on an AMD machine because of not having the AMD ansblas
Thank you for your input @greschd @FredAns @koubaa @jomadec and @dts12263. We couldn't have figured this out without you!
If anyone is interested, I recently got this information from running codespaces:
Processor Model: AMD EPYC 7763 64-Core Processor
Funnily enough, my codespace was using only one physical processor.
## Problem
Suddenly some docker images require a library (either `libgomp.so` or `libansBLAS.so`) to launch MAPDL. However, the docker images have not been changed in 9 months, and they have been working fine until now.

## Details
I first saw this error with the Ubuntu docker images (which are old too, around 9 months). The `libgomp` issue on Ubuntu docker images was reported and fixed here: https://github.com/ansys/pymapdl/pull/2514. The solution was installing the `libgomp` dependency during the job.

But then, @clatapie realised it seems to also affect the older MAPDL docker images (<v23.1). Newer docker images are not affected because that library is already installed (ping @dts12263 for more info).
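For reference, the fix in that PR boils down to a CI step of this shape (a configuration fragment; the package name `libgomp1` is an assumption for apt-based Ubuntu runners, not the literal workflow content):

```shell
# On apt-based runners, the GCC OpenMP runtime ships in the libgomp1 package.
sudo apt-get update
sudo apt-get install -y libgomp1
```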
This issue has been going on since the beginning of November (between 01 and 06 November), but I didn't realise until now.
## Notes
I should note that the Ubuntu docker images are used to run the tests from inside the container, whereas the older docker images are mostly based on CentOS: we run the tests in the GitHub runner OS (Ubuntu) and connect to the running container with the Ansys product (CentOS).
## Why this error now?
A container is definitely not a 100% isolated environment from the host OS. They do share some things with the host (the kernel, for instance), so maybe the GitHub runners no longer provide those dependencies. I have tracked that new GitHub runner images were published at the end of October.
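The kernel-sharing point is easy to see for yourself: a container reports the host's kernel, so changes in the host's hardware or kernel leak into "unchanged" images.

```shell
# A container has no kernel of its own; uname reports the host kernel.
uname -r   # kernel release (same inside and outside the container)
uname -m   # machine architecture
# Compare with, e.g.: docker run --rm ubuntu:22.04 uname -r
```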
If it is a missing dependency on the runners, installing that dependency (it does not need to be `libgomp`, it might have another name) should fix it. However, I believe `libansBLAS` is a custom Ansys library, so we cannot just install it. It does not make sense at all!