ansys / pymapdl

Pythonic interface to MAPDL
https://mapdl.docs.pyansys.com
MIT License

Scheduled runs fail #2520

Closed germa89 closed 9 months ago

germa89 commented 10 months ago

Problem

Suddenly, some docker images require a library (either libgomp.so or libansBLAS.so) to launch MAPDL. However, the docker images have not been changed in 9 months, and they had been working fine until now.

Details

I first saw this error with the Ubuntu docker images (which are also old, around 9 months). The libgomp issue on the Ubuntu docker images was reported and fixed here: https://github.com/ansys/pymapdl/pull/2514. The solution was installing the libgomp dependency during the job.
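For context, here is a minimal sketch of the kind of check/workaround applied there (the package name `libgomp1` is the Ubuntu name for the GCC OpenMP runtime; the exact step used in #2514 may differ):

```shell
# Check whether the dynamic linker can resolve libgomp, and suggest the
# Ubuntu package if not (package names differ on other distros).
if ldconfig -p 2>/dev/null | grep -q 'libgomp\.so'; then
  echo "libgomp found"
else
  echo "libgomp missing - try: sudo apt-get install -y libgomp1"
fi
```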

But then, @clatapie realised that it also seems to affect the older MAPDL docker images (<v23.1). Newer docker images are not affected because that library is already installed (ping @dts12263 for more info).

This issue has been occurring since the beginning of November (between 1 and 6 November), but I didn't notice until now.

Notes

I should note that the Ubuntu docker images are used to run the tests from inside the container, whereas the older docker images are mostly based on CentOS. In that case we run the tests on the GitHub runner OS (Ubuntu) and connect to the running container with the Ansys product (CentOS).

Why this error now?

Definitely, a container is not a 100% isolated environment from the host OS. They do share some things (the kernel, for instance), so maybe the GitHub runners no longer provide those dependencies. I have tracked that new GitHub runner images were published at the end of October.

If it is a missing dependency on the runners, installing it (it does not need to be libgomp; it might have another name) should fix the issue. However, I believe libansBLAS is a custom Ansys library, so we cannot just install it.

It does not make sense at all!

germa89 commented 10 months ago

Related PR: https://github.com/ansys/pymapdl/pull/2514

germa89 commented 10 months ago

It has been pointed out to me that libgomp is "a redistributable of the GCC compiler, so it is not an OS dependency but an executable dependency".

Checking the latest github runner ubuntu 22.04 OS image: https://github.com/actions/runner-images/blob/releases/ubuntu22/20231115/images/ubuntu/Ubuntu2204-Readme.md

and the one published at the beginning of October: https://github.com/actions/runner-images/blob/releases/ubuntu22/20231001/images/linux/Ubuntu2204-Readme.md

I see no difference in gcc or g++; both images use:

| Name | Version |
|------|---------|
| g++  | 4:11.2.0-1ubuntu1 |
| gcc  | 4:11.2.0-1ubuntu1 |

I have, however, seen some differences:

| Name | 20231115 | 20231001 |
|------|----------|----------|
| curl | 7.81.0-1ubuntu1.14 | 7.81.0-1ubuntu1.13 |
| dnsutils | 1:9.18.18-0ubuntu0.22.04.1 | 1:9.18.12-0ubuntu0.22.04.3 |
| libc6-dev | 2.35-0ubuntu3.4 | 2.35-0ubuntu3.3 |
| libcurl4 | 7.81.0-1ubuntu1.14 | 7.81.0-1ubuntu1.13 |
| libssl-dev | 3.0.2-0ubuntu1.12 | 3.0.2-0ubuntu1.10 |
| locales | 2.35-0ubuntu3.4 | 2.35-0ubuntu3.3 |
| xvfb | 2:21.1.4-2ubuntu1.7~22.04.2 | 2:21.1.4-2ubuntu1.7~22.04.1 |

But I cannot really relate any of those differences to the current issue.

germa89 commented 10 months ago

I have been pointed to check the lib requirements using `ldd`.

Output details

```bash
[root@451ff7711f64 linx64]# ldd ansys.e
./ansys.e: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./ansys.e)
./ansys.e: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./ansys.e)
./ansys.e: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./ansys.e)
        libansBLAS.so => not found
        libmkl_core.so => not found
        libmkl_intel_lp64.so => not found
        libmkl_intel_thread.so => not found
        libifport.so.5 => not found
        libifcoremt.so.5 => not found
        libimf.so => not found
        libsvml.so => not found
        libirc.so => not found
        libiomp5.so => not found
        libhdf5.so.103 => not found
        libhdf5_cpp.so.103 => not found
        libhdf5_hl.so.100 => not found
        libhdf5_hl_cpp.so.100 => not found
        libACE.so.7.0.2 => not found
        libACEXML.so.7.0.2 => not found
        libACEXML_Parser.so.7.0.2 => not found
        libMapdlExceptionClient.so => not found
        libTAO.so.3.0.2 => not found
        libTAO_AnyTypeCode.so.3.0.2 => not found
        libTAO_BiDirGIOP.so.3.0.2 => not found
        libTAO_CodecFactory.so.3.0.2 => not found
        libTAO_PortableServer.so.3.0.2 => not found
        libz.so => not found
        libpng.so => not found
        libtiff.so => not found
        libjpeg.so => not found
        libboost_filesystem.so.1.71.0 => not found
        libboost_system.so.1.71.0 => not found
        libgmp.so.10 => /lib64/libgmp.so.10 (0x00007fffffb54000)
        libansGPU.so => not found
        libansuser.so => not found
        libansys.so => not found
        libansScaLAPACK.so => not found
        libansHDF.so => not found
        libansMemManager.so => not found
        libansMPI.so => not found
        libansysb.so => not found
        libansysx.so => not found
        libmnf.so => not found
        libansOpenMP.so => not found
        libansMETIS.so => not found
        libansParMETIS.so => not found
        libcadoe_algorithms.so => not found
        libCadoeInterpolation.so => not found
        libCadoeKernel.so => not found
        libCadoeLegacy.so => not found
        libCadoeMath.so => not found
        libCadoeReaders.so => not found
        libCadoeReadersExt.so => not found
        libcgns.so => not found
        libchap.so => not found
        libcif.so => not found
        libdsp.so => not found
        libansgil.so => not found
        libqhull.so => not found
        libansexb.so => not found
        libApipWrapper.so => not found
        liboctree-mesh.so => not found
        libansResourcePredict.so => not found
        libtg.so => not found
        libPrimeMesh.so => not found
        libansOpenSSL.so => not found
        libvtk.so => not found
        libspooles.so => not found
        libdmumps.so => not found
        libzmumps.so => not found
        libGL.so.1 => /lib64/libGL.so.1 (0x00007fffff8bc000)
        libGLU.so.1 => /lib64/libGLU.so.1 (0x00007fffff63b000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fffff41f000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fffff11d000)
        libXp.so.6 => /lib64/libXp.so.6 (0x00007ffffef13000)
        libXm.so.4 => /lib64/libXm.so.4 (0x00007ffffea40000)
        libXext.so.6 => /lib64/libXext.so.6 (0x00007ffffe82e000)
        libXi.so.6 => /lib64/libXi.so.6 (0x00007ffffe61d000)
        libXt.so.6 => /lib64/libXt.so.6 (0x00007ffffe3b6000)
        libX11.so.6 => /lib64/libX11.so.6 (0x00007ffffe078000)
        libSM.so.6 => /lib64/libSM.so.6 (0x00007ffffde6f000)
        libICE.so.6 => /lib64/libICE.so.6 (0x00007ffffdc53000)
        libXmu.so.6 => /lib64/libXmu.so.6 (0x00007ffffda38000)
        librt.so.1 => /lib64/librt.so.1 (0x00007ffffd82e000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007ffffd526000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ffffd310000)
        libintlc.so.5 => not found
        libc.so.6 => /lib64/libc.so.6 (0x00007ffffcf41000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007ffffcd3d000)
        libGLX.so.0 => /lib64/libGLX.so.0 (0x00007ffffcb0a000)
        libGLdispatch.so.0 => /lib64/libGLdispatch.so.0 (0x00007ffffc854000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fffffddc000)
        libXau.so.6 => /lib64/libXau.so.6 (0x00007ffffc650000)
        libXft.so.2 => /lib64/libXft.so.2 (0x00007ffffc439000)
        libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007ffffc1e4000)
        libpng15.so.15 => /lib64/libpng15.so.15 (0x00007ffffbfb9000)
        libxcb.so.1 => /lib64/libxcb.so.1 (0x00007ffffbd90000)
        libuuid.so.1 => /lib64/libuuid.so.1 (0x00007ffffbb8b000)
        libfontconfig.so.1 => /lib64/libfontconfig.so.1 (0x00007ffffb948000)
        libfreetype.so.6 => /lib64/libfreetype.so.6 (0x00007ffffb689000)
        libXrender.so.1 => /lib64/libXrender.so.1 (0x00007ffffb47e000)
        libz.so.1 => /lib64/libz.so.1 (0x00007ffffb267000)
        libexpat.so.1 => /lib64/libexpat.so.1 (0x00007ffffb03d000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007ffffae2d000)
```

It seems many libs are not found...

Then I ran:

```bash
[root@451ff7711f64 linx64]# ldd ansys.e | grep libgomp
./ansys.e: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./ansys.e)
./ansys.e: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./ansys.e)
./ansys.e: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./ansys.e)
```

I don't understand how grep shows that... I guess it is just an echo printed by ldd?

Anyway, I don't really know yet what's going on... the docker image seems to be using GCC 3.4. Maybe I should install that version so it can be found?

> How are you starting the container? Is there any change in the docker runtime?

I start the container with this:

https://github.com/ansys/pymapdl/blob/main/.ci/start_mapdl.sh

There was a new version of the Docker runtime around the end of October...

https://docs.docker.com/engine/release-notes/24.0/#2407

But I can't glean anything from the changelog...

greschd commented 10 months ago

> I don't understand how grep shows that...

Maybe this goes to stderr, not stdout?
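That would explain it: a pipe only carries stdout, so anything ldd writes to stderr goes straight to the terminal, past grep. A minimal demo (not tied to ldd specifically):

```shell
# stderr bypasses the pipe: "to stderr" still prints even though grep
# matches nothing on stdout (|| true keeps grep's exit 1 from aborting).
( echo "to stdout"; echo "to stderr" >&2 ) | grep no-such-match || true
# Redirect stderr into stdout first if grep should see both streams:
( echo "to stdout"; echo "to stderr" >&2 ) 2>&1 | grep stderr
```

So `ldd ansys.e 2>/dev/null | grep libgomp` would silence those version messages.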

germa89 commented 10 months ago

> > I don't understand how grep shows that...
>
> Maybe this goes to stderr, not stdout?

I guess 🤷‍♂️

jomadec commented 10 months ago

@germa89 This error usually means that the executable has been built with a newer version of GCC than the one available on the machine (or shipped near the executable).
That's why the versions are not found: the executable "searches" for those ABI versions in the library but doesn't find them. This tends to confirm one of the following:

  • ansys.e has been recompiled with a newer version of GCC recently
  • The GCC redistributables (libstdc++.so.6, libgomp, ...) are no longer delivered with ansys.e
  • The GCC installed on the docker machine has regressed since the last time it worked (4 is a bit old)

By the versions searched, it seems that ansys.e has been built with GCC 8 or 10, and so can't use the libstdc++ from GCC 4.*.
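To check which symbol versions a given libstdc++ actually provides (and therefore roughly which GCC it came from), one can dump its version symbols; for reference, GLIBCXX_3.4.20 and CXXABI_1.3.8 first appeared in GCC 4.9, and GLIBCXX_3.4.21 in GCC 5. A hedged sketch, with best-effort path discovery:

```shell
# Locate the system libstdc++ via ldconfig, falling back to the CentOS
# /lib64 path seen in the ldd output above, then list the newest GLIBCXX
# symbol versions it exports.
lib=$(ldconfig -p 2>/dev/null | awk '/libstdc\+\+\.so\.6 /{print $NF; exit}')
[ -n "$lib" ] || lib=/lib64/libstdc++.so.6
strings "$lib" 2>/dev/null | grep '^GLIBCXX_3\.4' | sort -uV | tail -n 5
```

If the output stops before GLIBCXX_3.4.20, that libstdc++ is too old for ansys.e.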

germa89 commented 9 months ago

from internal investigations made by @dts12263

  • the v21.2.0 container does NOT contain libansBLAS.so. Not sure how the test was previously passing
  • the native v212 install does have ansblas, so it seems it was just not packaged for whatever reason

@germa89 : If v21.2.0 did not have libansBLAS.so from the very beginning, why was it able to launch before, but not now? Same goes for v212.

FredAns commented 9 months ago

Hi @germa89, about the libansBlas.a file missing in the container: I'm a bit surprised, but if that's the case we should discuss with the MAPDL DevOps team to fix it. I've checked in my local dev distrib and it's part of the repo. In the same way, I was thinking the MAPDL distrib does not rely on gcc libs already existing on the machine, but provides its own gcc libs. That's also a question for DevOps. Perhaps we have done too many optimizations while making the smallest possible docker container for MAPDL...

germa89 commented 9 months ago

> Hi @germa89, about the libansBlas.a file missing in the container: I'm a bit surprised, but if that's the case we should discuss with the MAPDL DevOps team to fix it. I've checked in my local dev distrib and it's part of the repo. In the same way, I was thinking the MAPDL distrib does not rely on gcc libs already existing on the machine, but provides its own gcc libs. That's also a question for DevOps. Perhaps we have done too many optimizations while making the smallest possible docker container for MAPDL...

Hi @FredAns

The failing images were created in October 2021... and they had been working properly until the beginning of this November. No changes on our side.

Could it be the GitHub runners??

greschd commented 9 months ago

> The failing images were created in October 2021... and they had been working properly until the beginning of this November. No changes on our side.

Could it be loaded as part of a branch that only executes when specific hardware is present?

FredAns commented 9 months ago

Ok, I had a better look. In MAPDL, this libansBlas.a is just a wrapper to the math library we need to use on specific hardware. If you ldd this libansBlas.so library, it relies on the MKL (Intel processors) or BLIS (AMD processors) math kernel libraries. In my repo, I can see a blas/ directory with amd/libansBlas.so and intel/libansBlas.so.

At runtime we are supposed to pick the right one, depending on the machine we run on. Here are the dependencies of the Intel one:

(image: ldd output of the Intel libansBLAS.so showing its dependencies)

FredAns commented 9 months ago

on my machine these libansBlas.so are located here: /ansys_inc/v242/ansys/lib/linx64/blas/ Not sure if we have the same organization in the container

greschd commented 9 months ago

On the v221 container (I installed tree manually):

```output
[root@41421ddb9d79 /]# tree /ansys_inc/v221/ansys/lib/linx64/blas/
/ansys_inc/v221/ansys/lib/linx64/blas/
`-- intel
    `-- libansBLAS.so

1 directory, 1 file
```

Are we just missing the AMD variant?

greschd commented 9 months ago

@germa89 can you run a cat /proc/cpuinfo on the runners?

FredAns commented 9 months ago

@greschd good point @germa89 are we running on AMD platform ?

germa89 commented 9 months ago

Done!

Details

```output
##[debug]bash --noprofile --norc -e -o pipefail /__w/_temp/8f4c2a72-f352-446d-b539-8434950b88a0.sh
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7763 64-Core Processor
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 3243.425
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 4890.85
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7763 64-Core Processor
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 3243.611
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 4890.85
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7763 64-Core Processor
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 3242.860
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 4890.85
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7763 64-Core Processor
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 3243.990
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 4890.85
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:
```

They are AMD!!!
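As a side note, a shorter way to pull just the relevant fields instead of dumping the whole cpuinfo (assuming a Linux host with procfs mounted):

```shell
# Print only the CPU vendor and model; vendor_id reads AuthenticAMD on
# these runners and GenuineIntel on Intel hosts.
grep -m1 '^vendor_id' /proc/cpuinfo
grep -m1 '^model name' /proc/cpuinfo
```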

germa89 commented 9 months ago

So I guess the GitHub runners have moved from Intel to AMD? ... I could not find anything on the internet about that change.

Probably the missing libgomp dependency is also related to that?

greschd commented 9 months ago

To confirm this, you could try spinning up some cloud instance of each type (Intel / AMD), and try running the MAPDL docker image on both.

greschd commented 9 months ago

I would guess Github doesn't generally communicate which hardware Actions runs on, to avoid creating specific assumptions / expectations based on that.

germa89 commented 9 months ago

> To confirm this, you could try spinning up some cloud instance of each type (Intel / AMD), and try running the MAPDL docker image on both.

I haven't seen any option in github to choose AMD/Intel ... 🤷🏻‍♂️

germa89 commented 9 months ago

It seems from this article that the machines used to be Intel:

https://www.trendmicro.com/vinfo/us/security/news/cybercrime-and-digital-threats/github-action-runners-analyzing-the-environment-and-security-in-action#C05

It is not confirmation, but it is something

koubaa commented 9 months ago

> @germa89 This error usually means that the executable has been built with a newer version of GCC than the one available on the machine (or shipped near the executable). That's why the versions are not found: the executable "searches" for those ABI versions in the library but doesn't find them. This tends to confirm one of the following:
>
>   • ansys.e has been recompiled with a newer version of GCC recently
>   • The GCC redistributables (libstdc++.so.6, libgomp, ...) are no longer delivered with ansys.e
>   • The GCC installed on the docker machine has regressed since the last time it worked (4 is a bit old)
>
> By the versions searched, it seems that ansys.e has been built with GCC 8 or 10, and so can't use the libstdc++ from GCC 4.*.

@jomadec Not exactly: MAPDL does ship GCC 8, but not in the same location as the executable. The MAPDL executable always runs under a wrapper script that sets LD_LIBRARY_PATH to the location of the GCC runtime. This is what the landing zone concept by @jhdub23 is meant to solve.
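A minimal sketch of what such a wrapper does (the paths are illustrative, not the actual MAPDL launcher layout):

```shell
# Prepend a bundled GCC runtime directory to LD_LIBRARY_PATH so the dynamic
# linker resolves the newer libstdc++/libgomp before the system copies.
prepend_ld_path() {
  dir="$1"
  LD_LIBRARY_PATH="${dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
  export LD_LIBRARY_PATH
}

prepend_ld_path "/opt/ansys/gcc-runtime/lib64"   # hypothetical bundled runtime
# exec ./ansys.e "$@"   # the real wrapper would then exec the binary
echo "${LD_LIBRARY_PATH}"
```

This also explains why running `ldd ansys.e` directly (outside the wrapper) reports the CXXABI/GLIBCXX versions as missing even on installs where MAPDL launches fine.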

greschd commented 9 months ago

> I haven't seen any option in github to choose AMD/Intel

What I meant is to launch an AMD / Intel VM on <cloud provider of choice>, not through GitHub Actions. If the same error occurs when launching the MAPDL container there, we can be fairly confident this is the underlying change that triggered these failures.

Of course you can also use a local machine, if you have an AMD one.

germa89 commented 9 months ago

@dts12263 has been able to replicate the issue:

> confirmed the 212 image runs on an intel machine but crashed on an AMD machine because of not having the AMD ansblas

Thank you for your input @greschd @FredAns @koubaa @jomadec and @dts12263. We couldn't have figured this out without you!

FredAns commented 9 months ago

[like] Frederic Thevenon reacted to your message:



germa89 commented 8 months ago

If anyone is interested, I recently got this information from running codespaces:

Processor Model: AMD EPYC 7763 64-Core Processor

Funnily enough, my codespace was using only one physical processor.