Closed ikalash closed 2 years ago
@ikalash, I'm just noting things as I notice them. I should have started a review instead, sorry about that. In any case, there's no need to start changing things until I get things working for me on Anvil (and Compy). I'll let you know when I'm done reviewing and want you to revise the branch.
@ikalash, I'm a bit stuck because I can't get boost to build:
==> Error: ProcessError: Command exited with status 1:
'./bootstrap.sh' '--prefix=/lcrc/soft/climate/compass/anvil/spack/spack_for_mache_1.3.0/opt/spack/linux-centos7-sandybridge/intel-20.0.4/boost-1.77.0-nkejt6mxagtwnt4cqw7nczrc6bnseutw' '--with-toolset=intel-linux' '--with-libraries=test,filesystem,graph,thread,system,exception,locale,timer,log,random,iostreams,program_options,regex,wave,atomic,math,chrono,date_time,serialization' '--without-icu'
7 errors found in build log:
13
14
15 ###
16 ###
17
18 > icpc -x c++ -std=c++11 -O3 -s -static -DNDEBUG builtins.cpp class.
cpp command.cpp compile.cpp constants.cpp cwd.cpp debug.cpp debugger
.cpp execcmd.cpp execnt.cpp execunix.cpp filesys.cpp filent.cpp file
unix.cpp frames.cpp function.cpp glob.cpp hash.cpp hcache.cpp hdrmac
ro.cpp headers.cpp jam_strings.cpp jam.cpp jamgram.cpp lists.cpp mak
e.cpp make1.cpp md5.cpp mem.cpp modules.cpp native.cpp object.cpp op
tion.cpp output.cpp parse.cpp pathnt.cpp pathsys.cpp pathunix.cpp re
gexp.cpp rules.cpp scan.cpp search.cpp startup.cpp subst.cpp sysinfo
.cpp timestamp.cpp variable.cpp w32_getreg.cpp modules/order.cpp mod
ules/path.cpp modules/property-set.cpp modules/regex.cpp modules/seq
uence.cpp modules/set.cpp -o b2
>> 19 ld: cannot find -lstdc++
>> 20 ld: cannot find -lm
>> 21 ld: cannot find -lstdc++
>> 22 ld: cannot find -lc
>> 23 ld: cannot find -ldl
>> 24 ld: cannot find -lc
25 > cp b2 bjam
>> 26 cp: cannot stat 'b2': No such file or directory
27
28 Failed to build B2 build engine
From what I gather, this indicates that static standard libraries aren't available. I haven't figured out a way around this yet.
Hmmm. I did not touch boost. What happens if you try to build boost in develop (without my changes)? This may be something to take up with the spack developers.
@ikalash, the problem is definitely not with anything you've changed. Boost just doesn't build with the compilers as I have them set up on Anvil, and I just haven't tried before. I think there's maybe a path I need to include (I don't build "dirty") so I'll work on it. I just wondered if you'd ever had this problem.
@xylar ah ok, I see, thanks for clarifying. I have never had boost issues in the spack build. What compilers are you trying to use on Anvil?
BTW, I created a wiki on how to build Albany with spack: https://github.com/sandialabs/Albany/wiki/Building-Albany-using-SPACK . Note that Trilinos requires compilers that are newer than a certain version (it's documented in the wiki). If Anvil does not have one of those compilers, you probably want to use spack to install a supported compiler, and use that. Your Trilinos will not build if you are using too old of a compiler.
Building with gcc worked fine but I tried 2 different intel versions and boost fails the same way for both.
Those compiler versions required by Albany will be no problem. We always use newer versions than those. So I just need to get boost working...
Does it make sense to open an issue on the main spack page about the boost errors with certain compilers? I assume you’ll have those problems with the main spack repo as well.
@ikalash, maybe. I want to see if a couple more days of struggling on my own produce any progress on boost. I'll ask for help if not. My sense is that it's one or the other of a) needing static standard libraries or b) not having the location of the standard libraries in the modules I'm using. I'll also try other machines (I'm trying Chrysalis right now) to see if this is an Anvil problem. I'd be surprised if this is a common problem and no one has reported it before.
Do you need static libs because dynamic ones don’t run on anvil?
I don't think so. In the process of Googling the error message I'm seeing with boost, it seemed like similar errors tended to occur when builds were expecting static libraries, and that led me to believe boost itself might be expecting static libraries.
But it seems equally likely that the standard libraries aren't in the path set by the module that spack is using for the compilers, or that spack isn't using all the paths from the module (I've certainly faced this in other circumstances before) so it could be that I just need to set some environment variables to get this working right, and dynamic libraries will work fine.
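One way to tell these two possibilities apart might be to ask the underlying GCC where it would find the static archives named in the `ld` errors. This is only a rough sketch: it assumes a `gcc` on the PATH (whichever one the Intel module ends up using), and the helper names `resolved` and `find_static_libs` are made up for illustration. It relies on GCC's `-print-file-name` option, which echoes the bare file name back when the library is not on its search path.

```python
import subprocess


def resolved(lib_file, printed):
    """True if the compiler printed a real path rather than echoing the bare name."""
    printed = printed.strip()
    return printed != "" and printed != lib_file


def find_static_libs(compiler="gcc",
                     libs=("libstdc++.a", "libm.a", "libc.a", "libdl.a")):
    """Ask `compiler` where each static archive lives; None means it was not found.

    The default list mirrors the -lstdc++, -lm, -lc and -ldl failures in the log.
    """
    found = {}
    for lib in libs:
        out = subprocess.run(
            [compiler, f"-print-file-name={lib}"],
            capture_output=True, text=True, check=True,
        ).stdout
        found[lib] = out.strip() if resolved(lib, out) else None
    return found
```

Calling `find_static_libs()` with the same modules loaded that spack uses would show, per library, either a full path or `None`. All `None` would point toward missing static libraries or a missing search path rather than anything boost-specific.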
Anyway, don't worry about it for now. I'll keep working away at it and get back to you soon. Hopefully, this doesn't hold you up too long. I know you've already updated documentation with the assumption you can use the develop branch here so I don't want to delay longer than necessary.
A quick update: I'm seeing the exact same issue on Chrysalis with Intel, so it's at least not super specific to Anvil. I may try older versions of boost to see if it might be a recent issue there.
@xylar : I think I addressed all your comments / requests for changes.
I think I found more on the boost issue:
https://github.com/bfgroup/b2/pull/133
It looks like the problem was recently fixed, so I'll try adding a patch to pull in this change.
I'm making progress. Using https://github.com/E3SM-Project/spack/pull/3, I'm able to build boost with Intel everywhere I've tried so far.
I'm running into different problems building trilinos-for-albany on Chrysalis, Anvil and Badger. The Badger problem is an old CMake so I think I can solve that pretty easily.
The Anvil problem is:
CMake Error at packages/kokkos/cmake/kokkos_test_cxx_std.cmake:99 (MESSAGE):
C++14-compliant compiler detected, but unable to compile C++14 or later
program. Verify that Intel:19.1.3.20200925 is set up correctly (e.g.,
check that correct library headers are being used).
see /tmp/ac.xylar/spack-stage/spack-stage-trilinos-for-albany-develop-3bqfpthtni3s3xb3div2rtqjcf6jvcyy/spack-build-out.txt
The Chrysalis problem is even less helpful:
...
": internal error: ** The compiler has encountered an unexpected problem.
** Segmentation violation signal raised. **
Access violation or stack overflow. Please contact Intel Support for assistance.
icpc: error #10105: /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/intel-20.0.4-kodw73g/compilers_and_libraries_2020.4.304/linux/bin/intel64/mcpcom: core dumped
icpc: warning #10102: unknown signal(-544540656)
icpc: error #10106: Fatal error in /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/intel-20.0.4-kodw73g/compilers_and_libraries_2020.4.304/linux/bin/intel64/mcpcom, terminated by unknown
icpc: error #10014: problem during multi-file optimization compilation (code 1)
see /tmp/ac.xylar/spack-stage/spack-stage-trilinos-for-albany-develop-ipai3rbjpgycq6y5i7cybmajgaroiclb/spack-build-out.txt
@ikalash, I'm going to keep working on things but would very much appreciate any insight you might be able to give.
I noticed yesterday that there's a special sandybridge flag in spack, and that Anvil is sandybridge. I'll try rebuilding with that flag in case that helps on Anvil in particular.
That won't help with the problem on Chrysalis. My idea there would be trying to build in serial, just in case that helps.
I'm seeing the same error on Compy as on Anvil (about Intel 19 not being configured for C++14), and that machine is skylake, not sandybridge, so I don't think the architecture is the problem there.
I'm seeing the same error on Badger (on LANL's institutional computing) and on Cori-Haswell as on Chrysalis (about a "segmentation violation"). Those machines are also using Intel 19, so it's not clear to me why these different errors are emerging.
Thanks for working on this @xylar. It looks like Kokkos thinks there is something wrong with the Intel compiler on Anvil. I believe Intel 19 should be C++14 compliant. Is this a default compiler on the machine? Could you try a different Intel compiler? Since I don't have access to Chrysalis, could you please provide me with a full log file containing the error? It is hard to understand what is wrong from the output you posted, but it also sounds like the Intel compiler is messed up.
Speaking of which, I actually never tried building Albany using spack with an intel compiler. I could try this to see if I get similar errors. Let me know if you think this would be helpful.
Just to follow up: I got a boost error too on Cori with intel-19.1.3.304:
5de24f7a46bc083bd6df06
==> Applied patch /global/u2/i/ikalash/spack/var/spack/repos/builtin/packages/boost/bootstrap-path.patch
==> Applied patch https://github.com/bfgroup/b2/commit/23212066f0f20358db54568bb16b3fe1d76f88ce.patch
==> Ran patch() for boost
==> boost: Executing phase: 'install'
==> Error: ProcessError: Command exited with status 1:
'./b2' '--clean' '-j' '16' '--user-config=/global/cscratch1/sd/ikalash/spack-stage/spack-stage-boost-1.77.0-odeti2r7qq74gjd2sli5x7tvgdencl5x/spack-src/user-config.jam' 'variant=release' '--disable-icu' '-s' 'BZIP2_INCLUDE=/global/u2/i/ikalash/spack/opt/spack/cray-cnl7-haswell/intel-19.1.3.304/bzip2-1.0.8-x466bu6q7ktyagfnjw3zh5rbdxax4ep5/include' '-s' 'BZIP2_LIBPATH=/global/u2/i/ikalash/spack/opt/spack/cray-cnl7-haswell/intel-19.1.3.304/bzip2-1.0.8-x466bu6q7ktyagfnjw3zh5rbdxax4ep5/lib' '-s' 'ZLIB_INCLUDE=/global/u2/i/ikalash/spack/opt/spack/cray-cnl7-haswell/intel-19.1.3.304/zlib-1.2.11-5uo3fd62mcdefkurpuu6naofzdl24mbm/include' '-s' 'ZLIB_LIBPATH=/global/u2/i/ikalash/spack/opt/spack/cray-cnl7-haswell/intel-19.1.3.304/zlib-1.2.11-5uo3fd62mcdefkurpuu6naofzdl24mbm/lib' '-s' 'NO_LZMA=1' '-s' 'NO_ZSTD=1' 'link=static,shared' '--layout=system' 'toolset=intel-linux' 'cxxstd=98' 'visibility=hidden'
10 errors found in build log:
42 http://www.boost.org/more/getting_started/unix-variants.html
43
44 - B2 documentation:
45 http://www.boost.org/build/
46
47 ==> [2022-03-04-19:24:37.632837] './b2' '--clean' '-j' '16' '--user-config=/global/cscratch1/sd/ikalash/spack-stage/spack-st
age-boost-1.77.0-odeti2r7qq74gjd2sli5x7tvgdencl5x/spack-src/user-config.jam' 'variant=release' '--disable-icu' '-s' 'BZIP2_I
NCLUDE=/global/u2/i/ikalash/spack/opt/spack/cray-cnl7-haswell/intel-19.1.3.304/bzip2-1.0.8-x466bu6q7ktyagfnjw3zh5rbdxax4ep5/
include' '-s' 'BZIP2_LIBPATH=/global/u2/i/ikalash/spack/opt/spack/cray-cnl7-haswell/intel-19.1.3.304/bzip2-1.0.8-x466bu6q7kt
yagfnjw3zh5rbdxax4ep5/lib' '-s' 'ZLIB_INCLUDE=/global/u2/i/ikalash/spack/opt/spack/cray-cnl7-haswell/intel-19.1.3.304/zlib-1
.2.11-5uo3fd62mcdefkurpuu6naofzdl24mbm/include' '-s' 'ZLIB_LIBPATH=/global/u2/i/ikalash/spack/opt/spack/cray-cnl7-haswell/in
tel-19.1.3.304/zlib-1.2.11-5uo3fd62mcdefkurpuu6naofzdl24mbm/lib' '-s' 'NO_LZMA=1' '-s' 'NO_ZSTD=1' 'link=static,shared' '--l
ayout=system' 'toolset=intel-linux' 'cxxstd=98' 'visibility=hidden'
>> 48 /global/cscratch1/sd/ikalash/spack-stage/spack-stage-boost-1.77.0-odeti2r7qq74gjd2sli5x7tvgdencl5x/spack-src/tools/build/src
/tools/intel-linux.jam:96: in intel-linux.init
It looks like Kokkos thinks there is something wrong with the Intel compiler on Anvil. I believe Intel 19 should be C++14 compliant. Is this a default compiler on the machine? Could you try a different Intel compiler?
It's actually Intel 20 on Anvil. That's the default for E3SM. We have a strong preference for using the same compilers as E3SM wherever possible. I can try a different Intel version but I'm seeing the same error on Compy with Intel 19 so I have strong doubts that it's going to be different with a different compiler version. Intel is also always our preferred compiler. We will use Gnu as well but that's always the second choice because it isn't the E3SM production compiler.
Since I don't have access to chrysalis, could you please provide me with a full log file containing the error? It is hard to understand what is wrong from the output you posted, but it also sounds like the intel compiler is messed up.
I saw the same errors on Cori-Haswell and LANL's badger. I'll try to point you to the log on Cori-Haswell.
Speaking of which, I actually never tried building Albany using spack with an intel compiler. I could try this to see if I get similar errors. Let me know if you think this would be helpful.
Yeah, I think this is necessary. Like I said, Intel is our workhorse.
Just to follow up: I got a boost error too on Cori with intel-19.1.3.304:
Could you rebase this branch onto the latest E3SM-Project/spack/develop? I was able to build boost just fine with that branch. (In the process of rebasing, you could take out the commits that modify the trilinos package and then revert if you like.)
On Anvil, I tried building a version of trilinos from spack/spack with flags as close as I could get to the ones you use for trilinos-for-albany. That turned out to mean leaving out the pnetcdf flag. The build failed, but in a pretty different place:
197 In file included from mat.c(32):
>> 198 safe-math.h(616): error: argument is incompatible with formal param
eter
199 PSNIP_SAFE_DEFINE_BUILTIN_BINARY_OP(char, char, add)
200 ^
and a bunch more like this.
See /tmp/ac.xylar/spack-stage/spack-stage-matio-1.5.17-pyanqousfuyucrgybvazsuojkr3rzafn/spack-build-out.txt.
Here's the log from Chrysalis. spack-build-out.txt
Yes, I could do the rebase, but can’t we just merge this branch into develop? I’ve fixed the trilinos package.py file already, per your comments.
We can merge it, sure. I would in theory prefer to make sure it's working as expected but we can follow up.
Thanks for posting the Chrysalis output. I actually have seen that error before. I believe it's a compiler error, see e.g., https://community.intel.com/t5/Intel-C-Compiler/Compiler-Error-quot-Segmentation-violation-signal-raised-quot/td-p/1075456 .
@ikalash, thanks for that link. This seems like a really tough one to work on, since it happens on 3 of the 5 machines we want to support and each is using a different compiler version as far as I can tell. We also have somewhat limited options to push E3SM to update its compiler versions because of Kokkos problems they're seemingly not seeing in their own builds. Still, if we can show that a different Intel version works where this one doesn't that would be helpful.
@ikalash, it seems like trilinos might be installing its own Kokkos, rather than using the kokkos spack package. At least I don't see it calling depends_on() with kokkos anywhere. Do you know how that works?
What I'm trying to figure out is if I can just install Kokkos configured as Trilinos needs it to make sure that much works with Intel.
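A quick way to check for a `depends_on()` call without reading the whole recipe could be to parse the `package.py` and list its `depends_on(...)` arguments. The sketch below is illustrative only: the `EXAMPLE` source is a hypothetical, heavily trimmed stand-in (not the real Trilinos recipe), the helper names are invented here, and the name matching is a rough approximation of Spack's spec syntax.

```python
import ast
import re


def declared_dependencies(package_source):
    """Return the first string argument of every depends_on(...) call in the source."""
    deps = []
    for node in ast.walk(ast.parse(package_source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "depends_on"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            deps.append(node.args[0].value)
    return deps


def names_package(deps, name):
    """True if any spec string starts with the given package name (approximate match)."""
    for spec in deps:
        match = re.match(r"[\w-]+", spec.strip())
        if match and match.group(0) == name:
            return True
    return False


# Hypothetical stand-in for a package.py; the real recipe is much longer.
EXAMPLE = '''
class TrilinosForAlbany(CMakePackage):
    depends_on("boost")
    depends_on("netcdf-c")
    depends_on("mpi", when="+mpi")
'''

deps = declared_dependencies(EXAMPLE)
print(deps, names_package(deps, "kokkos"))
```

Reading the actual recipe file and passing its text to `declared_dependencies` would show whether `kokkos` is declared as a dependency at all. Since the file is only parsed, not executed, this works without importing Spack.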
Right, the trilinos build would be using its own kokkos. Kokkos is included as a part of trilinos, and trilinos expects a certain directory structure, so I’m not sure how to tell it to use a different kokkos. Manually, to use a different kokkos, I would go to trilinos/packages, delete the kokkos that is there and put a different kokkos in place of it.
I take it you haven’t installed MALI on the machines with the buggy intel compiler before? That error would show up when building trilinos regardless of whether you used spack to build it or not. The beauty of spack is you can build a different compiler and use it, but I understand that e3sm has supported compilers which you have to use.
I really doubt Intel is buggy on all 5 machines we want to support. Cori-Haswell, for example, is one where we've installed Trilinos in the past though maybe only with gnu, not intel.
I think it's problematic that Trilinos uses its own Kokkos. This seems to counteract one of the main advantages of Spack, keeping packages consistent across an environment. Such so-called "vendoring" of third-party packages is a real headache for package management in general. We need to figure out if there's a way to avoid this or we won't be able to make progress on this.
I can ask the Kokkos team if there is a workaround to get Kokkos to work in the way you are suggesting. Regarding intel: I am not saying all versions of intel are buggy. I have several nightly intel builds on various machines that use Intel. The versions there are 19.0.5 and 18.1.163. I think some of the issues you were seeing had to do with boost, not the compiler error, right? On Cori, I got the boost error when I tried building Albany with an intel compiler.
No, I got past the boost problem on all 5 machines. I am seeing one or the other of the Intel Kokkos problems on all 5 machines, see above, which is what makes me skeptical that either is an Intel (as opposed to a Kokkos) bug.
I see. If you think there is a Kokkos bug, I suggest you open an issue under Kokkos: https://github.com/kokkos/kokkos/issues . They are usually very responsive.
So that's the problem. Because Trilinos builds Kokkos, I'm having trouble isolating the problem enough to make a bug report. As it stands, I would need to make a report on our build of Kokkos via Trilinos for Albany that differs from the main Spack build and presumably also the SNL build of Trilinos. This is a pretty tough situation to hope for help with.
I really need a spack (or other) build of Kokkos that reproduces the problem to make a bug report. This is a pretty frustrating process...
Would you like to meet for a brief chat this coming week, to try to see if we can figure out the best way to try to make progress? It'll be easier to discuss what you've tried and ideas over Teams than in a github issue, I think.
I can certainly try to help pin-point the problem, but there are a lot of different things we discussed above, and it's starting to become difficult to keep track of what issue is present on what system... I think you could clarify all of that to me over Teams pretty fast.
Sure, we can meet, but it's not going to work out for this week. I think it's going to be better to try to work as best we can on GitHub, Slack, etc.
The situation is still as in https://github.com/E3SM-Project/spack/pull/1#issuecomment-1059339659. On Anvil and Compy, I get c++14 errors that just don't make sense to me. On the rest (Chrysalis, Badger and Cori-Haswell), I get the segmentation violation error. I need a way to make a simple reproducer before I can make a bug report.
@ikalash, could you try to help me figure out what the equivalent Kokkos build with Spack would be to what trilinos-for-albany is doing? That could be the reproducer I need.
You mean, you want to build a stand-alone Kokkos that is like the one built as part of trilinos-for-albany?
BTW, check this out: https://github.com/kokkos/kokkos/issues/4475 . The icpc error was a Kokkos issue some time back.
You mean, you want to build a stand-alone Kokkos that is like the one built as part of trilinos-for-albany?
Yes, that's it exactly.
BTW, check this out: https://github.com/kokkos/kokkos/issues/4475 . The icpc error was a Kokkos issue some time back.
I looked at that issue but the error messages seem different from either of the problems I'm seeing. At least, I didn't see the same error messages listed as I posted above. Maybe I missed a similarity that's in the larger log file (24 MB!)
One piece of good news, I was able to build on Anvil with Gnu compilers without any trouble.
I think to reproduce the Kokkos error in stand-alone Kokkos, one just needs to figure out what version of Kokkos is in the Trilinos that is being used by spack. I can do that. That's of course provided that the error shows up in stand-alone Kokkos and not in some Kokkos-Trilinos interactions.
Since there was a lot of discussion above and I sort of lost track of what problems have / have not been resolved, is this the error you are encountering in Kokkos:
CMake Error at packages/kokkos/cmake/kokkos_test_cxx_std.cmake:99 (MESSAGE):
C++14-compliant compiler detected, but unable to compile C++14 or later
program. Verify that Intel:19.1.3.20200925 is set up correctly (e.g.,
check that correct library headers are being used).
Or is it that icpc one?
@ikalash, thanks for your patience on this. As I'm sure was obvious, I've been feeling frustrated with this process. I thought I was prepared for a potentially rocky road but I was caught by surprise by how opaque trilinos is for me.
Regarding Kokkos, it seems that Trilinos is vendoring v3.5.0, the latest release, see https://github.com/trilinos/Trilinos/pull/9958. There were no more recent commits in the packages/Kokkos directory: https://github.com/trilinos/Trilinos/tree/master/packages/kokkos
I think I now also understand better that Trilinos is, by design, a collection of packages -- its own mini package manager in a way. So vendoring packages is what it's designed to do. I still think this has the potential to be problematic if the same package (but in a different version) is needed by other software, e.g. in E3SM. Indeed, it seems like we've run into problems related to multiple installations of Kokkos in E3SM already. Any solution to these problems is likely to require a single Kokkos installation that everyone shares.
You are right, the only error I can clearly tie to Kokkos is the "unable to compile C++14 or later program" one. The other error, "** Segmentation violation signal raised. **", occurs when building these two files:
stk/stk_unit_test_utils/stk_unit_test_utils/stk_mesh_fixtures/degenerate_mesh.cpp
stk/stk_unit_test_utils/stk_unit_test_utils/stk_mesh_fixtures/WedgeFixture.cpp
Before we open a Trilinos issue, I do have one idea. If the problem is just building STK unit tests, we can turn them off in the trilinos-for-albany build. Let me try to do this now.
@ikalash, I was able to reproduce the C++14 problem by just installing Kokkos v3.5.0 from Spack. I was then able to reproduce it with just the test program that Kokkos is trying to run. In that process, I discovered that I need to be loading a gcc module for Intel to work properly, something I was not fully aware of. So that problem seems to lie in me not loading the required modules.
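For anyone wanting to check a machine quickly, a stand-alone probe along these lines might help. It is a minimal sketch, not the program Kokkos itself runs: the C++ source is a hypothetical stand-in that uses one C++14 language feature and one C++14 library feature, and it assumes the compiler (here `icpc`) is on the PATH and accepts `-std=c++14`. As far as I understand, the classic Intel compilers take their C++ standard library headers from whichever GCC is visible, so running this with and without the gcc module loaded should show the difference.

```python
import os
import shutil
import subprocess
import tempfile

# Hypothetical stand-in for a C++14 check; not the test program shipped with Kokkos.
CXX14_SOURCE = r"""
#include <memory>
int main() {
    auto add = [](auto a, auto b) { return a + b; };  // generic lambda (C++14)
    auto p = std::make_unique<int>(add(1, 2));        // std::make_unique (C++14 library)
    return *p == 3 ? 0 : 1;
}
"""


def can_compile_cxx14(compiler="icpc"):
    """Try to compile and link a tiny C++14 program; return (ok, compiler messages)."""
    if shutil.which(compiler) is None:
        return False, f"{compiler} not found on PATH"
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.cpp")
        exe = os.path.join(tmp, "probe")
        with open(src, "w") as f:
            f.write(CXX14_SOURCE)
        result = subprocess.run(
            [compiler, "-std=c++14", src, "-o", exe],
            capture_output=True, text=True,
        )
        return result.returncode == 0, result.stderr
```

If `can_compile_cxx14("icpc")` fails in a bare environment but succeeds after loading a recent gcc module, that would be consistent with the missing-module explanation above.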
The other problem (the compiler crashes) may be a later manifestation of the same (or it may be yet another issue). I tried building in serial (adding parallel = False to the trilinos-for-albany package) and that didn't help.
Anyway, it looks like I'm making some headway.
Okay, I'm now seeing the same error on Anvil as on other machines (I'm still rerunning on Compy so I don't know what the situation there is yet):
47951 from /tmp/ac.xylar/spack-stage/spack-stage-tri
linos-for-albany-develop-3bqfpthtni3s3xb3div2rtqjcf6jvcyy/spack
-src/packages/stk/stk_unit_test_utils/stk_unit_test_utils/Gener
ateALefRAMesh.cpp(35):
47952 /tmp/ac.xylar/spack-stage/spack-stage-trilinos-for-albany-devel
op-3bqfpthtni3s3xb3div2rtqjcf6jvcyy/spack-src/packages/kokkos/c
ore/src/Kokkos_Concepts.hpp(376): warning #2651: attribute does
not apply to any entity
47953 using host_mirror_space KOKKOS_DEPRECATED = std::conditiona
l_t<
47954 ^
47955
47956
>> 47957 ": internal error: ** The compiler has encountered an unexpecte
d problem.
47958 ** Segmentation violation signal raised. **
47959 Access violation or stack overflow. Please contact Intel Suppor
t for assistance.
47960
>> 47961 icpc: error #10014: problem during multi-file optimization comp
ilation (code 4)
>> 47962 make[2]: *** [packages/stk/stk_balance/stk_balance/stk_balance.
exe] Error 4
47963 make[2]: Leaving directory `/scratch/ac.xylar/spack-stage/spack
-stage-trilinos-for-albany-develop-3bqfpthtni3s3xb3div2rtqjcf6j
vcyy/spack-build-3bqfpth'
>> 47964 make[1]: *** [packages/stk/stk_balance/stk_balance/CMakeFiles/s
tk_balance.dir/all] Error 2
Attached is the full log file as well as a reproducer for Anvil (change the spack_path to somewhere you have write permission and then run the build* script with the dev*.yaml in the same directory).
spack-build-out.txt
anvil_reproducer.zip
@ikalash, let me know if you have time to check this out, and if so if you have any thoughts or suggestions for what to try. I can make a Trilinos issue if you don't have any more ideas.
The idea is that this code will replace the currently-supported SNLComputation/spack fork, when it comes to building Albany. Albany can be built with MPAS-Interface code enabled, for building MPAS on top of Albany to obtain MALI. Note that once this PR is merged in, there will be a special Trilinos spackage, known as trilinos-for-albany, which will be used when building Albany. This has different configure options than the main trilinos spackage.