JeffersonLab / HDGeant4

Geant4 simulation for the GlueX experiment

Crash adding daughters on RHEL/CentOS 8 #195

Closed markito3 closed 2 years ago

markito3 commented 3 years ago

I first heard of this from @lihaoahil via @nsjarvis on September 3:

Hi Mark,

Hao found that hdgeant4 does not run on CentOS 8. He tried both available version sets, 4.42.1 and 4.45.1.

I tried out your test files in u/scratch/marki/hdg4t using version set 4.45.1 on jlabl5 and saw an error similar to Hao's. His message is forwarded below. He mentioned albert and red queue first; those are running RHEL7.

Naomi.

---------- Forwarded message --------- From: Hao Li hl2@andrew.cmu.edu Date: Fri, Sep 3, 2021 at 5:30 AM Subject: hdgeant4 crashed on ernest To: Naomi Jarvis nsj@cmu.edu

Hi Naomi, So firstly 4.45.1 works perfectly fine on albert and red queue.

Then I tested hdgeant4 with both version sets 4.42.1 and 4.45.1 on ernest's interactive node and found it crashed. I have a folder at ~haoli/ernest with the env file and the input and config files for hdgeant4 (control.in, input.hddm and run.mac) in case you'd like to reproduce the crash.

Cheers, Hao

BTW, if it makes any sense to you, the crash message looks something like this:


```
Geant4 version Name: geant4-10-02-patch-02 [MT] (17-June-2016)
<< in Multi-threaded mode >>
Copyright : Geant4 Collaboration
Reference : NIM A 506 (2003), 250-303
WWW : http://cern.ch/geant4

JANA >>Created JCalibration object of type: JCalibrationCCDB
JANA >>Generated via: JCalibration using CCDB for MySQL and SQLite databases
JANA >>Run:30274
JANA >>URL: mysql://ccdb_user@hallddb.jlab.org/ccdb
JANA >>context: variation=mc
JANA >>comment: Variation for simulations with data conditions
```

There was a crash. This is the entire stack trace of all threads:

```
#0  0x00007fbccdbf6aab in waitpid () from /lib64/libc.so.6
#1  0x00007fbccdb724af in do_system () from /lib64/libc.so.6
#2  0x00007fbcd6dd8af7 in TUnixSystem::Exec (shellcmd=, this=0x238eef0) at /home/gluex2/gluex_top8/root/root-6.08.06/core/unix/src/TUnixSystem.cxx:2118
#3  TUnixSystem::StackTrace (this=0x238eef0) at /home/gluex2/gluex_top8/root/root-6.08.06/core/unix/src/TUnixSystem.cxx:2405
#4  0x00007fbcd6ddab24 in TUnixSystem::DispatchSignals (this=0x238eef0, sig=kSigSegmentationViolation) at /home/gluex2/gluex_top8/root/root-6.08.06/core/unix/src/TUnixSystem.cxx:3625
#5  <signal handler called>
#6  0x0000000000000000 in ?? ()
#7  0x00007fbce0c6ff82 in G4LogicalVolume::AddDaughter (this=0x28c25f0, pNewDaughter=) at /home/gluex2/gluex_top8/geant4/geant4.10.02.p02/include/Geant4/G4LogicalVolume.icc:165
#8  0x00007fbcd9ea953f in G4PVPlacement::G4PVPlacement (this=0x29065e0, pRot=, tlate=..., pCurrentLogical=0x281fd20, pName=..., pMotherLogical=0x28c25f0, pMany=false, pCopyNo=1, pSurfChk=false) at /home/gluex2/gluex_top8/geant4/geant4.10.02.p02/source/geometry/volumes/src/G4PVPlacement.cc:117
#9  0x00007fbce0ddf7e3 in HddsG4Builder::createVolume (this=0x26d1340, el=, ref=...) at /home/gluex2/gluex_top8/geant4/geant4.10.02.p02/include/Geant4/G4LogicalVolume.icc:61
#10 0x00007fbcd29867ce in CodeWriter::createVolume (this=this@entry=0x26d1340, el=el@entry=0x32032e8, ref=...) at /home/gluex2/gluex_top8/hdds/hdds-4.14.0/hddsCommon.cpp:1337
#11 0x00007fbce0ddf45f in HddsG4Builder::createVolume (this=0x26d1340, el=0x32032e8, ref=...) at src/HddsG4Builder.cc:562
#12 0x00007fbcd2986dca in CodeWriter::createVolume (this=this@entry=0x26d1340, el=el@entry=0x3201788, ref=...) at /home/gluex2/gluex_top8/hdds/hdds-4.14.0/hddsCommon.cpp:1431
#13 0x00007fbce0ddf45f in HddsG4Builder::createVolume (this=0x26d1340, el=0x3201788, ref=...) at src/HddsG4Builder.cc:562
#14 0x00007fbcd2986dca in CodeWriter::createVolume (this=this@entry=0x26d1340, el=el@entry=0x35fc758, ref=...) at /home/gluex2/gluex_top8/hdds/hdds-4.14.0/hddsCommon.cpp:1431
#15 0x00007fbce0ddf45f in HddsG4Builder::createVolume (this=0x26d1340, el=0x35fc758, ref=...) at src/HddsG4Builder.cc:562
#16 0x00007fbcd29846c4 in CodeWriter::translate (this=this@entry=0x26d1340, topel=topel@entry=0x35fc758) at /home/gluex2/gluex_top8/hdds/hdds-4.14.0/hddsCommon.cpp:2228
#17 0x00007fbce0dd62fa in HddsG4Builder::translate (this=this@entry=0x26d1340, topel=topel@entry=0x35fc758) at src/HddsG4Builder.cc:1401
#18 0x00007fbce0c6ab84 in GlueXDetectorConstruction::GlueXDetectorConstruction (this=0x26d1300, hddsFile=...) at src/GlueXDetectorConstruction.cc:179
#19 0x00007fbce015cabf in main (argc=1, argv=0x7ffcff608718) at /usr/include/c++/8/bits/allocator.h:139
#20 0x00007fbccdb51493 in __libc_start_main () from /lib64/libc.so.6
#21 0x000000000073bf1e in _start ()
```

The lines below might hint at the cause of the crash. You may get help by asking at the ROOT forum http://root.cern.ch/forum. Only if you are really convinced it is a bug in ROOT then please submit a report at http://root.cern.ch/bugs. Please post the ENTIRE stack trace from above as an attachment in addition to anything else that might help us fixing this issue.

(ROOT then repeats frames 6 through 21 of the trace above verbatim.)

27.338u 2.113s 0:40.09 73.4% 0+0k 0+40io 0pf+0w

markito3 commented 3 years ago

I see the same error in a CentOS 8 Singularity container and on jlabl5, a RHEL 8 node. Both run GCC 8.4.1. I see it on both version sets 4.44.0 and 4.45.1.

I do not see the error with 4.45.1 on the ifarm (CentOS 7). My suspicion is that it has something to do with the version of GCC we use, i.e., 8.4.1 breaks something in the code or in the build procedure.

rjones30 commented 3 years ago

I have managed to reproduce the problem on my CentOS 7 server at UConn by doing the following:

  1. select the gcc 9.4 compiler using "scl enable devtoolset-9 bash"
  2. set up the environment to compile/link against G4.10.02p02.
  3. build hdgeant4 from scratch using "make clean && make"
  4. test the build with "cd test; hdgeant4" -- it crashes

Change anything in the above recipe and it no longer crashes:

  1. turn off G4MULTITHREADED in GNUmakefile, and build with -O0 and debug symbols -- no more crashes
  2. do not change GNUmakefile, but move forward from G4.10.02p02 to G4.10.04p02 or G4.10.06p01 -- no more crashes

Is this an artifact of using G4MULTITHREADED with the old G4.10.02p02 release? Maybe.

rjones30 commented 3 years ago

Changing nothing in the code, but simply switching from MT (multi-threaded build of G4 libs) to non-MT (without the MT code features) in G4.10.02 makes the problem disappear. Meanwhile, the problem with the MT build goes away when you move forward from G4.10.02 to G4.10.04p02 or G4.10.06p01.

I propose that this is a defect in the MT functionality in the G4.10.02p02 release of the G4 library. Remember that MT functionality was new in G4.10, and early releases of G4.10 still had bugs in the MT code. Can we just leave G4.10.02 behind?

markito3 commented 3 years ago

fine with me :-)

rjones30 commented 3 years ago

Actually, the disappearance of the crash in my build may just be due to the fact that the G4 developers moved G4LogicalVolume::AddDaughter from the inline G4LogicalVolume.icc source file to the G4LogicalVolume.cc implementation file. If the issue is related to the g++ compiler version, this has the side effect of hiding it: the AddDaughter in G4.10.04 and later was compiled into the library by the pre-g++ 8.4 compiler in my specific test environment, rather than being recompiled along with the HDGeant4 code. To know for sure, I need to run my test against a build of G4 that was compiled with the post-g++ 8.4 compiler, as well as the HDGeant4 code.

nsjarvis commented 3 years ago

Initial tests with G4.10.4.p02 look good. Thanks to Mark Ito for the build_scripts help. Waiting for Hao to confirm.

markito3 commented 2 years ago

Tried the b1pi test with hdgeant4 and G4 10.04 on Fedora 34. No problems. This is the first time I have seen it work on a Fedora version greater than 30. Looks like leaving G4 10.02 in the rear-view might be the answer here. For reference, here is the error on Fedora 33, back in March.

nsjarvis commented 2 years ago

@lihaoahil for when you get a moment.

lihaoahil commented 2 years ago

Sorry guys I forgot to close this issue here. All good now.