JeffersonLab / analyzer

HallA C++ Analyzer
BSD 3-Clause "New" or "Revised" License
SCons build with ROOT 6 causes interpreter segfaults #141

Closed hansenjo closed 6 years ago

hansenjo commented 6 years ago

This is a nasty one, I think.

If, and apparently only if, I build the analyzer with scons (v2.5.1 from EPEL) and then issue some C++11 commands from the interpreter, I frequently (but not always!) get segfaults. Example:

  ************************************************
  *                                              *
  *            W E L C O M E  to  the            *
  *       H A L L A   C++  A N A L Y Z E R       *
  *                                              *
  *  Release      1.6.0-beta3        Sep 20 2017 *
  *  Based on ROOT  6.10/04          Jul 28 2017 *
  *                                              *
  *            For information visit             *
  *        http://hallaweb.jlab.org/podd/        *
  *                                              *
  ************************************************
analyzer [0] vector<int> vi { 1,2,4,5,6,9,-10,-20 }
(std::vector<int> &) { 1, 2, 4, 5, 6, 9, -10, -20 }
analyzer [1] for( auto& i : vi ) cout << i << endl;

 *** Break *** segmentation violation

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007fbf19abddbc in __libc_waitpid (pid=11594, stat_loc=stat_loc@entry=0x7fff734c7f60, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:31
#1  0x00007fbf19a40cc2 in do_system (line=<optimized out>) at ../sysdeps/posix/system.c:148
#2  0x00007fbf1d7a47df in TUnixSystem::StackTrace (this=0x6298e0) at /opt/ROOT/root-6.10.04/core/unix/src/TUnixSystem.cxx:2412
#3  0x00007fbf1d7a6f2c in TUnixSystem::DispatchSignals (this=0x6298e0, sig=kSigSegmentationViolation) at /opt/ROOT/root-6.10.04/core/unix/src/TUnixSystem.cxx:3643
#4  <signal handler called>
#5  0x00007fbf1a58d183 in std::ostream::operator<< (this=0x7fbf1a7fb700 <std::cout>, __n=1) at /usr/src/debug/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/ostream.tcc:110
#6  0x00007fbf1e4f60a7 in ?? ()
#7  0x00007fff734ca6e8 in ?? ()
#8  0x0000000001b12f60 in ?? ()
#9  0x0000000001b12f80 in ?? ()
#10 0x00007fff734caab0 in ?? ()
#11 0x0000000000000000 in ?? ()
===========================================================

The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum.
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x00007fbf1a58d183 in std::ostream::operator<< (this=0x7fbf1a7fb700 <std::cout>, __n=1) at /usr/src/debug/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/ostream.tcc:110
#6  0x00007fbf1e4f60a7 in ?? ()
#7  0x00007fff734ca6e8 in ?? ()
#8  0x0000000001b12f60 in ?? ()
#9  0x0000000001b12f80 in ?? ()
#10 0x00007fff734caab0 in ?? ()
#11 0x0000000000000000 in ?? ()
===========================================================

Root > 

Here's where it gets nasty:

- It isn't 100% reproducible. You may have to try several times (start analyzer, issue interactive commands, exit and restart if it doesn't crash).
- I am unable to reproduce this crash with the scons build when running under gdb. Under the debugger, it just seems to work.
- I have never been able to trigger this crash with a make build of the analyzer.
- hcana's SCons build seems unaffected as well.
- The crash does not occur on macOS when building with either scons or make. So far, I have only seen it on RHEL7 and CentOS7.
- I have tried both the ROOT version from EPEL (currently 6.10/02) and a self-built ROOT 6.10/04 installation. I am using the standard compiler there: g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16).
- It happens on several different machines, including the VirtualBox image we made for the analysis workshop this summer.

I have already tried a number of variations on the compiler flags used by SCons, but so far nothing has made a difference. In particular, I have prevented -rdynamic from being parsed into the CXXFLAGS and used it only as a link flag, as the make build does. I've also reordered linker flags and manually re-linked libHall.so, libdc.so and the main executable. At this point, I'm stumped.

This problem was already present in June before the analysis workshop, so it is not due to a recent change.

brash99 commented 6 years ago

Not sure that I will be able to solve this quickly, or at all. But I will try to look into it further at least.

Quick question: what does one need to install (from EPEL) for root on RHEL7 systems?

yum install root

installs a lot of things, but I don’t see a “thisroot.sh” for setup, for example.

Best, E.

Dr. Edward J. Brash

Professor of Physics - Christopher Newport University
Staff Scientist - Thomas Jefferson National Accelerator Facility
Honorary Senior Research Fellow - University of Glasgow
Office: 757-594-7451 Mobile: 757-753-2831 FAX: 757-594-7919

hansenjo commented 6 years ago

Hi Ed,

I think "yum install root" should be all. No need to run thisroot.sh because the EPEL version of ROOT is installed in system directories like /usr/lib64, /usr/include/root etc. which are already in the various PATHs. There is no top-level ROOTSYS directory in that case. The output of "root-config" reflects that.

BTW, the problem also occurs with a self-compiled version of ROOT that IS installed under a top-level ROOTSYS. It looks like this is not a problem specific to the ROOT installation. So if you already have a non-EPEL version of ROOT set up, you could use that.

Ole

brash99 commented 6 years ago

Well, I am able now to reproduce the segfault … both with the EPEL version of ROOT, and with a previous local ROOT installation (6.06/08).

Sigh … I also see the same sort of very intermittent behavior … sometimes it works fine, and sometimes it segfaults.

Best, E.

hansenjo commented 6 years ago

Well, that's good news. Being able to reproduce a bug is like the alcoholic admitting that he/she's got a problem ... the first step towards recovery ;)

Hopefully you'll be able to track it down. I've been pulling my hair out over this.

Ole

hansenjo commented 6 years ago

I am noticing another, possibly related difference between the Make-compiled and the SCons-compiled versions of the analyzer. If I log into a machine without X forwarding and then start the analyzer version compiled with make, I always get the familiar warning about DISPLAY not set:

[ole@archie analyzer]$ ./analyzer -v
Warning in <UnknownClass::SetDisplay>: DISPLAY not set, setting it to 192.168.88.2:0.0
Podd 1.6.0-beta3 Linux-4.12.13-1-ARCH-x86_64 git @1bc2030 ROOT 6.10/04

Perhaps worth noting, I even get this warning when running with the (new) -v flag, which does not even create a THaInterface, but just runs a few cout commands in main() before exiting.

Now, if I do the same with the version compiled with SCons, the DISPLAY warning is never shown, even when starting a session where it normally would appear. No error appears when trying to open windows from such a session, for example

analyzer [0] auto b = new TBrowser

and no window appears anywhere.

It seems like the make-compiled version initializes some ROOT component that includes DISPLAY handling, while the SCons-compiled version doesn't. I am not sure if this is related to the interactive interpreter crashes, but it's certainly another indication of a significant difference between the build systems, and fixing one could fix the other as well.

brash99 commented 6 years ago

Hi Ole,

I had noticed this as well. It may be related to the fact that the Makefile takes the ROOT libraries to link against from ‘root-config --glibs’, whereas the SConstruct uses ‘root-config --libs’. As a result, the make build also links the analyzer against libGui.so (-lGui). I updated the SConstruct so that the ROOT library list is now the same as for the Makefile. Unfortunately, that did not fix the problem. Still, it is good to find and fix these differences anyway.
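The difference between the two library lists can be checked directly. The following is a minimal sketch, assuming `root-config` is on the PATH; the helper names `root_config` and `extra_link_flags` are made up for illustration and are not part of the analyzer or of ROOT.

```python
import shlex
import subprocess


def root_config(flag):
    """Return the output of `root-config <flag>` (requires ROOT on the PATH)."""
    return subprocess.run(
        ["root-config", flag], check=True, capture_output=True, text=True
    ).stdout


def extra_link_flags(libs, glibs):
    """Return flags present in the --glibs output but not in the --libs output."""
    base = set(shlex.split(libs))
    return [flag for flag in shlex.split(glibs) if flag not in base]
```

Calling `extra_link_flags(root_config("--libs"), root_config("--glibs"))` on the affected machine should list the additional link flags that the make build picks up. According to the comment above, -lGui is among them.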

I also found another “bug” in the SCons configure scripts. I was apparently confused about what the flags NDEBUG and WITH_DEBUG actually mean. I had thought (as the names might suggest) that one would pass one of these when compiling in debug mode, and the other when not. But now that I look into it, that is not what they mean. I see that in the Makefile, the standard is to pass both of these. I have updated the SCons configure scripts to do things as the Makefile does in this respect.

The effect of this change is that now the .o (which make creates) and .os files (which SCons creates) in the src/ directory are literally all identical to one another. For the hana_decode directory, this is not quite true, because make actually goes into that directory to do the compilation of the source files, whereas SCons does it from the main directory. This causes the .os and .o object files to be different from one another in a binary sense. But, I have verified by using the command line that the make and SCons compilation commands do produce identical object files to one another for the hana_decode directory as well.
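A byte-level comparison like the one described can be scripted. This is a rough sketch, assuming the make objects end in .o and the SCons objects in .os, as stated above; the function names are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file's full contents in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_objects(make_dir, scons_dir):
    """Return base names whose make .o and SCons .os files differ or lack a partner."""
    make_objs = {p.stem: p for p in Path(make_dir).glob("*.o")}
    scons_objs = {p.stem: p for p in Path(scons_dir).glob("*.os")}
    differing = []
    for name in sorted(make_objs.keys() | scons_objs.keys()):
        a, b = make_objs.get(name), scons_objs.get(name)
        if a is None or b is None or sha256_of(a) != sha256_of(b):
            differing.append(name)
    return differing
```

Both arguments may point to the same directory, for example `differing_objects("src", "src")`. Objects compiled from different working directories can differ in their bytes even with identical flags, which is consistent with the hana_decode observation above.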

With that said, this change did not fix the problem either. Sigh …

I’m continuing to look at it … just going through things systematically and eliminating possibilities at this point.

Cheers, E.

brash99 commented 6 years ago

I think that I have found and fixed the problem!

In the SCons build, the -fPIC flag was NOT being included when compiling src/main.C, whereas in the make build it was/is. The -fPIC flag is set in different ways in the two build systems, and the way that it was being done in SCons resulted in this inconsistency. The fix was pretty simple (just a small change to the configuration files).
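An inconsistency of this kind can be spotted by scanning the compile commands that the build prints. Below is a minimal sketch, assuming the build output has been captured as a list of command lines; the function name and the list of source suffixes are assumptions made for illustration, and this is not the actual change that went into the configuration files.

```python
import shlex

# Suffixes treated as source files; an assumption for this sketch.
SOURCE_SUFFIXES = (".C", ".cxx", ".cpp", ".cc", ".c")


def sources_missing_flag(log_lines, flag="-fPIC"):
    """Return source files from compile commands (those with -c) that lack `flag`."""
    missing = []
    for line in log_lines:
        try:
            tokens = shlex.split(line)
        except ValueError:
            tokens = line.split()  # fall back on unbalanced quotes
        if "-c" not in tokens or flag in tokens:
            continue  # skip link lines and compile lines that have the flag
        missing.extend(t for t in tokens if t.endswith(SOURCE_SUFFIXES))
    return missing
```

Feeding it the captured output of both builds and comparing the results would show which translation units are compiled differently. Based on the description above, src/main.C would be expected to show up for the old SCons build.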

I also fixed a couple of other inconsistencies (the way that the ROOT libraries were being defined, and the way that the -DNDEBUG and -DWITH_DEBUG flags were being set) … now SCons and make handle these in the same way.

I have tested on CentOS 7: after starting and stopping the analyzer about 20 times and executing the C++11 code from the original report each time, I have seen no segfaults. When I go back to the old SCons way, without the -fPIC flag in the src/main.C compilation, the segfault issue returns. So, I am moderately confident that this is the issue, and that it is now fixed.
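Because the crash is intermittent, repeating the start/execute/exit cycle can be automated. The sketch below is one possible way to do that. It assumes that the interpreter accepts commands on standard input and that the crash still occurs when input is piped rather than typed, neither of which is confirmed in this thread. The function name and defaults are hypothetical.

```python
import subprocess


def count_crashes(cmd, stdin_text, runs=20, marker="segmentation violation", timeout=60):
    """Run `cmd` `runs` times, feeding `stdin_text`; count runs whose output contains `marker`."""
    crashes = 0
    for _ in range(runs):
        try:
            proc = subprocess.run(
                cmd, input=stdin_text, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            continue  # a hung session is not counted as a crash
        if marker in proc.stdout + proc.stderr:
            crashes += 1
    return crashes
```

A call might look like `count_crashes(["./analyzer"], "vector<int> vi { 1,2,4,5,6,9,-10,-20 }\nfor( auto& i : vi ) cout << i << endl;\n.q\n")`. The marker matches the "*** Break *** segmentation violation" line from the trace in the original report.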

I also updated the appropriate files in the SDK as well, and did a pull request of all of this.

Cheers, E.

hansenjo commented 6 years ago

Great job, Ed! I had a hunch there was one tiny little detail at the bottom of this. I had played with compiler flags, but hadn't gotten to this one yet. And it explains neatly why hcana wasn't affected - it doesn't use Podd's main.C.

Thanks for the quick fix, and I'll put your changes into GitHub as soon as I can, before I make the next release.

Ole