Memory bug patching/ASAN

EinarElen commented 1 year ago

List of memory issues caught by ASAN/UBSAN and their fix-status

[ ] PN DQM missing &
[x] Ecal veto buffer overflow missing .size()
[x] https://github.com/LDMX-Software/Hcal/issues/58
[x] Ecal veto mapsx/mapsy
[ ] Undefined noise entry in calorimeter hits
[ ] Missing case in hcal scintillator orientation
[ ] SimSpecialID UB? Probably not worth dealing with

Describe the bug A missing & in one of the helper functions in the PN DQM code meant that the particle map was copied rather than referenced which made any pointers to simparticles therein invalid after the scope of the function. This bug hasn't had any impact on the DQM output, if it did we would have seen some difference. Regardless, we should of course patch it.

The culprit is here https://github.com/LDMX-Software/ldmx-sw/blob/0079ce00ff851fd427f2191c45c72d7857c37c17/DQM/include/DQM/PhotoNuclearDQM.h#L96-L110

Where the map is supposed to be passed in by const ref.

I caught this while trying to diagnose a completely different issue by running with address sanitizer enabled. I'm wondering if it would be useful to include these kinds of tools directly in our DQM jobs. At least ASAN has a minimal performance overhead. UBSAN could be interesting, but it doesn't by default kill a job if it catches something (it has some false positives) so you would only benefit from it if you read the logs.

EinarElen commented 1 year ago

Caught another one in the Ecal veto/geometry code. When we are trying to fill the isolated hit map, we assume the list of neighbouring cells that we get from the Ecal geometry object to always have 6 entries. However, I'm seeing some events where you get a smaller number of neighbours (e.g. 4). This makes cellNbrIds[k] = ldmx::EcalID(id.layer(), cellNbrIds[k].module(), cellNbrIds[k].cell()); do a buffer overflow.

I'm not sure if the assumption that there are always 6 neighbours is correct, if so there's another bug in some other place (e.g. the ecal geometry code). If not, the solution is to just loop over the size of the vector. @tomeichlersmith maybe you know?

https://github.com/LDMX-Software/Ecal/blob/bb668db60852b86d85eab1bce910c64be02b16e2/src/Ecal/EcalVetoProcessor.cxx#L907-L922

    // Skip hits that have a readout neighbor
    // Get neighboring cell id's and try to look them up in the full cell map
    // (constant speed algo.)
    //  these ideas are only cell/module (must ignore layer)
    std::vector<ldmx::EcalID> cellNbrIds = geometry_->getNN(id);

    for (int k = 0; k < 6; k++) {
      // update neighbor ID to the current layer
      cellNbrIds[k] = ldmx::EcalID(id.layer(), cellNbrIds[k].module(),
                                   cellNbrIds[k].cell());
      // look in cell hit map to see if it is there
      if (cellMap_.find(cellNbrIds[k]) != cellMap_.end()) {
        isolatedHit = std::make_pair(false, cellNbrIds[k]);
        break;
      }
    }

EinarElen commented 1 year ago

I am also seeing an overflow in

      dis[0] = faceXY[0] - mapsx[index + step];

Which I'm guessing has to be related to the mapsx access. Index and step are defined the following way

  int step = 0;
  std::vector<float>::iterator it;
  it = std::lower_bound(mapsx.begin(), mapsx.end(), faceXY[0]);

  index = std::distance(mapsx.begin(), it);
  if (index == mapsx.size()) {
    index += -1;
  }

So we can never have an issue because index is out of bounds. However, if index is max and step > 0 then this will also be an overflow.

Step is updated below

      if (up == 0) {
        step += 1;
      } else {
        step += -1;
      }

So, I'm guessing that we are entering the first branch here when index is max which then overflows (and the same will happen for the y-map). This could actually be a bit more of an issue than the previous ones, if we have bad luck then writing to index+step for mapsx could be writing into mapsy's memory

The relevant code is here https://github.com/LDMX-Software/Ecal/blob/bb668db60852b86d85eab1bce910c64be02b16e2/src/Ecal/EcalVetoProcessor.cxx#L492-L536

EinarElen commented 1 year ago

https://github.com/LDMX-Software/Hcal/issues/58 was also caught by this

tomeichlersmith commented 1 year ago

The assumption that there are always 6 neighbors is incorrect, there will be cells on the edge of the flowers that have less than 6 neighbors so ASAN has indeed found a bug for us. (hopefully only in memory and not in values).

EinarElen commented 1 year ago

Ok, the patch for that is straight-forward. I'm less sure about the mapsx/mapsy code. I'm not entirely sure what it is doing, so I can only patch it in a brute-force manner by checking the index + step values. Do you think this issue arises from the same kind of thing?

Following the discussion at the software developers meeting, I'll convert this to an issue about memory issues in general.

tomeichlersmith commented 1 year ago

Yea, I'm struggling to understand the mapsx/mapsy code as well - I think flagging it as an issue in the Ecal repo will allow it to be on my To-Do list (probably done at the same time I break up the enormouse EcalVetoProcessor).

EinarElen commented 1 year ago

Updating this to be about all the issues that need to be dealt with if we want to be able to require ASAN clean builds for running validation. Keeping a list of all issues I've identified at the top

EinarElen commented 1 year ago

I'm currently working on some testing for the Hcal and happened to run into a couple of new issues

getScintillatorOrientation in the HcalGeometry doesn't handle the back properly for v12/v13 (missing switch case)
The EcalRecProducer and HcalRecProducer doesn't set the isNoise member for the EcalHit/HcalHits. Since these aren't given a default value in the CalorimeterHit, that means the isNoise member ends up containing junk which are virtually guaranteed to correspond to true. Notably, this actually shows up in the output from the HCal test suite which has a CHECK on isNoise that fails but this doesn't show up in the ctest output :(
Extremely minor and probably not worth fixing, but the SimSpecialID ssid2(SimSpecialID::SimSpecialType(9), 0xFEDCB); code in the DetectorID tests is undefined behavior.

EinarElen commented 1 year ago

Lastly, ASAN reports a bunch of use-after-free in the Framework tests. These seem to be related to conditions that have already been destructed. Here's a very condensed version of one of those reports (most of the catch internals removed)

==17==ERROR: AddressSanitizer: heap-use-after-free on address 0x60c000000dc0 at pc 0x55655e7eb1a2 bp 0x7fff47402580 sp 0x7fff47402570
READ of size 8 at 0x60c000000dc0 thread T0
    #0 0x55655e7eb1a1 in void conditions::test::matchesMeta<conditions::DoubleTableCondition>(conditions::DoubleTableCondition const&, conditions::DoubleTableCondition const&) /home/einarelen/ldmx/ldmx-sw/Conditions/test/TestConditions.cxx:27
    #1 0x55655e7c237a in conditions::test::matchesAll(conditions::DoubleTableCondition const&, conditions::DoubleTableCondition const&) /home/einarelen/ldmx/ldmx-sw/Conditions/test/TestConditions.cxx:32

0x60c000000dc0 is located 0 bytes inside of 128-byte region [0x60c000000dc0,0x60c000000e40)
freed by thread T0 here:
    #0 0x7f60155c322f in operator delete(void*, unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:172
    #1 0x55655e7e646c in conditions::DoubleTableCondition::~DoubleTableCondition() /home/einarelen/ldmx/ldmx-sw/Conditions/include/Conditions/SimpleTableCondition.h:193
    #2 0x7f601421c270 in framework::ConditionsObjectProvider::releaseConditionsObject(framework::ConditionsObject const*) /home/einarelen/ldmx/ldmx-sw/Framework/include/Framework/ConditionsObjectProvider.h:90
    #3 0x7f6010050ebe in framework::Conditions::getConditionPtr(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/src/Framework/Conditions.cxx:91
    #4 0x55655e7f6739 in conditions::DoubleTableCondition const& framework::Conditions::getCondition<conditions::DoubleTableCondition>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/include/Framework/Conditions.h:84

previously allocated by thread T0 here:
    #0 0x7f60155c21c7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7f6010d6c967 in conditions::SimpleCSVTableProvider::getCondition(ldmx::EventHeader const&) /home/einarelen/ldmx/ldmx-sw/Conditions/src/Conditions/SimpleCSVTableProvider.cxx:190
    #2 0x7f60100502c2 in framework::Conditions::getConditionPtr(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/src/Framework/Conditions.cxx:69
    #3 0x55655e7f6739 in conditions::DoubleTableCondition const& framework::Conditions::getCondition<conditions::DoubleTableCondition>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/include/Framework/Conditions.h:84
    #4 0x55655e7cc8b1 in CATCH2_INTERNAL_TEST_0 /home/einarelen/ldmx/ldmx-sw/Conditions/test/TestConditions.cxx:225

@tomeichlersmith I know you dealt with a bunch of issues with creating multiple instances of the Process object when working on the docker upgrade. Do you think this could be a similar issue but for conditions?

EinarElen commented 1 year ago

Hmm, might not be the same kind of issue. Tried running the tests one by one and you only get the problem for the Conditions test

EinarElen commented 1 year ago

One last thing, I keep seeing TypeError: 'float' object cannot be interpreted as an integer at the start of the process. I'm not sure where this comes from or what it is about.

EinarElen commented 1 year ago

In good news, after patching all of the above (hacking around the mapsx/mapsy-issue), we are ASAN-safe and mostly UBSAN-safe. Tried all of the validation configs

tomeichlersmith commented 1 year ago

The TypeError: 'float' object cannot be interpreted as an integer originates in ConfigurePython. I am unsure on where it is coming from however since it is not crashing the program either.

For the use-after-free issues, I'm guessing that it has something to do with the tests where we load conditions via the ConfigurePython->Process pipeline and then test them.

https://github.com/LDMX-Software/Conditions/blob/4056cf8684bd6eea0e13e4ab3e24958ccf8124e3/test/TestConditions.cxx#L182-L195

tvami commented 3 months ago

Ecal veto buffer overflow missing .size()

I put the tick in for this as this was fixed already in https://github.com/LDMX-Software/ldmx-sw/commit/5c6bee2126998962e6af45bae2b4bd0498c4b32f

tvami commented 3 months ago

@EinarElen, when you have some time can you send instruction on how you ran ASAN?

EinarElen commented 3 months ago

You need to build ldmx-sw with some additional cmake commands, in particular

-DENABLE_SANITIZER_ADDRESS=ON

There is some additional documentation in the cmake module, I think you'll want the stuff about recovering on error

EinarElen commented 3 months ago

https://github.com/LDMX-Software/ldmx-sw/blob/trunk/cmake%2FSanitizers.cmake

tvami commented 2 weeks ago

Hey @EinarElen I'm back to this a bit again, I ran

ldmx cmake -B build -S . -DENABLE_SANITIZER_ADDRESS=ON
ldmx cmake --build build --target install -- -j$(nproc)

and it didnt show anything new. Are you doing something else too?

EinarElen commented 1 week ago

It is a runtime thing, running with the sanitizer settings when you build enables it. So just run your simulation and it should crash at some point :)

tvami commented 1 week ago

OK so good news it that I can ran this. Bad news is that the first thing I see, I have very little idea on what to do with. The only thing I understand is that there is potentially a bug in the EcalVeto. This is what I see

=7==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6290002761fc at pc 0x7f0df47a8a55 bp 0x7ffd2ab849b0 sp 0x7ffd2ab849a0
READ of size 4 at 0x6290002761fc thread T0
    #0 0x7f0df47a8a54 in ecal::EcalVetoProcessor::produce(framework::Event&) (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libEcal.so+0x1a8a54)
    #1 0x7f0e01b8444a in framework::Process::process(int, framework::Event&) const (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libFramework.so+0x18444a)
    #2 0x7f0e01b8b6a6 in framework::Process::run() (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libFramework.so+0x18b6a6)
    #3 0x5573a5fe23a6 in main (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/bin/fire+0x93a6)
    #4 0x7f0dff229d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #5 0x7f0dff229e3f in __libc_start_main_impl ../csu/libc-start.c:392
    #6 0x5573a5fe2b54 in _start (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/bin/fire+0x9b54)

0x6290002761fc is located 4 bytes to the left of 16384-byte region [0x629000276200,0x62900027a200)
allocated by thread T0 here:
    #0 0x7f0e01eb61e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7f0e00e3a27d in void std::vector<float, std::allocator<float> >::_M_realloc_insert<float const&>(__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, float const&) (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libSimCore_Event.so+0x427d)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libEcal.so+0x1a8a54) in ecal::EcalVetoProcessor::produce(framework::Event&)
Shadow bytes around the buggy address:
  0x0c5280046be0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5280046bf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5280046c00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5280046c10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5280046c20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c5280046c30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
  0x0c5280046c40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5280046c50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5280046c60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5280046c70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5280046c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc

@EinarElen any ideas on how to extract more info on what's going on? Alternatively I'll do some couts here and there

EinarElen commented 1 week ago

Usually it is worth to run a debug build to get a bit more info out of it, but regardless it still tells you that there is a buffer overflow in EcalVetoProcessor::produce.

AFAIK it is this

https://github.com/LDMX-Software/ldmx-sw/issues/1166#issuecomment-1570187979

tvami commented 1 week ago

Ok I did a few couts, indeed it's the one with maps indices.

A good event

 index 1304  step 0
dis[0]: -3.40018
dis[1]: 74.6287

 index 1304  step 1
dis[0]: -3.40018
dis[1]: 59.6287

 index 1304  step 2
dis[0]: -3.40018
dis[1]: 44.6287

 index 1304  step 3
dis[0]: -3.40018
dis[1]: 29.6287

 index 1304  step 4
dis[0]: -3.40018
dis[1]: 14.6287

 index 1304  step 5
dis[0]: -3.40018
dis[1]: -0.371328

the problematic event

 index 0  step 0
dis[0]: -19.848
dis[1]: 96.0028

 index 0  step -1
>> breaks here

Then I find that the faceXY[0]: -262.335 which is outside of the range and thus results to the 0 index.

Another example, is where faceXY = 290.302 and

 recoilPos[0] -276.52 recoilP[0] -60.2409 recoilP[2] 16.6094

so already he recoilPos is outside the ECAL

tvami commented 4 days ago

OK resolved the

Ecal veto mapsx/mapsy

in https://github.com/LDMX-Software/ldmx-sw/pull/1410

I have student Ananda who'll look into the noise generation, so in that PR I think we can take the

Undefined noise entry in calorimeter hits

part.

I'll prob have the DQM one on my next todo list.

As for the other two points, I dont know the HCAL enough to fix that one, while for the test, do you have another suggestion to test instead @EinarElen ?

LDMX-Software / ldmx-sw

Memory bug patching/ASAN #1166