astrorama / SourceXtractorPlusPlus

SourceXtractor++, the next generation SExtractor
https://astrorama.github.io/SourceXtractorPlusPlus/
GNU Lesser General Public License v3.0

SourceXtractor++ crashes on SDC-DE #493

Closed ayllon closed 7 months ago

ayllon commented 2 years ago

However, it does not crash on SDC-CH with the same input, configuration, and binaries (cvmfs).

The crash can be a SIGSEGV or a SIGBUS. The former is odd.

The undefined-behavior and address sanitizers do not complain when run over a multiframe detection+fit (the one from the litmus tests).

I am creating this issue to keep track of this problem.

ayllon commented 2 years ago

Maybe related to #466?

mkuemmel commented 2 years ago

#466 happens early. This one crashes when some objects have already been processed (typically ~30%, but the fraction varies).

Jose (sysadmin@SDC-DE) fixed a system problem that he thinks could cause this crash. We are running a check right now. Let's hope...

ayllon commented 2 years ago

Ah! What was it? I am also running now with thread sanitizer, but it does not complain either.

mkuemmel commented 2 years ago

Dear Martin, I just saw that the user that launches the pilots in the SDC-DE PROD ran out of quota ... sorry, this happens maybe once every 1 or 2 years and I didn't check this time until now.

@ayllon was on cc. But I don't see why it would fail while successively reducing the list of output properties, but then work again with the original, small list, all due to a quota problem...

mkuemmel commented 2 years ago

The test did not work out and we still get crashes with -11 (i.e. signal 11, SIGSEGV).

We need to figure out a way to get a core dump.
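
One possible way to get there, sketched under assumptions: raise the core-size limit in the child process just before it starts, and decode the negative return code into a signal name. The helper names (`enable_core_dumps`, `describe_returncode`, `run_with_core_dumps`) are hypothetical and not part of SourceXtractor++ or the pipeline. Whether a core file is actually written also depends on the hard limit and on the system's core pattern, which a batch node may restrict.

```python
import resource
import signal
import subprocess


def enable_core_dumps():
    """Raise the soft core-size limit to the hard limit for the calling process."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # The soft limit can always be raised up to the hard limit without privileges.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return hard


def describe_returncode(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # subprocess reports "killed by signal N" as -N
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_with_core_dumps(command):
    """Run `command` (a list of strings) with core dumps enabled in the child."""
    # preexec_fn runs in the child right before exec, so the parent's limits
    # are left untouched.
    result = subprocess.run(command, preexec_fn=enable_core_dumps)
    return describe_returncode(result.returncode)
```

Calling `run_with_core_dumps([...])` with the usual launch command would then report, for example, that the process was killed by SIGSEGV rather than a bare -11. If the hard limit on the node is 0, the limit has to be raised by the site administrators first.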

mkuemmel commented 2 years ago

I don't think that Alexandria is the problem.

In the Euclid detection I use Moffat grouping, and I think this processing is what grabs the many cores. Interestingly, it runs with use-cleaning=0, which would mean the grouping has no effect (but I tested it when deciding to use it...).

Another possible weak point is that we still use develop 0.18, which is quite old (end of April). I think I/we should switch. Which makes more sense: 0.18.0 or 0.19?

ayllon commented 2 years ago

I finally managed to get (random) crashes when running on EDEN using test_multi_modelfitting.py.

The crash does not always happen, but from time to time it happens inside shiftResizeLancszosFast:

#0  0x00007ffff7c45d95 in SourceXtractor::shiftResizeLancszosFast(std::shared_ptr<SourceXtractor::VectorImage<float> > const&, std::shared_ptr<SourceXtractor::VectorImage<float> >&, double, double, double) () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#1  0x00007ffff7c46b45 in ModelFitting::ImageTraits<std::shared_ptr<SourceXtractor::VectorImage<float> > >::addImageToImage(std::shared_ptr<SourceXtractor::VectorImage<float> >&, std::shared_ptr<SourceXtractor::VectorImage<float> > const&, double, double, double) ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#2  0x00007ffff7d02c1d in void ModelFitting::_impl::addExtendedModels<std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::FrameModelPsfContainer<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > > > >(std::shared_ptr<SourceXtractor::VectorImage<float> >&, std::vector<std::shared_ptr<ModelFitting::ExtendedModel<std::shared_ptr<SourceXtractor::VectorImage<float> > > >, std::allocator<std::shared_ptr<ModelFitting::ExtendedModel<std::shared_ptr<SourceXtractor::VectorImage<float> > > > > > const&, ModelFitting::FrameModelPsfContainer<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > > >&, double) ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#3  0x00007ffff7d031d4 in ModelFitting::FrameModel<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> > >::rasterToImage(std::shared_ptr<SourceXtractor::VectorImage<float> >&) ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#4  0x00007ffff7d0362e in ModelFitting::DataVsModelResiduals<std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::FrameModel<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::AsinhChiSquareComparator>::populateResidualBlock(double*) () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#5  0x00007ffff6c7c048 in ModelFitting::ResidualEstimator::populateResiduals(double*) const ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libModelFitting.so
#6  0x00007ffff596933f in dlevmar_fdif_forw_jac_approx () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/lib/liblevmar.so.2.6
#7  0x00007ffff59636e9 in dlevmar_dif () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/lib/liblevmar.so.2.6
#8  0x00007ffff6c7b7fd in ModelFitting::LevmarEngine::solveProblem(ModelFitting::EngineParameterManager&, ModelFitting::ResidualEstimator&) ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libModelFitting.so
#9  0x00007ffff7cff8b6 in SourceXtractor::MoffatModelFittingTask::computeProperties(SourceXtractor::SourceInterface&) const ()

or inside pow() (?!)

#0  0x00007ffff5fc6440 in __ieee754_pow_sse2 () from /lib64/libm.so.6
#1  0x00007ffff6c7db61 in ModelFitting::FlattenedMoffatComponent::getValue(double, double) ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libModelFitting.so
#2  0x00007ffff7d03522 in void ModelFitting::_impl::addExtendedModels<std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::FrameModelPsfContainer<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > > > >(std::shared_ptr<SourceXtractor::VectorImage<float> >&, std::vector<std::shared_ptr<ModelFitting::ExtendedModel<std::shared_ptr<SourceXtractor::VectorImage<float> > > >, std::allocator<std::shared_ptr<ModelFitting::ExtendedModel<std::shared_ptr<SourceXtractor::VectorImage<float> > > > > > const&, ModelFitting::FrameModelPsfContainer<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > > >&, double) ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#3  0x00007ffff7d03c74 in ModelFitting::FrameModel<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> > >::rasterToImage(std::shared_ptr<SourceXtractor::VectorImage<float> >&) ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#4  0x00007ffff7d040ce in ModelFitting::DataVsModelResiduals<std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::FrameModel<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::AsinhChiSquareComparator>::populateResidualBlock(double*) () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#5  0x00007ffff6c7d048 in ModelFitting::ResidualEstimator::populateResiduals(double*) const ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libModelFitting.so
#6  0x00007ffff596a33f in dlevmar_fdif_forw_jac_approx () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/lib/liblevmar.so.2.6
#7  0x00007ffff59646e9 in dlevmar_dif () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/lib/liblevmar.so.2.6
#8  0x00007ffff6c7c7fd in ModelFitting::LevmarEngine::solveProblem(ModelFitting::EngineParameterManager&, ModelFitting::ResidualEstimator&) ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libModelFitting.so
#9  0x00007ffff7d00356 in SourceXtractor::MoffatModelFittingTask::computeProperties(SourceXtractor::SourceInterface&) const ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#10 0x00007ffff77537ac in SourceXtractor::SourceWithOnDemandProperties::getProperty(SourceXtractor::PropertyId const&) const ()
   from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEFramework.so
#11 0x00007ffff7d04bc5 in SourceXtractor::MoffatModelEvaluatorTask::computeProperties(SourceXtractor::SourceInterface&) const ()

The fact that pow crashes makes me think there is stack corruption. MoffatFitting is on the backtrace in both situations, which is suspicious.

SourceXtractor 0.17.0 is affected, by the way, so this is not new.

mkuemmel commented 2 years ago

Moffat grouping has not been used frequently; I guess that's the reason this was not discovered earlier.

ayllon commented 2 years ago

This only happens in EDEN, though. We run this test for each pull request, release, etc., and it didn't trigger there.

I am starting to suspect the Moffat fitting gives some bad values (i.e. a completely bonkers, or plainly invalid, x or y fit), and we end up writing outside the image buffer.

Might it have something to do with the BLAS implementation? EDEN uses OpenBLAS and we run the tests with LAPACK.
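
To illustrate the suspected out-of-bounds write, here is a minimal sketch of the kind of guard that would reject a non-finite or wildly off fitted position before a model stamp is added to an image. It is not the SourceXtractor++ code: the function `overlap_region` and its arguments are hypothetical, and the stamp is assumed to be centred on the integer pixel containing (x, y).

```python
import math


def overlap_region(image_shape, stamp_shape, x, y):
    """Return ((row0, row1), (col0, col1)) of image pixels the stamp may touch.

    Returns None when the fitted position is not finite or the stamp falls
    entirely outside the image, so the caller can skip the write.
    """
    height, width = image_shape
    stamp_h, stamp_w = stamp_shape

    # A diverged fit can produce NaN or infinity; never turn those into indices.
    if not (math.isfinite(x) and math.isfinite(y)):
        return None

    # Unclipped footprint of the stamp in image coordinates.
    col0 = math.floor(x) - stamp_w // 2
    row0 = math.floor(y) - stamp_h // 2
    col1 = col0 + stamp_w
    row1 = row0 + stamp_h

    # Clip the footprint to the image bounds.
    c0, c1 = max(col0, 0), min(col1, width)
    r0, r1 = max(row0, 0), min(row1, height)
    if c0 >= c1 or r0 >= r1:
        return None
    return (r0, r1), (c0, c1)
```

A caller would write only into the returned half-open ranges, so a fit that lands far outside the frame results in a skipped write instead of memory corruption.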

ayllon commented 2 years ago

I think my RAM is fried. I get random gcc crashes, random sx crashes, and random file corruption.

For instance my SourceXtractorEnvironment.xml has this

<env:prepend variable="LD_LIBRARY_PATH">/cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/x86_64-conda-linux-gnu/sysroot/usrÿÎm*</env:prepend>

Not helping to debug this, not helping...

mkuemmel commented 7 months ago

That has been solved for some time now.

The problem was the competing parallelization in SE++ and OpenMP.
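
The thread does not say how the conflict was resolved. One common mitigation for this kind of clash, shown here only as an assumption and not as the project's actual fix, is to pin OpenMP (and OpenBLAS, which EDEN uses) to a single thread through the standard `OMP_NUM_THREADS` and `OPENBLAS_NUM_THREADS` environment variables before launching the run. The helper name `single_threaded_env` is hypothetical.

```python
import os


def single_threaded_env(base_env=None):
    """Return a copy of the environment with OpenMP/OpenBLAS limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the application already runs its own worker threads, so
    # nested OpenMP/OpenBLAS thread pools are switched off here.
    env["OMP_NUM_THREADS"] = "1"
    env["OPENBLAS_NUM_THREADS"] = "1"
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` together with the usual launch command, which leaves the parent environment unchanged.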