ayllon closed this issue 7 months ago
Maybe related to #466?
Jose (sysadmin@SDC-DE) fixed a system problem that he thinks could cause this crash. We are running a check right now. Let's hope...
Ah! What was it?
I am also running now with thread sanitizer, but it does not complain either.
Dear Martin, I just saw that the user that launches the pilots in the SDC-DE PROD ran out of quota ... sorry, this happens maybe once every 1 or 2 years and I didn't check this time until now.
@ayllon was on cc. But I don't see why it would fail while successively reducing the list of output-properties, but then work again with the original, small list, all due to a quota problem...
The test did not work out and we still get crashes with -11.
We need to figure out a way to get a core dump.
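One possible way to get there, as a minimal sketch: raise the core-size soft limit of the launching process to its hard limit, so that child processes inherit it. The helper name `enable_core_dumps` is hypothetical, and this assumes the job is started from a Python wrapper on a POSIX system. Where the core file ends up (or whether it is piped to a handler) still depends on the kernel's `core_pattern` setting and on the hard limit configured on the node.

```python
import resource


def enable_core_dumps():
    """Raise the soft RLIMIT_CORE to the hard limit for this process.

    Child processes started afterwards inherit the limit. If the hard
    limit is 0, core dumps stay disabled and an admin has to raise it.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Raising the soft limit up to the hard limit needs no privileges.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = enable_core_dumps()
    print("core size limit (soft, hard):", soft, hard)
```

The same function could be passed as `preexec_fn` to `subprocess.Popen` if only the child should be affected.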
I don't think that Alexandria is the problem.
In the Euclid detection I use moffat grouping, and I think this processing grabs the many cores. Interestingly it is use-cleaning=0, which would mean the grouping has no effect (but I tested it when deciding to use it...).
Another possible weak point is that we still use develop 0.18, which is quite old (end of April). I think I/we should switch. What makes more sense? 0.18.0 or 0.19?
I finally managed to get (random) crashes when running on EDEN using test_multi_modelfitting.py.
The crash does not always happen, but from time to time it happens inside shiftResizeLancszosFast
#0 0x00007ffff7c45d95 in SourceXtractor::shiftResizeLancszosFast(std::shared_ptr<SourceXtractor::VectorImage<float> > const&, std::shared_ptr<SourceXtractor::VectorImage<float> >&, double, double, double) () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#1 0x00007ffff7c46b45 in ModelFitting::ImageTraits<std::shared_ptr<SourceXtractor::VectorImage<float> > >::addImageToImage(std::shared_ptr<SourceXtractor::VectorImage<float> >&, std::shared_ptr<SourceXtractor::VectorImage<float> > const&, double, double, double) ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#2 0x00007ffff7d02c1d in void ModelFitting::_impl::addExtendedModels<std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::FrameModelPsfContainer<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > > > >(std::shared_ptr<SourceXtractor::VectorImage<float> >&, std::vector<std::shared_ptr<ModelFitting::ExtendedModel<std::shared_ptr<SourceXtractor::VectorImage<float> > > >, std::allocator<std::shared_ptr<ModelFitting::ExtendedModel<std::shared_ptr<SourceXtractor::VectorImage<float> > > > > > const&, ModelFitting::FrameModelPsfContainer<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > > >&, double) ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#3 0x00007ffff7d031d4 in ModelFitting::FrameModel<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> > >::rasterToImage(std::shared_ptr<SourceXtractor::VectorImage<float> >&) ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#4 0x00007ffff7d0362e in ModelFitting::DataVsModelResiduals<std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::FrameModel<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::AsinhChiSquareComparator>::populateResidualBlock(double*) () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#5 0x00007ffff6c7c048 in ModelFitting::ResidualEstimator::populateResiduals(double*) const ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libModelFitting.so
#6 0x00007ffff596933f in dlevmar_fdif_forw_jac_approx () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/lib/liblevmar.so.2.6
#7 0x00007ffff59636e9 in dlevmar_dif () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/lib/liblevmar.so.2.6
#8 0x00007ffff6c7b7fd in ModelFitting::LevmarEngine::solveProblem(ModelFitting::EngineParameterManager&, ModelFitting::ResidualEstimator&) ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.17.0/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libModelFitting.so
#9 0x00007ffff7cff8b6 in SourceXtractor::MoffatModelFittingTask::computeProperties(SourceXtractor::SourceInterface&) const ()
or inside pow() (?!)
#0 0x00007ffff5fc6440 in __ieee754_pow_sse2 () from /lib64/libm.so.6
#1 0x00007ffff6c7db61 in ModelFitting::FlattenedMoffatComponent::getValue(double, double) ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libModelFitting.so
#2 0x00007ffff7d03522 in void ModelFitting::_impl::addExtendedModels<std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::FrameModelPsfContainer<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > > > >(std::shared_ptr<SourceXtractor::VectorImage<float> >&, std::vector<std::shared_ptr<ModelFitting::ExtendedModel<std::shared_ptr<SourceXtractor::VectorImage<float> > > >, std::allocator<std::shared_ptr<ModelFitting::ExtendedModel<std::shared_ptr<SourceXtractor::VectorImage<float> > > > > > const&, ModelFitting::FrameModelPsfContainer<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > > >&, double) ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#3 0x00007ffff7d03c74 in ModelFitting::FrameModel<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> > >::rasterToImage(std::shared_ptr<SourceXtractor::VectorImage<float> >&) ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#4 0x00007ffff7d040ce in ModelFitting::DataVsModelResiduals<std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::FrameModel<ModelFitting::NullPsf<std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> > >, std::shared_ptr<SourceXtractor::VectorImage<float> >, ModelFitting::AsinhChiSquareComparator>::populateResidualBlock(double*) () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#5 0x00007ffff6c7d048 in ModelFitting::ResidualEstimator::populateResiduals(double*) const ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libModelFitting.so
#6 0x00007ffff596a33f in dlevmar_fdif_forw_jac_approx () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/lib/liblevmar.so.2.6
#7 0x00007ffff59646e9 in dlevmar_dif () from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/lib/liblevmar.so.2.6
#8 0x00007ffff6c7c7fd in ModelFitting::LevmarEngine::solveProblem(ModelFitting::EngineParameterManager&, ModelFitting::ResidualEstimator&) ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libModelFitting.so
#9 0x00007ffff7d00356 in SourceXtractor::MoffatModelFittingTask::computeProperties(SourceXtractor::SourceInterface&) const ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEImplementation.so
#10 0x00007ffff77537ac in SourceXtractor::SourceWithOnDemandProperties::getProperty(SourceXtractor::PropertyId const&) const ()
from /cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/opt/euclid/SourceXtractorPlusPlus/0.19/InstallArea/x86_64-conda_cos6-gcc93-o2g/lib/libSEFramework.so
#11 0x00007ffff7d04bc5 in SourceXtractor::MoffatModelEvaluatorTask::computeProperties(SourceXtractor::SourceInterface&) const ()
The fact that pow crashes makes me think there is stack corruption. MoffatFitting is on the backtrace in both situations, which is suspicious.
SourceXtractor 0.17.0 is affected, by the way, so this is not new.
Moffat grouping has not been used frequently; I guess that's the reason this was not discovered earlier.
This only happens in EDEN, though. We run this test for each pull request, release, etc... and it didn't trigger.
I am starting to suspect the Moffat fitting gives some bad values (i.e. a completely bonkers, or plainly invalid, x or y fit), and we end up writing outside the image buffer.
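To illustrate the kind of guard this suspicion points at, here is a minimal sketch that is not the SourceXtractor++ code: before a model stamp centred on a fitted position is added to an image, the destination window is clipped to the image, and a non-finite or far-off fit is rejected. The function name `clipped_window` and all its parameters are hypothetical.

```python
import math


def clipped_window(x, y, stamp_w, stamp_h, img_w, img_h):
    """Return (x0, y0, x1, y1), the part of a stamp centred at (x, y)
    that lies inside an img_w x img_h image, or None if nothing does.

    A NaN/inf position, or one far outside the image, yields None
    instead of an out-of-bounds write.
    """
    if not (math.isfinite(x) and math.isfinite(y)):
        return None
    x0 = int(math.floor(x - stamp_w / 2))
    y0 = int(math.floor(y - stamp_h / 2))
    x1, y1 = x0 + stamp_w, y0 + stamp_h
    # Clip the half-open window [x0, x1) x [y0, y1) to the image.
    x0c, y0c = max(x0, 0), max(y0, 0)
    x1c, y1c = min(x1, img_w), min(y1, img_h)
    if x0c >= x1c or y0c >= y1c:
        return None
    return x0c, y0c, x1c, y1c
```

A caller would skip the stamp when the result is `None` and otherwise only touch pixels inside the returned window.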
It may have something to do with the BLAS implementation? EDEN uses OpenBLAS and we run the tests with LAPACK.
I think my RAM is fried. I get random gcc crashes, random sx crashes, and random file corruption.
For instance my SourceXtractorEnvironment.xml has this
<env:prepend variable="LD_LIBRARY_PATH">/cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-3.0/x86_64-conda-linux-gnu/sysroot/usrÿÎm*</env:prepend>
Not helping to debug this, not helping...
That has been solved for some time now.
The problem was the competing parallelization in SE++ and OpenMP.
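For anyone hitting the same symptom, one common way to rule out this kind of competition is to pin the OpenMP and OpenBLAS thread pools to a single thread for the child process. The sketch below only shows that workaround and is not necessarily the fix that was applied here. `OMP_NUM_THREADS` and `OPENBLAS_NUM_THREADS` are the standard environment variables read by OpenMP runtimes and OpenBLAS; the helper name `single_threaded_blas_env` is hypothetical.

```python
import os


def single_threaded_blas_env(base_env=None):
    """Return a copy of the environment with the OpenMP and OpenBLAS
    thread pools limited to one thread each, so that only the
    application's own threading is active."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    env["OPENBLAS_NUM_THREADS"] = "1"
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the run under test.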
However, it does not crash on SDC-CH with the same input, configuration and binaries (cvmfs).
The crash can be a SIGSEGV or a SIGBUS. The former is odd.
undefined and address sanitizers do not complain when run over a multiframe detection+fit (the one from the litmus tests).
I create this issue to keep track of this problem.
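To tell the two signals apart in job logs, a small helper can translate a return code into a signal name. This is a sketch under one assumption: that the reported code follows the Python `subprocess` convention, where a negative return code `-N` means the child was terminated by signal `N` (so `-11` is SIGSEGV on Linux). The function name `describe_returncode` is hypothetical.

```python
import signal


def describe_returncode(returncode):
    """Map a subprocess-style return code to a readable description.

    Negative values mean "terminated by signal -returncode".
    """
    if returncode < 0:
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            # Signal number unknown on this platform.
            return f"signal {-returncode}"
    return f"exit status {returncode}"
```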