That stack trace is weird. I do not think it makes sense to have that exception coming from saveTile. The fact that there is a shared_ptr destructor on the stack trace makes me think there is a race condition here with the check images, similar to the one we had already fixed when reading. WriteableBufferedImage still has the m_current_tile that was problematic.
Never mind, it is protected by a mutex. Odd...
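For reference, the pattern under discussion looks roughly like the sketch below. This is a minimal, self-contained illustration only and not the actual implementation: WriteableBufferedImage, m_current_tile and saveTile are names from this thread, while the tile type, tile size and helper functions are assumptions made for the example. The point is that any swap of m_current_tile (and the resulting shared_ptr destruction) happens under the same mutex as the writes, which is why the suspected race should not be possible.

```cpp
// Illustrative sketch only, NOT the real code: names from the thread are reused,
// everything else (tile size, helpers, types) is assumed for this example.
#include <iostream>
#include <memory>
#include <mutex>
#include <vector>

namespace {
constexpr int kTileSize = 4;  // assumed tile size, for the example only

struct ImageTile {
  int x0, y0;                 // tile origin in image coordinates
  std::vector<double> data;
  ImageTile(int x, int y) : x0(x), y0(y), data(kTileSize * kTileSize, 0.0) {}
  void setValue(int x, int y, double v) { data[(y - y0) * kTileSize + (x - x0)] = v; }
};
}  // namespace

class WriteableBufferedImage {
public:
  void setValue(int x, int y, double value) {
    // The mutex is what rules out the suspected race: the shared_ptr to the
    // current tile can only be swapped (and the old tile flushed via saveTile)
    // while the lock is held, so a concurrent writer cannot see a half-destroyed tile.
    std::lock_guard<std::mutex> lock(m_mutex);
    if (!isInCurrentTile(x, y)) {
      if (m_current_tile) saveTile(*m_current_tile);  // flush the previous tile
      m_current_tile = std::make_shared<ImageTile>(x - x % kTileSize, y - y % kTileSize);
    }
    m_current_tile->setValue(x, y, value);
  }

private:
  bool isInCurrentTile(int x, int y) const {
    return m_current_tile && x >= m_current_tile->x0 && x < m_current_tile->x0 + kTileSize &&
           y >= m_current_tile->y0 && y < m_current_tile->y0 + kTileSize;
  }
  static void saveTile(const ImageTile& tile) {
    // Stand-in for writing the tile back to the check image on disk.
    std::cout << "saving tile at (" << tile.x0 << ", " << tile.y0 << ")\n";
  }

  std::mutex m_mutex;
  std::shared_ptr<ImageTile> m_current_tile;
};

int main() {
  WriteableBufferedImage img;
  img.setValue(1, 1, 3.0);  // creates the first tile
  img.setValue(9, 9, 5.0);  // crosses a tile boundary, so the old tile is saved first
}
```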
From the timestamps I see that it was updating/creating some model check images when that happened. So that's likely where it came from.
Initially I thought it was a hiccup, also since it happened only after about 24h.
Note that I locally merged feature/inplace_dft into develop to have the smallest RAM footprint. But that should not be important here.
Another indication that there is still a problem (race condition, file access) is that I had another crash, without a stack trace, yesterday when starting a Disk+Bulge run. It happened much earlier, roughly when the first measurement was starting.
TSan does not report any race condition that could trigger this. It says something about openblas and dlevmar_dif, but that's probably a false positive.
Overnight the Disk+Bulge processing also finished with a crash and the same error line. Here, too, the last thing it did or wanted to do was updating a model image. crash3.txt
What about taking the error message at face value, in the sense that a model image is being updated and for at least one source the FlexibleModelFitting property with the model parameters 'got lost'? Why that should happen I can only speculate: maybe the object was already written to the output, or there is a memory issue?
In the past I have indeed seen glitches where computing a property would fail and the program got "confused", unable to tell a failure apart from a property not being found. What confuses me is the backtrace: that exception cannot be thrown from there.
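To make that ambiguity concrete: if a property framework catches an internal failure during on-demand computation and re-reports it as "property not found", the two situations become indistinguishable for the caller. The sketch below is purely illustrative, using assumed generic names (PropertyHolder, PropertyNotFoundException, a string-valued getProperty); it is not the project's actual property API.

```cpp
// Illustrative sketch of the "failed vs. missing" ambiguity; all names and
// signatures are assumptions for the example.
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

struct PropertyNotFoundException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

class PropertyHolder {
public:
  // Returns the named property, computing it on demand the first time.
  const std::string& getProperty(const std::string& name) {
    auto it = m_properties.find(name);
    if (it == m_properties.end()) {
      try {
        auto value = compute(name);  // may throw for reasons unrelated to "missing"
        it = m_properties.emplace(name, std::move(value)).first;
      } catch (const std::exception&) {
        // If an internal computation failure is swallowed and re-reported like this,
        // the caller cannot tell "the fit failed" apart from "the property was never
        // produced": the ambiguity described above.
        throw PropertyNotFoundException("Property not found: " + name);
      }
    }
    return it->second;
  }

private:
  static std::string compute(const std::string& name) {
    if (name == "FlexibleModelFitting") {
      throw std::runtime_error("model fitting failed");  // simulated internal failure
    }
    return "value of " + name;
  }

  std::map<std::string, std::string> m_properties;
};

int main() {
  PropertyHolder holder;
  try {
    holder.getProperty("FlexibleModelFitting");
  } catch (const PropertyNotFoundException& e) {
    std::cout << e.what() << "\n";  // looks like a missing property, not a failed fit
  }
}
```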
Without check images the Disk+Sersic fitting run is over the hump and already 87% finished! Then we also get a complete diagram of the RAM usage.
Afterwards I'll do a Disk+Bulge run.
The Disk+Bulge run is over the hump as well and is 75% finished after 2.5 days. Tomorrow morning it should be done, and then we can compare the RAM consumption and so on for the two big runs.
The Disk+Bulge run has been chewing on the last object alone for about 12h. No idea how long it has been in the pool already or when it is going to finish...
Disk+Bulge did not finish cleanly; I had to break it off before the last object was done.
But there, too, this problem does not occur without the check images.
Pushed the tarball problem369.tgz with all necessary files to irods.
Interestingly, #381 seems to solve this issue. It ran through, and all the model and residual images do exist.
Halfway through the run I started monitoring the RAM consumption, which looks like this:
The entire run took ~49 hours, so the monitoring is missing the first ~27h.
It did run through, so I'll be closing this one.
Running Disk+Sersic fitting on the entire 19x19k Euclid dataset I get the error below (crash1.txt). At first I thought it was some kind of glitch in the OS or the file system, but it is reproducible (--> crash2.txt).
At that point of the processing, after 24h, the output table already contains 24k extractions with FlexibleModelFitting, so I don't really understand it.