That stack trace is weird. I do not think it makes sense to have that exception coming from saveTile. The fact that there is a shared_ptr destructor on the stack trace makes me think there is a race condition here with the check images, similar to the one we had already fixed when reading. WriteableBufferedImage still has the m_current_tile that was problematic.
Never mind, it is protected by a mutex. Odd...
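For reference, the pattern under discussion looks roughly like the sketch below. This is a minimal, self-contained illustration only and not the actual implementation: WriteableBufferedImage, m_current_tile and saveTile are names from this thread, while the tile type, tile size and helper functions are assumptions made for the example. The point is that any swap of m_current_tile (and the resulting shared_ptr destruction) happens under the same mutex as the writes, which is why the suspected race should not be possible.

```cpp
// Illustrative sketch only, NOT the real code: names from the thread are reused,
// everything else (tile size, helpers, types) is assumed for this example.
#include <iostream>
#include <memory>
#include <mutex>
#include <vector>

namespace {
constexpr int kTileSize = 4;  // assumed tile size, for the example only

struct ImageTile {
  int x0, y0;                 // tile origin in image coordinates
  std::vector<double> data;
  ImageTile(int x, int y) : x0(x), y0(y), data(kTileSize * kTileSize, 0.0) {}
  void setValue(int x, int y, double v) { data[(y - y0) * kTileSize + (x - x0)] = v; }
};
}  // namespace

class WriteableBufferedImage {
public:
  void setValue(int x, int y, double value) {
    // The mutex is what rules out the suspected race: the shared_ptr to the
    // current tile can only be swapped (and the old tile flushed via saveTile)
    // while the lock is held, so a concurrent writer cannot see a half-destroyed tile.
    std::lock_guard<std::mutex> lock(m_mutex);
    if (!isInCurrentTile(x, y)) {
      if (m_current_tile) saveTile(*m_current_tile);  // flush the previous tile
      m_current_tile = std::make_shared<ImageTile>(x - x % kTileSize, y - y % kTileSize);
    }
    m_current_tile->setValue(x, y, value);
  }

private:
  bool isInCurrentTile(int x, int y) const {
    return m_current_tile && x >= m_current_tile->x0 && x < m_current_tile->x0 + kTileSize &&
           y >= m_current_tile->y0 && y < m_current_tile->y0 + kTileSize;
  }
  static void saveTile(const ImageTile& tile) {
    // Stand-in for writing the tile back to the check image on disk.
    std::cout << "saving tile at (" << tile.x0 << ", " << tile.y0 << ")\n";
  }

  std::mutex m_mutex;
  std::shared_ptr<ImageTile> m_current_tile;
};

int main() {
  WriteableBufferedImage img;
  img.setValue(1, 1, 3.0);  // creates the first tile
  img.setValue(9, 9, 5.0);  // crosses a tile boundary, so the old tile is saved first
}
```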
From the timestamps I see that it was updating/creating some model check images when that happened. So that's likely where it came from.
Initially I thought it was a hiccup, also since it happened only after about 24h.
Note that I locally merged feature/inplace_dft into develop to have the smallest RAM footprint. But that should not be important here.
Another indication that there is still a problem (race condition, file access) is that I had another crash, without a stack trace, yesterday when starting a Disk+Bulge run. It happened much earlier, roughly when the first measurement was starting.
TSan does not report any race condition that could trigger this. It says something about openblas and dlevmar_dif, but that's probably a false positive.
Overnight the Disk+Bulge processing also finished with a crash and the same error line. Here, too, the last thing it did or wanted to do was updating a model image. crash3.txt
What about taking the error message at face value, in the sense that a model image is being updated and for at least one source the FlexibleModelFitting property with the model parameters 'got lost'? Why that should happen I can only speculate: maybe the object was already written to the output, or there is a memory issue?
In the past I have indeed seen glitches where computing a property would fail and the program got "confused", unable to tell a failure apart from a property not being found. What confuses me is the backtrace: that exception cannot be thrown from there.
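To make that ambiguity concrete: if a property framework catches an internal failure during on-demand computation and re-reports it as "property not found", the two situations become indistinguishable for the caller. The sketch below is purely illustrative, using assumed generic names (PropertyHolder, PropertyNotFoundException, a string-valued getProperty); it is not the project's actual property API.

```cpp
// Illustrative sketch of the "failed vs. missing" ambiguity; all names and
// signatures are assumptions for the example.
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

struct PropertyNotFoundException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

class PropertyHolder {
public:
  // Returns the named property, computing it on demand the first time.
  const std::string& getProperty(const std::string& name) {
    auto it = m_properties.find(name);
    if (it == m_properties.end()) {
      try {
        auto value = compute(name);  // may throw for reasons unrelated to "missing"
        it = m_properties.emplace(name, std::move(value)).first;
      } catch (const std::exception&) {
        // If an internal computation failure is swallowed and re-reported like this,
        // the caller cannot tell "the fit failed" apart from "the property was never
        // produced": the ambiguity described above.
        throw PropertyNotFoundException("Property not found: " + name);
      }
    }
    return it->second;
  }

private:
  static std::string compute(const std::string& name) {
    if (name == "FlexibleModelFitting") {
      throw std::runtime_error("model fitting failed");  // simulated internal failure
    }
    return "value of " + name;
  }

  std::map<std::string, std::string> m_properties;
};

int main() {
  PropertyHolder holder;
  try {
    holder.getProperty("FlexibleModelFitting");
  } catch (const PropertyNotFoundException& e) {
    std::cout << e.what() << "\n";  // looks like a missing property, not a failed fit
  }
}
```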
Without check images the Disk+Sersic fitting run is over the hump and already 87% finished! Then we also get a complete diagram of the RAM usage.
Afterwards I'll do a Disk+Bulge run.
The Disk+Bulge run is over the hump as well and is 75% finished after 2.5 days. Tomorrow morning it should be done, and then we can compare the RAM consumption and so on for the two big runs.
The Disk+Bulge run has been chewing on the last object alone for about 12h. No idea how long it has been in the pool already or when it is going to finish...
Disk+Bulge did not finish cleanly; I had to break it off before the last object was done.
But there, too, this problem does not occur without the check images.
Pushed the tarball problem369.tgz with all necessary files to irods.
Interestingly, #381 seems to solve this issue. It ran through, and all the model and residual images do exist.
Halfway through the run I started monitoring the RAM consumption, which looks like this:
The entire run took ~49 hours, so the monitoring is missing the first ~27h.
It did run through, so I'll be closing this one.
Running Disk+Sersic fitting on the entire 19x19k Euclid dataset I get the error below (crash1.txt). At first I thought it was some kind of glitch in the OS or the file system, but it is reproducible (--> crash2.txt).
At that point of the processing, after 24h, the output table already contains 24k extractions with FlexibleModelFitting, so I don't really understand it.