Closed mkuemmel closed 2 years ago
Assigned to @marcschefer and @ayllon. An according email was blocked.
Well, it runs fine on my machine...
Well, it runs fine on my machine... Well, that's kind of what I feared.
While being on a telecon, I ran it twice with v0.16. It finished both times. The first time there was the message at the very end:
###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
Not sure whether this is helpful.
Okay, this happens on the test machine eucgu94@SDC-DE. Alejandro has an account on this system.
Maybe these problems happen on a LODEEN-3.0 VM as well.
Hi @mkuemmel,
A random thought: it looks like your script is running on a PBS system. Could it be possible that you hit a default limit (time, memory) of the submitted job? That could explain the strange message.
Those PBS settings come from the user. I added a '#' to invalidate them, and I am running a simple shell script. Deleting those lines does not change anything.
I tried the example locally on my laptop, and it runs fine as well.
It also ran on a v0.17 conda installation. I am not sure what to say. Guess we'll discuss tomorrow morning.
###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
This seems to be coming from Firefox? Probably not (directly) related.
2022-06-01T12:00:57CEST SourceXtractor ERROR : Failed to process the frame! Property SourceXtractor::DetectionFrameInfo [ 0 ] not found!
I have seen this error being sometimes misleading. i.e. SourceXtractor::DetectionFrameInfo
throws, and the plugin system "gets confused" and report that it can not be found.
It can be that I/O is flaky or something of the sort.
I am going to try and force errors inside that property and see if I can track where does the error go missing.
Some findings of today:
2022-06-02T09:31:32CEST SourceXtractor ERROR : Failed to process the frame! std::bad_cast
and 2022-06-02T11:12:56CEST SourceXtractor ERROR : Failed to process the frame! _Map_base::at
--thread-count 0
)LAPACK error: dgesdd (dbdsdc)/dgesvd (dbdsqr) failed to converge in dlevmar_pseudoinverse() [info=8]
at different places of the detection process;After a long discussion in the telecon today the conclusions were:
Alejandro and Marc are going to make tests on an EDEN-3.0 system to reproduce the problem.
The last tidbit from my side, I ran the data on two other systems:
There are no problems on my laptop and the 'usual' problems above on the EDEN-3.0 docker image. This strengthens that:
I pinged Frederic Leroux this morning, but I guess these people react to issues only...
I managed to reproduce the issue this morning in lodeen. However, using gdb didn't help since there are many exceptions raised as part of the normal flow. They are well handled, but obscure the issue. Ironically I have changed this so the normal flow does not depend on exceptions and the issue went away. I may need to run more times see if it comes back.
- 2022-06-02T11:12:56CEST SourceXtractor ERROR : Failed to process the frame! _Map_base::at
I have accidentally caused this exception when modifying other parts of the code. I think it comes from https://github.com/astrorama/SourceXtractorPlusPlus/blob/87ad730142d1fb1deeb1bb797c9f560d1a240c00/SEFramework/src/lib/Task/TaskProvider.cpp#L33-L53
m_task_factory_registry->getFactory
tries to get the property from a std::map
, which throws that exception if the key is not present.
I suspect perhaps a race condition? That would explain why most exceptions seem to originate from the same area (get a property). DetectionFrameInfo
is not handled by any plugin, so that would explain the exception from std::map::at
when looking for a task.
Depending on how things interleave, you get an error on one place, another, or runs just fine. That would explain why single threaded things work.
That code was already in v0,17?
Is there an explanation why no one fell over that earlier?
That code was already in v0,17? Is there an explanation why no one fell over that earlier?
No idea, maybe something else changed that made this more likely.
Btw, I just got a segfault from DetectionFrameInfoTask.cpp
when trying to set the property.
I am 99% sure now this is a race condition, because I can see notifyObservers
on the stack trace.
Is there an explanation why no one fell over that earlier?
Ah, I understand now. This line is the difference:
That triggers the computation of DetectionFrameInfo
on the thread that is doing the segmentation. By then, the source has gone into the prefetcher / measurement thread pool, so you have a race condition between two threads trying to set the properties of the same source. Probably even the same property.
The most obvious solution is to throw a mutex to it, but I am concerned that the "constness" of the object and the getProperty
methods are misleading.
Contention will not likely be a problem, though, since most of the time only one thread accesses the object, but then we are paying locking/unlocking of a mutex every time we set/get a property.
P.S The shared_ptr
is const, not the object it points to.
I can confirm that #480 does solve this problem.
I ran the problematic dataset with different # of cores on the SDC machine:
# of cores time 1 00:08:57 4 00:08:55 8 00:11:12
On my laptop things are a bit different:
# of cores time_#480 time_buggy 1 00:08:17 00:08:25 2 00:07:28 00:07:00 4 00:06:48 00:06:02
So on my laptop multi-threading makes sense. The bugfix costs performance, on the other hand even when the buggy code ran through its not clear whether the result makes sense.
The bugfix costs performance
That is to be expected, but it is the smallest change that fixes the issue.
Another option is to split the Observable
/Observer
pieces. i.e., right now it is a one-to-many, but we would need a one-to-one, where once the source is passed to the next step, the ownership is lost. For instance, kind of a Pipeline
interface.
We could avoid a mutex
in this way. I think it should be relatively straight-forward, but there may be ramifications.
There is a data set which keeps dumping away or ending with a strange error such as:
2022-06-01T12:00:57CEST SourceXtractor ERROR : Failed to process the frame! Property SourceXtractor::DetectionFrameInfo [ 0 ] not found!
or:2022-06-01T12:55:38CEST SourceXtractor ERROR : Failed to process the frame! _Map_base::at
Sometimes it just runs though. I tested with v0.17 and v0.18 on an EDEN-3.0 machine.
The data is here: https://deepdip.iap.fr/#folder/62974b6b6c26b9d0d2c5e1d6
The relevant script is: ./test2_works.sh