astrorama / SourceXtractorPlusPlus

SourceXtractor++, the next generation SExtractor
https://astrorama.github.io/SourceXtractorPlusPlus/
GNU Lesser General Public License v3.0
72 stars 9 forks source link

SE++ dumps away with/without strange errors #478

Closed mkuemmel closed 2 years ago

mkuemmel commented 2 years ago

There is a data set which keeps dumping away or ending with a strange error such as: 2022-06-01T12:00:57CEST SourceXtractor ERROR : Failed to process the frame! Property SourceXtractor::DetectionFrameInfo [ 0 ] not found! or: 2022-06-01T12:55:38CEST SourceXtractor ERROR : Failed to process the frame! _Map_base::at

Sometimes it just runs though. I tested with v0.17 and v0.18 on an EDEN-3.0 machine.

The data is here: https://deepdip.iap.fr/#folder/62974b6b6c26b9d0d2c5e1d6

The relevant script is: ./test2_works.sh

mkuemmel commented 2 years ago

Assigned to @marcschefer and @ayllon. An according email was blocked.

marcschefer commented 2 years ago

Well, it runs fine on my machine...

mkuemmel commented 2 years ago

Well, it runs fine on my machine... Well, that's kind of what I feared.

While being on a telecon, I ran it twice with v0.16. It finished both times. The first time there was the message at the very end: ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost

Not sure whether this is helpful.

mkuemmel commented 2 years ago

Okay, this happens on the test machine eucgu94@SDC-DE. Alejandro has an account on this system.

Maybe these problems happen on a LODEEN-3.0 VM as well.

degauden commented 2 years ago

Hi @mkuemmel,

A random thought: it looks like your script is running on a PBS system. Could it be possible that you hit a default limit (time, memory) of the submitted job? That could explain the strange message.

mkuemmel commented 2 years ago

Those PBS settings come from the user. I added a '#' to invalidate them, and I am running a simple shell script. Deleting those lines does not change anything.

degauden commented 2 years ago

I tried the example locally on my laptop, and it runs fine as well.

mkuemmel commented 2 years ago

It also ran on a v0.17 conda installation. I am not sure what to say. Guess we'll discuss tomorrow morning.

ayllon commented 2 years ago

###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost

This seems to be coming from Firefox? Probably not (directly) related.

2022-06-01T12:00:57CEST SourceXtractor ERROR : Failed to process the frame! Property SourceXtractor::DetectionFrameInfo [ 0 ] not found!

I have seen this error being sometimes misleading. i.e. SourceXtractor::DetectionFrameInfo throws, and the plugin system "gets confused" and report that it can not be found.

It can be that I/O is flaky or something of the sort.

I am going to try and force errors inside that property and see if I can track where does the error go missing.

mkuemmel commented 2 years ago

Some findings of today:

After a long discussion in the telecon today the conclusions were:

Alejandro and Marc are going to make tests on an EDEN-3.0 system to reproduce the problem.

mkuemmel commented 2 years ago

The last tidbit from my side, I ran the data on two other systems:

There are no problems on my laptop and the 'usual' problems above on the EDEN-3.0 docker image. This strengthens that:

I pinged Frederic Leroux this morning, but I guess these people react to issues only...

ayllon commented 2 years ago

I managed to reproduce the issue this morning in lodeen. However, using gdb didn't help since there are many exceptions raised as part of the normal flow. They are well handled, but obscure the issue. Ironically I have changed this so the normal flow does not depend on exceptions and the issue went away. I may need to run more times see if it comes back.

ayllon commented 2 years ago
  • 2022-06-02T11:12:56CEST SourceXtractor ERROR : Failed to process the frame! _Map_base::at

I have accidentally caused this exception when modifying other parts of the code. I think it comes from https://github.com/astrorama/SourceXtractorPlusPlus/blob/87ad730142d1fb1deeb1bb797c9f560d1a240c00/SEFramework/src/lib/Task/TaskProvider.cpp#L33-L53

m_task_factory_registry->getFactory tries to get the property from a std::map, which throws that exception if the key is not present.

I suspect perhaps a race condition? That would explain why most exceptions seem to originate from the same area (get a property). DetectionFrameInfo is not handled by any plugin, so that would explain the exception from std::map::at when looking for a task. Depending on how things interleave, you get an error on one place, another, or runs just fine. That would explain why single threaded things work.

mkuemmel commented 2 years ago

That code was already in v0,17?

Is there an explanation why no one fell over that earlier?

ayllon commented 2 years ago

That code was already in v0,17? Is there an explanation why no one fell over that earlier?

No idea, maybe something else changed that made this more likely. Btw, I just got a segfault from DetectionFrameInfoTask.cpp when trying to set the property.

I am 99% sure now this is a race condition, because I can see notifyObservers on the stack trace.

ayllon commented 2 years ago

Is there an explanation why no one fell over that earlier?

Ah, I understand now. This line is the difference:

https://github.com/astrorama/SourceXtractorPlusPlus/blob/eb09a460f4d2702c85c890612713e556f3354d7b/SEImplementation/src/lib/CheckImages/DetectionIdCheckImage.cpp#L34

That triggers the computation of DetectionFrameInfo on the thread that is doing the segmentation. By then, the source has gone into the prefetcher / measurement thread pool, so you have a race condition between two threads trying to set the properties of the same source. Probably even the same property.

The most obvious solution is to throw a mutex to it, but I am concerned that the "constness" of the object and the getProperty methods are misleading.

Contention will not likely be a problem, though, since most of the time only one thread accesses the object, but then we are paying locking/unlocking of a mutex every time we set/get a property.

P.S The shared_ptr is const, not the object it points to.

mkuemmel commented 2 years ago

I can confirm that #480 does solve this problem.

I ran the problematic dataset with different # of cores on the SDC machine:

# of cores   time
  1          00:08:57
  4          00:08:55
  8          00:11:12

On my laptop things are a bit different:

# of cores   time_#480      time_buggy
  1           00:08:17       00:08:25
  2           00:07:28       00:07:00
  4           00:06:48       00:06:02

So on my laptop multi-threading makes sense. The bugfix costs performance, on the other hand even when the buggy code ran through its not clear whether the result makes sense.

ayllon commented 2 years ago

The bugfix costs performance

That is to be expected, but it is the smallest change that fixes the issue.

Another option is to split the Observable/Observer pieces. i.e., right now it is a one-to-many, but we would need a one-to-one, where once the source is passed to the next step, the ownership is lost. For instance, kind of a Pipeline interface.

We could avoid a mutex in this way. I think it should be relatively straight-forward, but there may be ramifications.