Fix CUDD reference count

thiagomael commented 4 years ago

We are consistently facing the following error during long runs of the evolution-aware prototype (6f1f63fc3cd7a328371e785cec6d42e0ad15928a):

cuddGarbageCollect: problem in table 0
    dead count != deleted
    This problem is often due to a missing call to Cudd_Ref
    or to an extra call to Cudd_RecursiveDeref.
    See the CUDD Programmer's Guide for additional details.

thiagomael commented 4 years ago

@brenoafb: one possible course of action is to build CUDD with debugging primitives, as explained in the programmer's guide: http://web.mit.edu/sage/export/tmp/y/usr/share/doc/polybori/cudd/node4.html#SECTION00047000000000000000.

If that alternative seems promising, feel free to open a new issue to address that problem.

thiagomael commented 4 years ago

I did some digging and recalled that this problem first appeared 5 years ago when we were calling Cudd_RecursiveDeref() in the overridden ADD.finalize() method (8743cef3297715719c73ff5b5075062eadd84ace). Some time after that (8743cef3297715719c73ff5b5075062eadd84ace), we commented out the de-referencing call, but the commented code was Cudd_Deref(), instead of the recursive version. This can be misleading since we did not persist our intermediate attempts to correct the issue.

brenoafb commented 4 years ago

@brenoafb: one possible course of action is to build CUDD with debugging primitives, as explained in the programmer's guide: http://web.mit.edu/sage/export/tmp/y/usr/share/doc/polybori/cudd/node4.html#SECTION00047000000000000000.

If that alternative seems promising, feel free to open a new issue to address that problem.

I've tried invoking the Cudd_DebugCheck to diagnose memory problems before attempting the preinitialized constants solution, but the messages weren't actionable for the problem at hand. I think that these debug functions could be useful if we attempt to solve the ref-deref cycle problems in a manner that's isolated from the persistence and evolution functions, maybe by treating the ADD-related packages as a standalone library.

thiagomael commented 4 years ago

@brenoafb: one possible course of action is to build CUDD with debugging primitives, as explained in the programmer's guide: http://web.mit.edu/sage/export/tmp/y/usr/share/doc/polybori/cudd/node4.html#SECTION00047000000000000000. If that alternative seems promising, feel free to open a new issue to address that problem.

I've tried invoking the Cudd_DebugCheck to diagnose memory problems before attempting the preinitialized constants solution, but the messages weren't actionable for the problem at hand. I think that these debug functions could be useful if we attempt to solve the ref-deref cycle problems in a manner that's isolated from the persistence and evolution functions, maybe by treating the ADD-related packages as a standalone library.

Indeed, but the problem is not happening in a controlled setting. What do you think about logging the calls to CUDD functions in a failed ReAna execution, then replaying them in the isolated environment?

brenoafb commented 4 years ago

@brenoafb: one possible course of action is to build CUDD with debugging primitives, as explained in the programmer's guide: http://web.mit.edu/sage/export/tmp/y/usr/share/doc/polybori/cudd/node4.html#SECTION00047000000000000000. If that alternative seems promising, feel free to open a new issue to address that problem.

I've tried invoking the Cudd_DebugCheck to diagnose memory problems before attempting the preinitialized constants solution, but the messages weren't actionable for the problem at hand. I think that these debug functions could be useful if we attempt to solve the ref-deref cycle problems in a manner that's isolated from the persistence and evolution functions, maybe by treating the ADD-related packages as a standalone library.

Indeed, but the problem is not happening in a controlled setting. What do you think about logging the calls to CUDD functions in a failed ReAna execution, then replaying them in the isolated environment?

I just tried executing Cudd_DebugCheck at the end of the feature-family-based analysis code (right before the return statement) and took note of the results across three different builds (master, ddsPersistence, reanaE).

On the master branch, there are no errors whatsoever.

On ddsPersistence, when running without cache (first run), I got "Error: wrong number of total nodes in constants", which seems related to the use of predefined constants for 1 and 0. This error indeed disappears when running without using predefined constants, but, as reported previously, crashes in the cuddAddIteRecur. When running with caches, there are no errors.

I got several errors in the reanaE branch. When running for the first time (no evolution, 0th model), execution finishes normally and the debug check shows no errors. When setting evolution to true and running on the 1st model, I get hundreds of lines of the form

parent is at 0x7fa6783c4d00, id = 13, ref = 1, then = 0x7fa6783c46c0, else = 0x7fa6782ff4c0
parent is at 0x7fa6783d1300, id = 13, ref = 2, then = 0x7fa6783d0520, else = 0x7fa6782ff4c0
parent is at 0x7fa67836c5e0, id = 13, ref = 4, then = 0x7fa6782ff4c0, else = 0x7fa67836c400

and get a SIGSEGV in the Cudd_DebugCheck function. So there seems to be something going wrong way before normal execution breaks around the 13th model, or when reordering is enabled.

thiagomael commented 4 years ago

I changed Expression<T> to also carry the original (unparsed) string along, for debugging purposes. The result is consistently the same: at the 13th evolution of BSN, when performing the division by 2 in the expression (1*Capture*QoSChange*Reconfiguration*Situation+1*Capture*QoSChange*Situation)/(2), CUDD prints the error message and aborts the process.

Since it is a division by 2, maybe there is something particularly broken with CUDD constants...

thiagomael commented 4 years ago

Since the error message mentions too few calls to Cudd_Ref() or too many calls to Cudd_RecursiveDeref(), I tried calling Cudd_Ref() 3 times per ADD creation and never calling Cudd_RecursiveDeref(). Still, the failure remains consistent.

thiagomael commented 4 years ago

I read the CUDD's C++ bindings, and the code there is exactly what I thought it would be (and analogous to what we implemented in JADD): constructor calls Cudd_Ref() and destructor calls Cudd_RecursiveDeref(). I really don't get what is causing our issue...

thiagomael commented 4 years ago

@andrelanna, @brenoafb, @tobiassena: https://add-lib.scce.info/

A new hope, maybe? :)

brenoafb commented 4 years ago

I read the CUDD's C++ bindings, and the code there is exactly what I thought it would be (and analogous to what we implemented in JADD): constructor calls Cudd_Ref() and destructor calls Cudd_RecursiveDeref(). I really don't get what is causing our issue...

Yep, I've also taken a look at this Python library, and memory management is done exactly as one would expect. The only notable aspect is that there is a recursive bit that indicates whether to use Cudd_Deref or Cudd_RecursiveDeref. How we would keep track of whether an ADD is recursive or not is a whole other story.

@andrelanna, @brenoafb, @tobiassena: https://add-lib.scce.info/

A new hope, maybe? :)

Interesting, although their approach seems a bit different than ours. It seems like ADDs are kept track of as a pointer -- the API is just like the C version, still requiring manual refs and derefs (?). I'll take a further look at it.

andrelanna commented 4 years ago

Two days ago I sent an email for Fabio Somenzi (CUDD's developer) and Stefano Quer (one of DDDmp developers staff) reporting our issues reading ADDs. I did not receive any answer yet. Meanwhile I read about the add-lib, suggested by @thiagomael .

@andrelanna, @brenoafb, @tobiassena: https://add-lib.scce.info/

A new hope, maybe? :)

I think it is promising! There is a class named DDSerializer (info.scce.addlib.serializer.DDserializer.java) that seems to implement the DDs persistence.

thiagomael commented 4 years ago

I changed a little bit the way by which variables are constructed when reusing previous analyses (2cb3c3eae6e2ef2df6b8c1d2b3a703cbdd6ae249). Instead of dumping and reading variable ADDs, we only read variable names, then use regular calls to JADD.getVariable(name) to introduce such variables to CUDD.

However, the lack of automated execution is hampering our ability to thoroughly test this proposed solution. Thus, I suggest that we hold this issue open until #18 is resolved.

SPLMC / reana-spl

Fix CUDD reference count #16