Closed John-Boik closed 1 year ago
That's weird... Do you use the options = (limit_stack_depth = 500, ) option for the inference function? Could you try to run inference without this option? It is not always necessary (only for large models). This is the only place where we use Tasks to emulate an infinite stack trace in our code, and it may be causing this issue, but I would say it's a bug in Julia or a buggy interaction between RxInfer and Evolutionary. I don't know how Evolutionary works.
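For readers unfamiliar with the idea of using Tasks to get around stack depth, here is a minimal sketch of the general technique in plain Julia. It is not RxInfer's actual implementation; the function deep_sum and its limit keyword are hypothetical names made up for illustration. The only assumption it relies on is that each Task runs on its own stack, so handing the rest of a deep recursion to a fresh Task resets the depth on the current stack.

```julia
# Hypothetical illustration only: sums n + (n-1) + ... + 1 recursively,
# but never lets a single stack grow deeper than `limit` frames.
function deep_sum(n, depth = 0; limit = 500)
    n == 0 && return 0
    if depth >= limit
        # Continue the recursion inside a new Task, which has its own stack.
        # `fetch` blocks until that Task finishes and returns its result.
        return fetch(@async deep_sum(n, 0; limit = limit))
    end
    return n + deep_sum(n - 1, depth + 1; limit = limit)
end

println(deep_sum(100_000))
```

Note that every Task in the chain stays alive until the innermost one returns, so a very deep computation keeps many stacks allocated at the same time.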
I can see in the stack trace provided that the segfault happens in the limitstack function, so it is definitely related to the options = (limit_stack_depth = 500, ) setting. I guess Evolutionary tries to run several inference procedures in parallel and this may cause problems with async Tasks?
I can also verify that the sigterm fault arises when using Optim minimization (e.g., LBFGS), rather than Evolutionary. It occurs after 700 to 1700 function calls. Good to know that the fault is related to limit_stack_depth. My problem is large, however, and will not run without limit_stack_depth. I use limit_stack_depth=900. Any ideas on preventing the sigterm fault for this situation?
Also, the sigterm happens whether I run Evolutionary using multiple threads (parallelization = :thread) or not.
I'm not sure. We've never encountered this issue before, and generally segfaults should never occur in Julia. We do not do anything "forbidden" in Julia and only use the standard Task object. I would say it's a bug in the Task object. Would it be too difficult to verify whether the issue also occurs in different versions of Julia and on the master branch? That would require you to compile Julia itself, but the process is quite easy: you basically just need to clone the Julia repository and run make.
Looks like it was a memory issue on my end, and a non-graceful exit on the part of Linux. I updated Ubuntu from 20.04 to 23.04 and now the exit is graceful (a clean kill event). I also fixed the memory issue.
As a late addendum after I closed this issue, I have a perhaps helpful tip. In my code, an outer optimization function (e.g., LBFGS) repeatedly calls a subfunction that contains result = RxInfer.inference(...). In order to prevent memory consumption from blowing up (at about 0.1 GB per 10 seconds), I needed to include the following prior to the return of the subfunction:
result = nothing
GC.gc()
The above solved my memory issue.
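To show where those two lines sit, here is a minimal sketch of the pattern in plain Julia. The names fake_inference, objective, and outer_loop are hypothetical; fake_inference is only a stand-in for a call such as RxInfer.inference(...) and simply allocates a large array so that the example runs without any packages.

```julia
# Hypothetical stand-in for the inference call: returns a large result object.
fake_inference(x) = (samples = fill(x, 1_000_000),)

# The subfunction that the outer optimizer calls repeatedly.
function objective(x)
    result = fake_inference(x)
    # Extract the scalar that the optimizer needs before releasing the result.
    value = sum(result.samples) / length(result.samples)
    result = nothing   # drop the only reference to the large result
    GC.gc()            # force a full garbage collection before returning
    return value
end

# Stand-in for the outer optimization loop (e.g., LBFGS in the real code).
function outer_loop(xs)
    best = Inf
    for x in xs
        best = min(best, objective(x))
    end
    return best
end

println(outer_loop([0.5, 0.25, 1.0]))
```

Forcing a full collection on every call costs time, so calling GC.gc() only every few iterations may be a reasonable compromise if the objective is cheap.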
I'm using the genetic algorithm package Evolutionary to help tune/explore some parameters in my RxInfer model. My RxInfer model has rarely caused a segfault/sigterm when run alone, but when called repeatedly by Evolutionary (either Evolutionary.GA or Evolutionary.ES), a sigterm almost always occurs after multiple (about 20) generations. I don't know what is causing the problem, but it does appear to be originating in RxInfer. Can anyone shed light on what the problem might be and/or a possible solution or workaround?
I can't share the code here, but the error message is included below. My formal model is called using the RxInfer.inference() function, the call to which is included in the error message (line 618: fxi at /[path]/RxModel.jl:461, where fxi is my function that contains the call to inference, which occurs on line 461). The first mention of Rocket occurs on line 376. Before that, there are numerous repeating identical or near-identical 15-line sections that start with pthread_cond_wait at....