bd2kccd / r-causal

R Wrapper for Tetrad Library

Memory leak in tetradrunner when resampling #103

Open MichaelVBronstein opened 2 years ago

MichaelVBronstein commented 2 years ago

Thanks for the software. It has been really helpful to us.

I believe there is a memory leak in r-causal's tetradrunner when resampling is selected. Code to reproduce is as follows:

```r
library("rJava")
library("rcausal")
library("stringr")
library("parallel")
library("doParallel")

cluster <- makeCluster(detectCores() - 1)  # convention: leave 1 core for the OS
registerDoParallel(cluster)

# For tetradrunner arguments, see: cmu-phil.github.io/tetrad/manual
for (i in 1:numDAGs) {
  tetradrunner <- tetradrunner(algoId = 'gfci', df = Data, dataType = "mixed",
                               scoreId = "sem-BIC", alpha = 0.01,
                               faithfulnessAssumed = F, maxDegree = 100,
                               verbose = T, maxPathLength = -1,
                               completeRuleSetUsed = T, penaltyDiscount = 1,
                               numberResampling = 1000, percentResampleSize = 90,
                               resamplingEnsemble = 1, addOriginalDataset = T,
                               resamplingWithReplacement = F)
}
```

Once the loop runs past 15 or so iterations (numDAGs = 15+), errors start to appear that I believe are caused by the system going OOM. In that case, tetradrunner will output just a list of the node names. I tried triggering garbage collection on the Java side after each call to tetradrunner with:

```r
gc()
J("java.lang.Runtime")$getRuntime()$gc()
```

This improves but does not fix the issue, which is why I think there is a memory leak. Increasing the heap memory size does not help, nor does running the code serially rather than in parallel. Removing the resampling options fixes the issue, which is why I think the leak is occurring specifically in the resampling code.
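As a stopgap while the leak is being tracked down, one way to keep JVM memory from accumulating across iterations is to run each tetradrunner call in a fresh R subprocess, so every iteration gets its own short-lived JVM. This is only a sketch of that idea using the callr package; the reduced argument list and the choice of what to return from the result object are illustrative assumptions, not the package's prescribed usage:

```r
# Hypothetical workaround sketch: isolate each tetradrunner call in a fresh
# R subprocess via callr, so the child JVM is torn down after every run.
library(callr)

run_one <- function(data) {
  # This function body executes in a clean child R session with its own JVM.
  library(rcausal)
  res <- tetradrunner(algoId = 'gfci', df = data, dataType = "mixed",
                      scoreId = "sem-BIC", alpha = 0.01,
                      numberResampling = 1000, percentResampleSize = 90,
                      resamplingEnsemble = 1, addOriginalDataset = TRUE,
                      resamplingWithReplacement = FALSE)
  res$edges  # return only plain R data, not live Java references
}

results <- vector("list", numDAGs)
for (i in 1:numDAGs) {
  results[[i]] <- callr::r(run_one, args = list(Data))
}
```

This trades subprocess startup cost per iteration for a bounded memory footprint, which may be acceptable given that each resampled GFCI run is already expensive.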

MichaelVBronstein commented 2 years ago

(The leak only seems to occur when using jdk1.8.0_144; jdk1.8.0_322 is fine, I believe.)
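For anyone trying to reproduce this, the JVM version that rJava actually loaded can be checked from within the R session (standard rJava calls, not specific to r-causal):

```r
library(rJava)
.jinit()  # start the JVM if it is not already running
J("java.lang.System")$getProperty("java.version")
```

This is worth confirming, since the JVM rJava picks up may differ from the one on the shell PATH.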