furrer-lab / abn

Bayesian network analysis in R
https://r-bayesian-networks.org/
GNU General Public License v3.0
5 stars 0 forks source link

speed up buildScoreCache #58

Open matteodelucchi opened 10 months ago

matteodelucchi commented 10 months ago

Dear Matteo,

Included is a patch from the latest version to help scale building the score cache for the mle option. I'm using a "Sparse Candidate" type algorithm, so the number of possible parents is normally quite constrained, in the region of 10s. This algorithm also needs to be able to check the scoring on adding a single node, so I've had to make max.parents per node (I'm not sure why it was forbidden before). I've tested this on features running to 1000s and it seems to work quite well and replicates the previous results.

I also have code that scales the hill climbing algorithm to 1000s of variables, in R, but this is missing some of the functionality of the C code, so I won't offer it yet.

Any questions, please get in touch!

Many thanks,

Rónán

On 15 Nov 2023, at 16:57, Delucchi Matteo [xxx] wrote:

Dear Ronan,

Thank you for your interest in our abn package and for taking the time to provide feedback. We currently host the code on our institute’s GitLab server, which can be found at this link: https://git.math.uzh.ch/mdeluc/abn I greatly appreciate your contribution towards improving the scalability of the package. I would happily review your patch and consider incorporating it for the next release. Please don’t hesitate to reach out if you encounter any issues or have further questions. Thank you again for your feedback!

Best regards, Matteo

From: Ronan Subject: abn R package

Dear Mr Delucchi,

I've been experimenting with the abn package and found that scaling up to large numbers of nodes was causing issues with the code setting up the cache structure, specifically in buildScoreCache.mle where banned possibilities are filtered out, was causing runtime to grow perhaps quadratically. I've implemented a fix that means the code can now scale to larger examples and I'm wondering is there a way to incorporate this into the mainline of your package? I haven't seen a github repository, but could send a patch etc.

Many thanks,

Ronan

diff --git a/R/build_score_cache_mle.R b/R/build_score_cache_mle.R
index 6e3e650..5b03b3f 100755
--- a/R/build_score_cache_mle.R
+++ b/R/build_score_cache_mle.R
@@ -377,6 +377,9 @@ buildScoreCache.mle <-

     ############################## Function to create the cache

+    if ( length(max.parents) == 1 ) {
+        max.parents <- rep(max.parents, nvars)
+    }

     if (!is.null(defn.res)) {
         max.parents <- max(apply(defn.res[["node.defn"]], 1, sum))
@@ -392,83 +395,64 @@ buildScoreCache.mle <-
             return(v)
         }

-        node.defn <- matrix(data = as.integer(0), nrow = 1L, ncol = nvars)
-        children <- 1
+        ## Generate all possible bit patterns for n variables, with a maximum of m 1s
+        generateBitPatterns = function(n, m) {
+          z <- rep(0,n)
+          do.call(rbind, lapply(0:m, function(i) t(apply(combn(1:n,i), 2, function(k) {z[k]=1;z}))))
+        }

-        for (j in 1:nvars) {
-            if (j != 1) {
-                node.defn <- rbind(node.defn, matrix(data = as.integer(0),
-                                                     nrow = 1L, ncol = nvars))
-                children <- cbind(children, j)
-            }
-            # node.defn <- rbind(node.defn,matrix(data = 0,nrow = 1,ncol = n))
+        # Function to generate all possible combinations of parents
+        filteredCombinations = function(x, m, bannedParents, retainedParents) {
+          # These are the parents that cannot change
+          fixedParents = bannedParents | retainedParents | (fun.return(x, length(x) + 2) + 1) %% 2
+          # These are the parents that can change
+          parentPossibleChoices = which(fixedParents == 0)
+          numPossibleChoices = length(parentPossibleChoices)
+          numRetainedParents = sum(retainedParents)
+
+          # Generate all possible combinations of parents, taking account of banned, retained and maximum number of parents
+          parentChoices = generateBitPatterns(numPossibleChoices, min(m-numRetainedParents, numPossibleChoices)) == 1
+          output = t(apply(parentChoices, 1, function(pc) {
+            combinedRow = 1L*(retainedParents | fun.return(parentPossibleChoices[pc], length(x) + 2))
+            combinedRow
+          }))
+          output
+        }
+
+        children <- matrix(nrow=1, ncol=0)
+        node.defn.list = list()

+        for (j in 1:nvars) {
           if(is.list(max.parents)){
             stop("ISSUE: `max.parents` as list is not yet implemented further down here. Try with a single numeric value as max.parents instead.")
             if(!is.null(which.nodes)){
               stop("ISSUE: `max.parents` as list in combination with `which.nodes` is not yet implemented further down here. Try with single numeric as max.parents instead.")
             }
-          } else if (is.numeric(max.parents) && length(max.parents)>1){
-            if (length(unique(max.parents)) == 1){
-              max.parents <- unique(max.parents)
-            } else {
-              stop("ISSUE: `max.parents` with node specific values that are not all the same, is not yet implemented further down here.")
-            }
-          }
-
-          if(max.parents == nvars){
-            max.parents <- max.parents-1
-            warning(paste("`max.par` == no. of variables. I set it to (no. of variables - 1)=", max.parents)) #NOTE: This might cause differences to method="bayes"!
           }

-            for (i in 1:(max.parents)) {
-                tmp <- t(combn(x = (nvars - 1), m = i, FUN = fun.return, n = nvars, simplify = TRUE))
-                tmp <- t(apply(X = tmp, MARGIN = 1, FUN = function(x) append(x = x, values = 0, after = j - 1)))
-
-                node.defn <- rbind(node.defn, tmp)
-
-                # children position
-                children <- cbind(children, t(rep(j, length(tmp[, 1]))))
-            }
+            # The parents that are banned and retained for node j
+            bannedParents = dag.banned[j, ]
+            retainedParents = dag.retained[j, ]
+            # All possible parents for node j, which is all nodes except j
+            parentChoice = c(seq.int(from=1, length.out=j-1), seq.int(from=j+1, length.out=nvars-j))
+            # How many parents we are keeping for node j
+            numRetainedParents = sum(retainedParents)
+            # The maximum number of parents for node j
+            m = max.parents[j]
+
+            # Generate all possible combinations of parents for node j
+            tmp <- filteredCombinations(x = parentChoice, m=m, bannedParents=bannedParents, retainedParents=retainedParents)
+            # We need a sparse matrix here to deal with large numbers of variables, otherwise memory usage if very high.
+            tmp2 = Matrix(tmp, sparse = TRUE)
+            node.defn.list[[length(node.defn.list) + 1]] <- tmp2
+            children <- cbind(children, t(rep(j, length(tmp2[, 1]))))
         }

-        # children <- rowSums(node.defn)
+        node.defn = do.call(rbind, node.defn.list)
         colnames(node.defn) <- colnames(data.df)
-        ## Coerce numeric matrix into integer matrix !!!
-        node.defn <- apply(node.defn, c(1, 2), function(x) {
-            (as.integer(x))
-        })
-
         children <- as.integer(children)
         # node.defn_ <- node.defn

-        ## DAG RETAIN/BANNED
-        for (i in 1:nvars) {
-            for (j in 1:nvars) {
-
-                ## DAG RETAIN
-                if (dag.retained[i, j] != 0) {
-                  tmp.indices <- which(children == i & node.defn[, j] == 0)
-
-                  if (length(tmp.indices) != 0) {
-                    node.defn <- node.defn[-tmp.indices, ]
-                    children <- children[-tmp.indices]
-                  }
-                }
-
-                ## DAG BANNED
-                if (dag.banned[i, j] != 0) {
-                  tmp.indices <- which(children == i & node.defn[, j] == 1)
-
-                  if (length(tmp.indices) != 0) {
-                    node.defn <- node.defn[-tmp.indices, ]
-                    children <- children[-tmp.indices]
-                  }
-                }
-
-            }
-        }
-
         mycache <- list(children = as.integer(children), node.defn = (node.defn))

         ###------------------------------###
matteodelucchi commented 8 months ago

Before this is started, check out issue furrer-lab/abn#55 , which might render this unnecessary.