furrer-lab / abn

Bayesian network analysis in R
https://r-bayesian-networks.org/
GNU General Public License v3.0

node specific max.parents not implemented for method = "mle" #55

Open matteodelucchi opened 8 months ago

matteodelucchi commented 8 months ago

Issue description

The maximum number of parents can be set individually per node only with method = "bayes". With method = "mle", passing a per-node list to max.parents is not implemented.

MRE

# Load abn, which provides buildScoreCache()
library(abn)

### Generate data
# Set seed for reproducibility
set.seed(123)

# Number of groups
n_groups <- 5

# Number of observations per group
n_obs_per_group <- 100

# Total number of observations
n_obs <- n_groups * n_obs_per_group

# Simulate group effects
group <- factor(rep(1:n_groups, each = n_obs_per_group))
group_effects <- rnorm(n_groups)

# Simulate variables
G1 <- rnorm(n_obs) + group_effects[group]
B1 <- rbinom(n_obs, 1, plogis(group_effects[group]))
G2 <- 1.5 * B1 + 0.7 * G1 + rnorm(n_obs) + group_effects[group]
B2 <- rbinom(n_obs, 1, plogis(2 * G2 + group_effects[group]))

# Create data frame
data <- data.frame(group = group, G1 = G1, G2 = G2, B1 = factor(B1), B2 = factor(B2))

# Look at data
str(data)
summary(data)

######
# Reproduce issue
######
### method = "mle"
# OK: Build the score cache with 2 parents for each variable
score_cache <- buildScoreCache(data.df = data,
                               data.dists = list(G1 = "gaussian", 
                                                 G2 = "gaussian", 
                                                 B1 = "binomial", 
                                                 B2 = "binomial"),
                               group.var = "group",
                               max.parents = 2,
                               method = "mle")

# BUG: Build the score cache with different number of parents for each variable
score_cache <- buildScoreCache(data.df = data,
                               data.dists = list(G1 = "gaussian", 
                                                 G2 = "gaussian", 
                                                 B1 = "binomial", 
                                                 B2 = "binomial"),
                               group.var = "group",
                               max.parents = list(G1 = 0, G2 = 2, B1 = 0, B2 = 3),
                               method = "mle")

### method = "bayes"
# OK: Build the score cache with different number of parents for each variable
score_cache <- buildScoreCache(data.df = data,
                               data.dists = list(G1 = "gaussian", 
                                                 G2 = "gaussian", 
                                                 B1 = "binomial", 
                                                 B2 = "binomial"),
                               group.var = "group",
                               max.parents = list(G1 = 0, G2 = 2, B1 = 0, B2 = 3),
                               method = "bayes")

matteodelucchi commented 8 months ago

I don't quite understand why the parent combination matrix defn.res is created differently in buildScoreCache.mle() and buildScoreCache.bayes(). The Bayes case uses the C function buildscorecache.c. This looks reasonable, though not very efficient (it iterates twice over the same nested for loops). In the Bayes case, it is buildscorecache.c that handles a different max.parents for each node.
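To make the discussion concrete, here is a minimal base R sketch of how a parent combination matrix with a per-node parent limit could be enumerated. It is not the code used in abn: the function name build_parent_combinations() and its return layout (a children vector plus a 0/1 node.defn matrix) are assumptions chosen for illustration only.

```r
# Hypothetical helper, not part of abn: enumerate candidate parent sets
# for each child node, honouring a per-node limit on the number of parents.
build_parent_combinations <- function(nodes, max_parents) {
  stopifnot(setequal(names(max_parents), nodes))
  rows <- list()
  children <- character(0)
  for (child in nodes) {
    candidates <- setdiff(nodes, child)
    # A node cannot have more parents than there are other nodes
    limit <- min(max_parents[[child]], length(candidates))
    for (k in 0:limit) {
      # k = 0 is the empty parent set; otherwise take all subsets of size k
      parent_sets <- if (k == 0) {
        list(character(0))
      } else {
        combn(candidates, k, simplify = FALSE)
      }
      for (parents in parent_sets) {
        # One row per (child, parent set): 1 marks a parent, 0 otherwise
        rows[[length(rows) + 1]] <- as.integer(nodes %in% parents)
        children <- c(children, child)
      }
    }
  }
  node_defn <- do.call(rbind, rows)
  colnames(node_defn) <- nodes
  list(children = children, node.defn = node_defn)
}

# Same per-node limits as in the MRE above
nodes <- c("G1", "G2", "B1", "B2")
combos <- build_parent_combinations(
  nodes,
  list(G1 = 0, G2 = 2, B1 = 0, B2 = 3)
)
table(combos$children)
```

With these limits, G1 and B1 each get only the empty parent set, G2 gets 7 candidate sets and B2 gets 8, so the sketch produces 17 rows in total.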

Thoughts:

  1. Is the Bayes variant with buildscorecache.c limited by not handling multinomial variables?
  2. If yes, can we easily extend it? E.g. by following the approach in buildScoreCache.mle(), which splits multinomial variables into their levels after the parent combination matrix has been created. This would be a first step towards extending the Bayes framework to multinomials -> make a separate issue/milestone.
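
As a rough illustration of the level-splitting idea in point 2, the sketch below expands the column of a multinomial node in an existing parent combination matrix into one column per level. This is an assumption about what such a step might look like, not the actual logic of buildScoreCache.mle(); the function expand_multinomial_columns(), the node M1 and its levels are hypothetical.

```r
# Hypothetical helper, not part of abn: replace the column of each
# multinomial node by one column per factor level.
expand_multinomial_columns <- function(node_defn, levels_by_node) {
  pieces <- lapply(colnames(node_defn), function(node) {
    lv <- levels_by_node[[node]]
    col <- node_defn[, node, drop = FALSE]
    # Nodes without listed levels (e.g. gaussian, binomial) stay unchanged
    if (is.null(lv)) {
      return(col)
    }
    # Repeat the 0/1 parent indicator once per level of the multinomial node
    out <- col[, rep(1, length(lv)), drop = FALSE]
    colnames(out) <- paste0(node, lv)
    out
  })
  do.call(cbind, pieces)
}

# Toy parent combination matrix: row 1 uses M1 as parent, row 2 uses G1
node_defn <- rbind(c(0, 1), c(1, 0))
colnames(node_defn) <- c("G1", "M1")

expanded <- expand_multinomial_columns(node_defn, list(M1 = c("a", "b", "c")))
colnames(expanded)
```

The point of doing the expansion after the parent sets have been enumerated is that max.parents then counts a multinomial variable as one parent, regardless of how many levels it has.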