log likelihood for partial observed dataset

yfpeng1234 commented 2 weeks ago

Hi, I ran into a problem with DBN evaluation. Suppose I learn a DBN from a training set where all the values are observed, but at test time some of the variables are masked out. For example, we have "a_t_1,b_t_1,c_t_1,a_t_0,b_t_0,c_t_0", and the values of "a_t_1,c_t_0" are masked. If I want to evaluate my DBN on this test set, for example by computing the log-likelihood of the test set, should I use the learned DBN to infer the values of the masked nodes and then compute the full likelihood? Or could you suggest another method?

dkesada commented 2 weeks ago

Hi! I'm afraid I don't quite understand what you mean by "masked" nodes. Do you perhaps mean that you do not know the values of those variables and you want to compute the log-likelihood? If so, then you can use the logLik() function from the bnlearn package with DBNs and datasets from dbnR. That function allows you to compute the log-likelihood of some of the nodes of the network and not the whole network. Here's a reproducible example that computes the log-likelihood of two nodes in a network:

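library(dbnR)
library(bnlearn)

# Split the motor dataset into training and test rows
dt <- dbnR::motor
dt_train <- dt[1:2800]
dt_test <- dt[2801:3000]

# Markovian order 1: the folded data has t_0 and t_1 columns
size <- 2
f_dt_train <- fold_dt(dt_train, size)
f_dt_test <- fold_dt(dt_test, size)

# Learn the DBN structure and fit its parameters on the training data
net <- learn_dbn_struc(dt_train, size, method = "dmmhc", f_dt = f_dt_train)
fit <- fit_dbn_params(net, f_dt_train)

# Log-likelihood of the test data restricted to two nodes
logLik(fit, f_dt_test, nodes = c("pm_t_0", "ambient_t_1"), debug = TRUE)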

The total log-likelihood is 1275.789, and since the log-likelihood is a decomposable score, you can see each node's score independently: 871.62 for pm_t_0 and 404.17 for ambient_t_1. If you do not know the value of some variable in t_0, then predict it first and then compute the likelihood.
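To illustrate what "decomposable" means here, the following is a minimal base R sketch on a toy linear Gaussian network A -> B. The network, its parameters, and all variable names are made up for illustration and have nothing to do with the motor data or with dbnR internals. It checks that the per-node scores add up to the joint log-likelihood.

```r
# Toy linear Gaussian network A -> B (hypothetical parameters)
mu_a <- 1; sd_a <- 2             # A ~ N(1, 2^2)
b0 <- 0.5; b1 <- 1.5; sd_b <- 1  # B | A ~ N(0.5 + 1.5 * A, 1^2)

set.seed(42)
n <- 200
a <- rnorm(n, mu_a, sd_a)
b <- rnorm(n, b0 + b1 * a, sd_b)

# Per-node scores: each node contributes the log-density of its
# local distribution given its parents
ll_a <- sum(dnorm(a, mu_a, sd_a, log = TRUE))
ll_b <- sum(dnorm(b, b0 + b1 * a, sd_b, log = TRUE))

# Joint log-likelihood computed from the equivalent bivariate Gaussian
mu <- c(mu_a, b0 + b1 * mu_a)
sigma <- matrix(c(sd_a^2,      b1 * sd_a^2,
                  b1 * sd_a^2, b1^2 * sd_a^2 + sd_b^2), nrow = 2)
centered <- sweep(cbind(a, b), 2, mu)
quad <- rowSums((centered %*% solve(sigma)) * centered)
ll_joint <- sum(-0.5 * (2 * log(2 * pi) + log(det(sigma)) + quad))

# The node scores sum to the joint score (up to floating-point error)
all.equal(ll_a + ll_b, ll_joint)
```

This is why restricting the score to a subset of nodes, as in the logLik() call above, simply drops the terms of the other nodes.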

yfpeng1234 commented 2 weeks ago

Thanks so much for your reply. This is exactly what I mean. Actually, I was hoping the log-likelihood computation could marginalize out the variables whose values I don't know.

dkesada commented 2 weeks ago

This is more a matter of how bnlearn computes the log-likelihood, but I'd say you can just perform inference on the values you do not know and then calculate the log-likelihood. After all, those inferred values are the most likely values for each variable given the evidence, so the result should be a reasonable approximation to marginalizing those variables. You could use either the mvn_inference() or the predict_dt() function to get those missing values.
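For reference, here is a minimal base R sketch of how the two approaches relate, assuming the fitted network is equivalent to a joint multivariate Gaussian. The mean vector, covariance matrix, test row, and the helper mvn_loglik() are all hypothetical and written only for this illustration; it does not call dbnR or bnlearn. Marginalizing a Gaussian just drops the unobserved entries of the mean and covariance, while the plug-in approach fills the gap with the conditional mean and scores the completed row.

```r
# Hypothetical joint Gaussian over three variables
mu <- c(a = 0, b = 1, c = 2)
sigma <- matrix(c(2.0, 0.6, 0.3,
                  0.6, 1.0, 0.4,
                  0.3, 0.4, 1.5), nrow = 3,
                dimnames = list(names(mu), names(mu)))

# Log-density of a multivariate Gaussian for a single observation
mvn_loglik <- function(x, mu, sigma) {
  d <- x - mu
  log_det <- as.numeric(determinant(sigma)$modulus)
  -0.5 * (length(x) * log(2 * pi) + log_det + sum(d * solve(sigma, d)))
}

# One test row in which "c" is unobserved
x_obs <- c(a = 0.5, b = 1.2)
obs <- names(x_obs)
mis <- "c"

# Option 1: marginalize "c" by dropping its row and column
ll_marginal <- mvn_loglik(x_obs, mu[obs], sigma[obs, obs])

# Option 2: plug in the conditional mean of "c" given the evidence
w <- sigma[mis, obs, drop = FALSE] %*% solve(sigma[obs, obs])
c_hat <- unname(mu[mis]) + as.numeric(w %*% (x_obs - mu[obs]))
cond_var <- sigma[mis, mis] - as.numeric(w %*% sigma[obs, mis, drop = FALSE])
ll_plugin <- mvn_loglik(c(x_obs, c = c_hat), mu, sigma)

# The two differ by the peak log-density of the conditional distribution
ll_plugin - ll_marginal
-0.5 * log(2 * pi * cond_var)
```

In this Gaussian setting the offset does not depend on the observed values, only on the model and on which variables are missing. The plug-in score is therefore the marginal score shifted by a constant for a given model and missingness pattern, but that constant can differ between models or patterns.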

yfpeng1234 commented 2 weeks ago

Sure, this is a really insightful suggestion. I will try it. Thanks again for your help.

dkesada / dbnR

log likelihood for partial observed dataset #30