dkesada / dbnR

Gaussian dynamic Bayesian networks structure learning and inference based on the bnlearn package
GNU General Public License v3.0

Prediction when the target node is unknown #26

Closed: RKSKEKF closed this issue 7 months ago

RKSKEKF commented 7 months ago

Hi, I have a question about prediction.

After fitting a model, we want to predict on new data.

In the general case, we pass in data without the target node, like this:

library(bnlearn)
dtraining.set = learning.test[1:4000, ]
dvalidation.set = learning.test[4001:5000, ]
dvalidation.set_R = learning.test[4001:5000, -5] #remove target node "E"

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
dfitted = bn.fit(dag, dtraining.set)

pred = predict(dfitted, node = "E",
               data = dvalidation.set,
               method = "parents")

pred_R = predict(dfitted, node = "E",
                 data = dvalidation.set_R,
                 method = "parents")

table(dvalidation.set$E, pred)
table(dvalidation.set$E, pred_R)
table(pred, pred_R) #same result

But in this case, it does not work when I remove the target node:

library(dbnR)
size = 3
data(motor)
str(motor)
dt_train <- motor[200:900]
dt_val <- motor[901:1000]
dt_val_R <- motor[901:1000,-8] #remove target node "PM"

# With a DBN
obj <- c("pm_t_0")
net <- learn_dbn_struc(dt_train, size)

f_dt_train <- fold_dt(dt_train, size)
f_dt_val <- fold_dt(dt_val, size)
f_dt_val_R <- fold_dt(dt_val_R, size)

fit <- fit_dbn_params(net, f_dt_train, method = "mle-g")

fit$pm_t_2

res <- suppressWarnings(predict_dt(fit, f_dt_val, obj_nodes = obj))
res_R <- suppressWarnings(predict_dt(fit, f_dt_val_R, obj_nodes = obj)) # does not work

Am I misunderstanding something or using it incorrectly?

dkesada commented 7 months ago

Hi, with DBNs and time series data this works a little differently from bnlearn. With time series data, you usually do not have the objective variable at the "t_0" instant, that is, the future one you want to predict. It is assumed that you do have the values of "pm_t_1" and "pm_t_2" from previous instants, and that you want to predict the next "pm_t_0". In static BNs, you can just remove the objective column because you will never have the values of that objective variable. If you also want the pm variable to be missing in "t_1" and "t_2", then you have to set obj <- c("pm_t_0", "pm_t_1", "pm_t_2"), but I'll assume this is not your case in the following example:

library(data.table)
library(dbnR)

size = 3
data(motor)
str(motor)
dt_train <- motor[200:900]
dt_val <- motor[901:1000]

# With a DBN
obj <- c("pm_t_0")
net <- learn_dbn_struc(dt_train, size)

f_dt_train <- fold_dt(dt_train, size)
f_dt_val <- fold_dt(dt_val, size)
f_dt_val_R <- copy(f_dt_val)
f_dt_val_R[, pm_t_0 := NaN]

In dbnR, you need to provide datasets where every variable in the DBN is present as a column, even if its values are missing. This is because I perform the dataset partition inside the function calls, and I already remove the variables in "t_0" from the dataset when performing the predictions. In the above code, I fold the dataset and then replace the objective column "pm_t_0" with NaNs. Now I'll show that this has no effect on the results, because the objective column is not used during the predictions:

fit <- fit_dbn_params(net, f_dt_train, method = "mle-g")

fit$pm_t_2

res <- suppressWarnings(predict_dt(fit, f_dt_val, obj_nodes = obj, verbose = FALSE))
res_R <- suppressWarnings(predict_dt(fit, f_dt_val_R, obj_nodes = obj, verbose = FALSE))
table(res$pm_t_0, res_R$pm_t_0) # same result

You obtain the same results whether or not the objective variable has values in the dataset. In fact, none of the variables in "t_0" are used for the predictions, because that would introduce information from the future into your predictions, which is look-ahead bias. All in all, you do not have to worry about removing the objective variable from the dataset in dbnR. If you want to predict values in a real-world situation, just input a dataset where the objective column in "t_0" is present but empty.
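To make the "t_0", "t_1", "t_2" naming concrete, here is a minimal base R sketch of folding a toy series and then blanking the objective column. The helper fold_sketch and the toy data are hypothetical and only illustrate the idea. This is not how dbnR implements fold_dt, and the column order it produces may differ from dbnR's.

```r
# Hypothetical helper (NOT dbnR's fold_dt): builds lagged copies of every column.
# Row i of the result describes time (i + size - 1):
#   *_t_0 holds the value at that time, *_t_1 one step earlier, and so on.
fold_sketch <- function(df, size) {
  n <- nrow(df)
  stopifnot(size >= 1, n >= size)
  slices <- lapply(0:(size - 1), function(lag) {
    rows <- (size - lag):(n - lag)
    out <- df[rows, , drop = FALSE]
    names(out) <- paste0(names(df), "_t_", lag)
    rownames(out) <- NULL
    out
  })
  do.call(cbind, slices)
}

# Toy series, made up for illustration
toy <- data.frame(pm = c(10, 11, 12, 13, 14),
                  torque = c(1, 2, 3, 4, 5))

folded <- fold_sketch(toy, size = 3)

# Keep the objective column but blank its values, as in the reply above
masked <- folded
masked$pm_t_0 <- NaN

print(names(folded))
print(masked)
```

The masked table has the same columns as the folded one, so the objective column is still present as the reply requires. Only the "t_0" values of pm are missing, while "pm_t_1" and "pm_t_2" keep the past observations.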

RKSKEKF commented 7 months ago

I tried an experiment by keeping the variables in the dataset but replacing values with zeros, and the results turned out weird. I didn't think of using NaN, haha. Thanks so much for your answer!

dkesada commented 7 months ago

The above code should also work in the same way if you replace f_dt_val_R[, pm_t_0 := NaN] with f_dt_val_R[, pm_t_0 := 0], but you need to load the data.table library for that code to work properly. Otherwise you can run into unexpected behaviour, because R treats the objects as data.frames while dbnR uses data.tables underneath. Anyway, I'm glad that helped!
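As a small illustration of why the earlier example calls copy() before using :=, here is a sketch with a toy data.table. It assumes the data.table package is installed, and the values are made up. With data.table attached, := updates a column by reference, so a plain assignment would leave both names pointing at the same table.

```r
library(data.table)

# Toy folded table, made up for illustration
folded <- data.table(pm_t_0 = c(12, 13, 14),
                     pm_t_1 = c(11, 12, 13))

# copy() creates an independent table, so masking it leaves `folded` untouched
masked <- copy(folded)
masked[, pm_t_0 := NaN]
print(folded$pm_t_0)   # still the original values at this point

# Plain assignment does not copy: `alias` and `folded` are the same table,
# so := through `alias` also overwrites the column seen through `folded`
alias <- folded
alias[, pm_t_0 := 0]
print(folded$pm_t_0)
```

Whether the placeholder is NaN or 0 should not matter for the predictions according to the reply above. What matters is that the validation table you want to keep is copied before it is modified.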