dkesada / dbnR

Gaussian dynamic Bayesian networks structure learning and inference based on the bnlearn package
GNU General Public License v3.0

Questions about Integration of dbnR with Python, Hidden variables and filtered_fold_dt #17

Closed MLunar closed 1 year ago

MLunar commented 1 year ago

Hello, sorry to bother you, but I have some questions that I urgently need to resolve.

1.About the integration of dbnR with Python: why can't I run the code in `python_integration.ipynb` successfully? It fails with `AttributeError: 'DataFrame' object has no attribute 'iloc'` at In[3], line 22.

2.About hidden variables in DBNs: does dbnR support dynamic Bayesian networks with hidden variables? Or can dbnR fit their parameters using expectation maximization?

3.About filtered_fold_dt: I'm confused about how to deal with a dataset consisting of multiple different time series. You mentioned that the filtered_fold_dt function can be used for this, but when I tried it, it seemed to have no effect; the fitting result was still the same.

dkesada commented 1 year ago

Hi! No worries, I'll answer each of your questions in order:

  1. Thank you for the heads up! It seems that rpy2 has changed its automatic DataFrame conversion to Python, and now it has to be done explicitly. I have fixed the notebook by adding the line `motor = robj.conversion.rpy2py(motor)` after reading the data. You can find the fixed version in the devel branch (82a3778); it works fine for me now.

  2. dbnR does not support missing data in the dataset, and it does not implement the expectation maximization algorithm, nor its structural version. As an alternative, given that this is an "external" part of learning a DBN, you could run the iterations yourself: create the hidden variables in the dataset, assign them initial random values, fit a DBN to that now-complete data, and then iterate by regenerating the hidden values with either particle filtering or exact inference until convergence. I would like to add the EM algorithm to dbnR at some point, but I'm afraid that right now I do not have the time to do so.
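To illustrate the shape of that external loop, here is a minimal, self-contained Python sketch. Note that `em_style_impute` is a hypothetical toy, not a dbnR function: it "fits" column means instead of a DBN and uses them as the inference step, just to show the fill / fit / re-infer / check-convergence structure.

```python
import numpy as np

def em_style_impute(data, mask, tol=1e-6, max_iter=100):
    """EM-flavoured loop: fill the hidden entries with random values,
    fit a simple model (here: column means, standing in for a DBN),
    re-infer the hidden entries from the model, and repeat until the
    hidden values stop changing.

    data : 2D float array; values at hidden positions are placeholders
    mask : boolean array of the same shape, True where a value is hidden
    """
    rng = np.random.default_rng(42)
    filled = data.astype(float).copy()
    filled[mask] = rng.uniform(size=mask.sum())   # initial random values
    for _ in range(max_iter):
        means = filled.mean(axis=0)               # "fit" step
        new = np.where(mask, means, filled)       # "inference" step
        if np.max(np.abs(new - filled)) < tol:    # convergence check
            break
        filled = new
    return filled
```

In the real setting, the "fit" step would be fitting a DBN to the completed dataset and the "inference" step would be particle filtering or exact inference over the hidden columns.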

  3. For the filtered_fold_dt function to work properly, you need to have an index variable that identifies each different time series inside your dataset. I'll illustrate this with an image and some code:

Let's suppose that we have a dataset consisting of 3 sensors, X, Y and Z, that register some values of a system over time. Suppose that we have 3 different time series recorded from this system, with 3 time instants each. Our dataset could look something like this:

library(dbnR)
library(data.table)

set.seed(42)

# First I create some random data for the example 
dt <- as.data.table(matrix(round(runif(n=27, min=1, max=10)), nrow=9, 
                           dimnames = list(NULL, c("X", "Y", "Z"))))
dt

[image: the dt table printed in the console]

Now I create the index that identifies each different time series:

dt[, id := c(sapply(1:3, rep, 3))]
dt

[image: dt with the new id column]

In the image, I highlighted the 3 different time series with different colors. Now, when we fold the dataset, we do not want the values of different time series to intermingle with each other, because they are independent. If we fold the dataset with the regular fold_dt function, we obtain this:

f_dt_1 <- dbnR::fold_dt(dt, size = 2)
f_dt_1

[image: f_dt_1, the result of the regular fold]

As you can see, the values of the time series are all mixed up on the borders. The id_t_0 and id_t_1 columns contain values from different time series in the same rows, which would mean that the last values of one time series precede the first values of the next one; this is false. The issue is further aggravated with higher size values. However, if we use filtered_fold_dt, we obtain the following result:

f_dt_2 <- dbnR::filtered_fold_dt(dt, size = 2, id_var = "id", clear_id_var = F)
f_dt_2

[image: f_dt_2, the result of the filtered fold]

In this last case, the time series are not mixed in the same rows, and no fake values are introduced. If we take a closer look, we can see that f_dt_2 is just f_dt_1 with the invalid rows deleted automatically. If you train a DBN model with both datasets, the results should differ in most cases, because the data you are feeding the model is different. Depending on the dataset, this difference can be unnoticeable: if you set size = 2 and the dataset has, for example, a dozen lengthy time series, then the only difference between the two datasets will be 11 deleted rows out of the thousands of rows in the data.
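To make the difference concrete outside of R, here is a minimal Python/pandas sketch of what the two folding strategies do. The `fold` and `filtered_fold` helpers below are illustrative stand-ins, not dbnR's actual implementation; I'm only assuming the `*_t_0`/`*_t_1` column-naming convention shown above, with t_0 as the current instant.

```python
import pandas as pd

def fold(df, size=2):
    # Naive fold: pair each instant with the `size - 1` previous ones.
    # X_t_0 is the current instant, X_t_1 the previous one, and so on.
    # Rows at the top that lack a full history are dropped.
    parts = [df.shift(i).add_suffix(f"_t_{i}") for i in range(size)]
    return pd.concat(parts, axis=1).iloc[size - 1:].reset_index(drop=True)

def filtered_fold(df, size=2, id_var="id"):
    # Filtered fold: fold each time series on its own, so that no row
    # mixes instants from two different series, then stack the results.
    # This removes (size - 1) rows per series instead of (size - 1)
    # rows in total.
    groups = [fold(g.drop(columns=id_var), size)
              for _, g in df.groupby(id_var, sort=False)]
    return pd.concat(groups, ignore_index=True)
```

With the 9-row, 3-series example above, `fold` returns 8 rows, 2 of which mix neighbouring series on the borders, while `filtered_fold` returns 6 clean rows.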