dkesada / dbnR

Gaussian dynamic Bayesian networks structure learning and inference based on the bnlearn package
GNU General Public License v3.0

Data set division #13

Closed · 1369959395 closed this issue 2 years ago

1369959395 commented 2 years ago

Hi, I ran into a problem while building the network structure: different data set divisions lead to different learned network structures. Does this have any effect, and what is the best ratio? Thank you very much!

dkesada commented 2 years ago

Hi! It is normal to obtain different network structures with different training datasets. All the structure learning algorithms in dbnR are data-driven, and, like any other machine learning algorithm, the models obtained will vary based on the data used to learn them. However, it is expected that similar training datasets will give models that perform similarly to one another.
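As a quick illustration, here is a rough sketch using the `motor` dataset that ships with dbnR (the split points and the number of time slices are arbitrary). Learning the same structure from two different training splits usually gives similar, but not identical, networks; the learned objects inherit from bnlearn's `bn` class, so bnlearn's `arcs()` should work on them:

```r
library(dbnR)
library(bnlearn)
data(motor)

size <- 2  # number of time slices in the DBN

# Learn the same order-2 DBN structure from two different training splits
net_a <- learn_dbn_struc(motor[1:1500], size)
net_b <- learn_dbn_struc(motor[1501:3000], size)

# Compare the arcs of both learned structures
arcs_a <- apply(arcs(net_a), 1, paste, collapse = " -> ")
arcs_b <- apply(arcs(net_b), 1, paste, collapse = " -> ")

cat("Arcs shared by both structures:", length(intersect(arcs_a, arcs_b)), "\n")
cat("Arcs only in the first one:    ", length(setdiff(arcs_a, arcs_b)), "\n")
cat("Arcs only in the second one:   ", length(setdiff(arcs_b, arcs_a)), "\n")
```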

Depending on your dataset and on the problem you are solving, there are different approaches you can take to split the data into training and test sets:

If you have several independent and identically distributed samples of the same multivariate time series

In this case you can perform k-fold crossvalidation to evaluate the global performance of all the different DBN models learned. This scenario usually appears in time series data when we have a system that runs for a period of time, then stops, is returned to its initial state and starts running again. An example of this type of data is:

[image: 3cyclesv2 — example of time series data with repeated cycles]

If you had data similar to the last example, where the variables in your system follow clear cycles from start to finish, then crossvalidation applies nicely. The function filtered_fold_dt can be used to fold a dataset respecting its cyclic structure if needed.
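A rough sketch of such a crossvalidation could look like the following. The data.table `dt`, the cycle identifier column `cyc_id` and the objective variable `y` are placeholders for your own data, and the exact argument names of `filtered_fold_dt` and the return value of `predict_dt` may differ slightly between dbnR versions:

```r
library(dbnR)
library(data.table)

size <- 2
k <- 5
cycles <- unique(dt$cyc_id)
folds <- split(cycles, cut(seq_along(cycles), breaks = k, labels = FALSE))

mae <- sapply(folds, function(test_cycles) {
  dt_train <- dt[!(cyc_id %in% test_cycles)]
  dt_test  <- dt[cyc_id %in% test_cycles]

  # filtered_fold_dt folds each cycle separately, so rows belonging to
  # different cycles are never merged into the same folded instance
  # (the id column is dropped from the folded data by default)
  f_train <- filtered_fold_dt(dt_train, size, id_var = "cyc_id")
  f_test  <- filtered_fold_dt(dt_test,  size, id_var = "cyc_id")

  net <- learn_dbn_struc(dt_train[, !"cyc_id"], size)
  fit <- fit_dbn_params(net, f_train)

  # Assuming predict_dt returns a data.table with the predicted objective columns
  preds <- suppressWarnings(predict_dt(fit, f_test, obj_nodes = "y_t_0", verbose = FALSE))
  mean(abs(preds$y_t_0 - f_test$y_t_0))  # MAE of the objective variable 'y'
})

mean(mae)  # average performance over the k folds
```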

If you have a single, long multivariate time series

In this case, there are several approaches in the literature on how to partition your data into training and test. The most common one is holdout: you take the last 20% of instances as test (the percentage does not need to be exactly 20%, but it is a typical choice) and use the initial 80% to train a DBN model. The main idea of this split is that you use older data to train a model that predicts newer data, which mirrors what you would have to do in the real world. The biggest drawback of this method is that you train a single DBN, so the evaluation of the model is much weaker than with crossvalidation.
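As a reference, a minimal sketch of this holdout scheme on the `motor` dataset included in dbnR could be the following (the 80/20 ratio, the objective variable and the forecasting horizon are illustrative, and the `forecast_ts` arguments may need adjusting to your dbnR version):

```r
library(dbnR)
data(motor)

size <- 2
n <- nrow(motor)
cut_row <- floor(0.8 * n)

dt_train <- motor[1:cut_row]        # oldest 80% of the series for training
dt_test  <- motor[(cut_row + 1):n]  # newest 20% for testing

net <- learn_dbn_struc(dt_train, size)
f_dt_train <- fold_dt(dt_train, size)
f_dt_test  <- fold_dt(dt_test, size)
fit <- fit_dbn_params(net, f_dt_train)

# Forecast the motor's 'pm' temperature over part of the test horizon
res <- suppressWarnings(forecast_ts(f_dt_test, fit, obj_vars = "pm_t_0",
                                    ini = 1, len = 100))
```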

Another approach is to divide your dataset into several 'independent' fractions of the original time series. Each fraction is then divided into train and test and used to train and evaluate its own DBN model, and the results are averaged afterwards. An example of this would be:

[image: a single time series split into 5 fractions]

Keep in mind that we are working with multivariate time series; I'm only showing one of the variables for the sake of simplicity, but the data has to be split row-wise across all columns. The problem with this approach is that you train each model on a considerably smaller training dataset than the original one.

A more relaxed variant is to perform crossvalidation with the different partitions and treat them as "independent" samples. In the previous example, instead of 5 models you would have 5 subsets of data that you could use for a 5-fold crossvalidation. However, this approach is less rigorous than the other options, because you could end up using data from the future to train your model and predict test data from the past.
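A rough sketch of this fraction-based evaluation, again on the `motor` dataset purely for illustration (the block count, split ratio and objective variable are arbitrary choices):

```r
library(dbnR)
data(motor)

size <- 2
k <- 5  # number of 'independent' fractions

# Split the row indices into k contiguous blocks
blocks <- split(seq_len(nrow(motor)),
                cut(seq_len(nrow(motor)), breaks = k, labels = FALSE))

mae <- sapply(blocks, function(rows) {
  dt_block <- motor[rows]
  cut_row <- floor(0.8 * nrow(dt_block))
  dt_train <- dt_block[1:cut_row]                    # first 80% of the fraction
  dt_test  <- dt_block[(cut_row + 1):nrow(dt_block)] # last 20% of the fraction

  net <- learn_dbn_struc(dt_train, size)
  fit <- fit_dbn_params(net, fold_dt(dt_train, size))

  f_test <- fold_dt(dt_test, size)
  # Assuming predict_dt returns a data.table with the predicted objective columns
  preds <- suppressWarnings(predict_dt(fit, f_test, obj_nodes = "pm_t_0", verbose = FALSE))
  mean(abs(preds$pm_t_0 - f_test$pm_t_0))  # MAE on 'pm' for this fraction
})

mean(mae)  # average error over the k fraction-wise models
```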

I hope I could be of help. Best regards!

1369959395 commented 2 years ago

Thank you for your detailed answer. I will continue to think about it.