bmihaljevic / bnclassify

Learning Discrete Bayesian Network Classifiers from Data

Cross validation in tan_cl classification #26

Open biotech25 opened 7 years ago

biotech25 commented 7 years ago

Hi,

Could you help me one more time?

I want to run cross-validation with the 'tan_cl' function to learn the structure and to get the predicted class label and the probabilities for each label.

As we know, the 'tan_hc' function runs cross-validation: tn <- tan_hc("class", car, k = 10, epsilon = 0, smooth = 1). From this cross-validation output, I can use the 'predict' function to get the predicted class labels and probabilities: predict(tn, bndata, prob = TRUE)

However, the 'tan_cl' function doesn't have a cross-validation option: tn <- tan_cl('class', car, score = 'aic'). If I use the 'predict' function after 'tan_cl', the predicted class labels and probabilities are not from cross-validation. The best way might be to tweak the 'tan_cl' source code to insert cross-validation, like in the 'tan_hc' source code.

Yes, there is a 'cv' function: cv(tn, car, k = 10), but it only returns prediction accuracy, so it doesn't solve my problem. I looked at the source code of 'cv' to see whether I could insert cross-validation there and obtain the predicted class labels and probabilities. However, the source code is not fully available to me; for example, I can't run the 'ensure_multi_list' function that 'cv' calls internally.

Could you help me about this? Thank you,

Sanghoon

ghost commented 7 years ago

Hi Sanghoon.

Doing tnh <- tan_hc, as you correctly say, uses cross-validation to learn its structure, by evaluating the accuracy of different candidate structures. On the other hand, tnc <- tan_cl learns its structure and parameters using the full data set, with an option to use Bayesian parameter estimation, which is more 'robust' than maximum likelihood.

Yet, when you call predict(tnh, bndata) or predict(tnc, bndata) there is no cross-validation involved: you get the probabilities for bndata using a model (tnh or tnc) that has already been learned. If what you want is to learn k models, each from a training subset of your data, and use each one to predict the remaining test data, then neither tan_hc nor tan_cl does the trick. In both cases, by the time you call predict, the model has already been learned.

To achieve that, as you say, you would have to start from the cv function. In particular, I think the function you should look at is `update_assess_fold`.
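Alternatively, a manual fold loop using only bnclassify's exported functions can collect the cross-validated labels and probabilities without touching internals like `update_assess_fold`. The sketch below is an assumption-laden illustration, not the package's internal code: it assumes the exported `tan_cl`, `lp`, and `predict` interfaces and the bundled `car` data set, and simply refits the model on each set of training folds.

```r
# Minimal sketch (not bnclassify internals): manual k-fold cross-validation
# that collects per-instance predicted labels and class probabilities.
library(bnclassify)
data(car)  # example data set shipped with bnclassify

k <- 10
set.seed(1)
# assign each row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(car)))

classes <- levels(car$class)
pred_labels <- factor(rep(NA, nrow(car)), levels = classes)
pred_probs  <- matrix(NA_real_, nrow(car), length(classes),
                      dimnames = list(NULL, classes))

for (i in seq_len(k)) {
  test <- folds == i
  # learn structure and parameters on the training folds only
  fit <- lp(tan_cl('class', car[!test, ], score = 'aic'),
            car[!test, ], smooth = 1)
  # predict the held-out fold
  pred_labels[test]  <- predict(fit, car[test, ])
  pred_probs[test, ] <- predict(fit, car[test, ], prob = TRUE)
}

mean(pred_labels == car$class)  # cross-validated accuracy
```

Every row's label and probability vector then comes from a model that never saw that row during training, which seems to be what you are after.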

Best, Bojan

biotech25 commented 7 years ago

Thank you so much for your explanation. Reading it, I realized I had misunderstood: I thought cross-validation was involved in 'predict(tnh, bndata)', but I was wrong.

As you recommended, I am looking at the source code of the 'cv' and 'update_assess_fold' functions. I found that, even within the 'cv' source code, there are many functions I need to trace. For example, the 'ensure_multi_list' and 'get_common_class' functions are called at the beginning of 'cv'. Also, 'ensure_multi_list' in turn calls the 'is_just' function, so I am tracking down the source code of each function on GitHub. (I am not sure whether I am doing this correctly.)

I will keep working on it and keep you updated. Thank you so much.

Sanghoon