dkesada / dbnR

Gaussian dynamic Bayesian networks structure learning and inference based on the bnlearn package
GNU General Public License v3.0
44 stars 10 forks

DMMHC vs PSOHO vs NATPSOHO and conditional independence measures #21

Closed IonasPi closed 1 year ago

IonasPi commented 1 year ago

Hello, I have a doubt regarding the measures used in dbnR to calculate conditional independence given the data; I am not sure I fully understand this step in dbnR. In the case of DMMHC, as you explained in the paper "Long-term forecasting of multivariate time series in industrial furnaces with dynamic Gaussian Bayesian networks", you use the exact t test for Pearson's correlation coefficient. We can also see this in the result of `net$learning`:

```
$test
[1] "bic-g"

$ntests
[1] 29027

$algo
[1] "rsmax2"

$args
$args$k
[1] 2.038769

$args$alpha
[1] 0.05    ---> significance level??

$optimized
[1] TRUE

$restrict
[1] "si.hiton.pc"

$rstest
[1] "cor"    ---> Pearson??

$maximize
[1] "hc"

$maxscore
[1] "bic-g"

$illegal
NULL
```

First question: if I want to get the conditional probabilities, I just need to get the Pearson correlations and their p-values, right? Is that what dbnR does internally?

In the case of PSOHO and NATPSOHO algorithms there is no filter, dbnR does not calculate the pearson and its t-test or another correlation measure and test to calculate the conditional probability and its significance, right? net$learning is: $whitelist NULL $blacklist NULL $test [1] "none" $ntests [1] 0 $algo [1] "empty" $args list()

This explains why the density of the network with DMMHC is much lower than with PSOHO and natPSOHO. Could you explain why you did not apply a similar approach in the case of PSOHO or natPSOHO? Thanks

dkesada commented 1 year ago

Hi! In the case of the DMMHC algorithm, those results that you see when you inspect `net$learning` are stored by bnlearn underneath when using the rsmax2 algorithm, which is only an interface to any of the underlying algorithms in bnlearn. In the case of the DMMHC algorithm with no additional parameters, it will use the max-min hill climbing algorithm. As you say, you can check the p-values of each node that resulted from the conditional independence tests that bnlearn performs to learn the local structure of each node in the first step of the max-min hill climbing algorithm.

In the case of the PSOHO and natPSOHO algorithms, there are no conditional independence tests, and so there are no p-values or correlation measures. The only measures that these algorithms use are log-likelihood, BIC or AIC, to compare how well one particular graph structure fits some data relative to another. I'll explain in broad terms how these algorithms work:

  1. First, they both initialize a population of particles to random networks. This is important because, while the DMMHC algorithm uses conditional independence tests to learn arcs from local structures, both PSOHO and natPSOHO start from random networks and move through the space of possible networks to find better solutions to the optimization problem.
  2. When the populations of particles are initialized, the algorithms evaluate each graph with a fit measurement (logLik, BIC, AIC) to find the best solution so far.
  3. Afterwards, the algorithms move the particles through the solution space, modifying existing graphs by applying operators that move each particle with reference to the global optimal graph found so far and its local optimal graph. Throughout this process no tests are performed either: if a graph has a better score than another, we assume that it is a better fit.

While there are no p-values for each node, you can check the logLik, AIC or BIC for each node, because all these measures are decomposable and the network score is calculated as the sum over all nodes. Bear in mind, though, that while these measures indicate which structures are "better" and which are "worse", they do not indicate that the best one is any good, just that it is better than all the others. It could still be a pretty bad fit, but not as bad as the rest.
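As an illustration of that decomposability, here is a minimal Python sketch (my own toy example, not dbnR code): for a linear Gaussian network, each node's log-likelihood comes from a least-squares fit on its parents, and the score of the whole graph is just the sum of the per-node scores.

```python
import numpy as np

def node_loglik(data, child, parents):
    """Gaussian log-likelihood of one node given its parents,
    fitted by ordinary least squares (the ML estimate)."""
    y = data[child]
    n = len(y)
    X = np.column_stack([np.ones(n)] + [data[p] for p in parents])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)  # ML variance estimate
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(size=200)
data = {"A": a, "B": b}

# The score of the whole graph A -> B decomposes as the sum of node scores
score = node_loglik(data, "A", []) + node_loglik(data, "B", ["A"])
```

Because the score is a sum over nodes, each node's contribution can be inspected separately, which is what makes per-node logLik/AIC/BIC values available even without p-values.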

In the case of the density of the network, the DMMHC one is likely lower than the other two for two reasons: it has a limit underneath on the number of parents per node, and the conditional independence tests can be stricter than the likelihood scores. The AIC and BIC scores penalize a high number of parameters, but adding or deleting an arc is only 1 parameter in the conditional linear Gaussian formula underneath, so it does not make a huge impact. To obtain sparser networks with these algorithms, we could add a maximum number of parents per node (which I tried when I coded the algorithms, and it was highly inefficient) or use the k parameter in the BIC and AIC formulas to increase the penalization of dense networks.
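That last point can be sketched numerically (toy logLik values of my own, using bnlearn's documented convention score = logLik - k * npar, with k = log(n)/2 for BIC): raising k makes the extra parameters of a denser graph cost more, which can flip which structure wins.

```python
import numpy as np

def penalized_score(loglik, npar, k):
    """bnlearn-style penalized score: logLik - k * npar (higher is better)."""
    return loglik - k * npar

n = 60                      # number of instances
k_bic = np.log(n) / 2       # bnlearn's default k for BIC
k_strict = 2 * k_bic        # a larger k penalizes dense graphs harder

# Toy scores: the denser graph fits better but uses 10 more parameters
sparse = penalized_score(loglik=-500.0, npar=20, k=k_bic)
dense = penalized_score(loglik=-475.0, npar=30, k=k_bic)

sparse_strict = penalized_score(-500.0, 20, k_strict)
dense_strict = penalized_score(-475.0, 30, k_strict)
```

With the default k the denser graph wins; doubling k flips the decision towards the sparser one.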

IonasPi commented 1 year ago

Ok, thanks. But is there no conditional probability between nodes in the PSOHO and natPSOHO solutions? There must be a conditional probability for it to be a Bayesian network, right?

dkesada commented 1 year ago

There is: the conditional independence relationships between the variables are always defined by the graph structure. However, there are different ways of searching for these relationships. In the case of the DMMHC algorithm, it first uses statistical tests and then restricts the space of possible graphs to find the optimum. On the other hand, PSOHO and natPSOHO search the space of possible graph structures via a fit measure based on likelihood scores. Both approaches are valid, and in the literature there are many more approaches to learning the graph structure of a Bayesian network beyond statistical tests.

IonasPi commented 1 year ago

Of course there are many different and valid ways to do it; but it is not just the graph structure (topology), the conditional probabilities have to be obtained in some way. My doubt regarding PSOHO and natPSOHO is how to get these probabilities, not whether there is an arc or not.

dkesada commented 1 year ago

Mhm, then I don't think I understand which probability you want to get. Maybe you are referring to the maximum likelihood estimated parameters of the network? To calculate the likelihood score of a DBN given a DAG and some data, PSOHO and natPSOHO have to calculate the MLE parameters underneath to obtain the likelihood scores. Given that the DBNs use linear Gaussian CPDs, these parameters are equivalent to the parameters of linear regressions, and they are calculated via least squares. You can check how well fitted they are with the residuals; maybe that's what you are looking for. To obtain this, simply fit the parameters of the net obtained via PSOHO or natPSOHO with the function fit_dbn_params using the same dataset you used to obtain the network. That will give you the exact probability distribution of each node and an idea of how well said distribution fits the original data.
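As a sketch of what that fitting amounts to (plain Python with synthetic data, not the dbnR internals): the MLE parameters of a linear Gaussian CPD are just the least-squares coefficients of the node regressed on its parents, and the residuals show the quality of the fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
parent = rng.normal(size=n)
# Synthetic child node: intercept 1.5, coefficient 0.8, noise sd 0.5
child = 1.5 + 0.8 * parent + rng.normal(scale=0.5, size=n)

# MLE of the linear Gaussian CPD: ordinary least squares on the parents
X = np.column_stack([np.ones(n), parent])
beta, *_ = np.linalg.lstsq(X, child, rcond=None)

# The residuals tell you how well the fitted distribution matches the data
residuals = child - X @ beta
sd = residuals.std()
```

The recovered intercept, coefficient and residual standard deviation approximate the generating values, which is exactly the kind of check the residuals of a fitted network allow.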

IonasPi commented 1 year ago

Ok, thanks. One last question regarding a comparison between DMMHC, PSOHO and natPSOHO. Depending on the dataset, sometimes with PSOHO and natPSOHO I get BIC, AIC and LL values equal to -Inf, with positive values for DMMHC. I understand -Inf (negative infinity) is the lowest possible value of BIC, AIC and LL, is that correct? If so, what does that value mean?

dkesada commented 1 year ago

If you are getting -Inf values for the BIC, AIC and LL, then that indicates some kind of issue with either the data you have or the networks obtained. As you say, it is the lowest possible value for the scores, but it usually indicates some kind of issue rather than the absolute worst network found so far, because even that network should have some kind of score. The score() function is part of bnlearn, and the only ways I've found in which it returns -Inf when calculating the scores are when the original dataset has a constant column or when we provide a dataset with too few instances. I have never seen a case where, with the same dataset, PSOHO and natPSOHO return a graph that later on scores to -Inf and the DMMHC algorithm returns a graph that has some score, so I cannot help you more in that regard without a reproducible example. What I find particularly weird is that the PSOHO and natPSOHO algorithms already calculate the score underneath and maximize it, so returning a graph that scores to -Inf means that none of the graphs they evaluated scored higher than that, which also points to some kind of issue with the dataset.
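A small Python sketch of how a Gaussian node score can degenerate (my own guard logic, purely illustrative, not bnlearn's actual code): a constant column alongside an intercept makes the design matrix singular, and having fewer instances than parameters leaves no residual variance, so a scorer can only flag such fits with -Inf.

```python
import numpy as np

def node_score(y, X):
    """Gaussian node log-likelihood, returning -inf for degenerate cases:
    a singular design matrix (e.g. a constant parent column plus an
    intercept) or fewer instances than parameters."""
    n, p = X.shape
    if n <= p or np.linalg.matrix_rank(X) < p:
        return -np.inf
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)
    if sigma2 <= 0:
        return -np.inf
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

rng = np.random.default_rng(3)
n = 10
y = rng.normal(size=n)

ok = node_score(y, np.column_stack([np.ones(n), rng.normal(size=n)]))
# Intercept + a constant parent column -> singular design -> -inf
singular = node_score(y, np.column_stack([np.ones(n), np.full(n, 5.0)]))
# More parameters than instances -> -inf
too_few = node_score(y, np.column_stack([np.ones(n)] +
                                        [rng.normal(size=n) for _ in range(12)]))
```

This mirrors the two failure modes mentioned above: constant columns and datasets with too few instances relative to the number of columns.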

IonasPi commented 1 year ago

It has 60 instances and 80 columns. I deleted some columns where the value 0 was repeated frequently, reducing from 80 to 77 columns, but the AIC, BIC and LL are still -Inf for PSOHO and natPSOHO, with a positive value for DMMHC. In all three cases I get a graph. There is no constant column. How many instances are enough? Is 60 ok? Is it a ratio of instances to variables? Is this a problem only for the PSOHO and natPSOHO algorithms, in your experience? I have checked that once the variables are reduced to 72 I do get actual values of BIC, AIC and LL with PSOHO and natPSOHO, so I do not know whether there is an absolute minimum number of instances for those algorithms, a constant ratio between instances and variables, or some other function; in any case, with my dataset I have to reduce from 80 to 72 variables to get finite BIC, AIC and LL. Also, I was interpreting that the lower the AIC, BIC and LL the better, but you are saying the opposite? Thanks for your comments

dkesada commented 1 year ago

From the numbers you are giving me, it's indeed an issue of too few instances and too many columns. 80 columns get transformed into 80*size columns when learning a DBN, and 60 instances is already fewer than the number of original columns. I've done some experiments generating synthetic networks with the generate_random_network_exp() function, and for size = 3 and 80 variables I do not get anything other than -Inf unless I provide more than 98 instances.

Normally you would minimize the AIC and BIC, but the bnlearn implementation is rescaled by a factor of -2 (you can see this in this page of the bnlearn documentation or by using ?bnlearn::score inside R, although this information is a little hidden in the note at the bottom), so it is maximized instead. You can try this behaviour yourself with the aforementioned function to generate synthetic networks. The function returns a DBN and the dataset that generated it, so you can try different networks and see that their scores with said dataset are lower than that of the real network.
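The sign convention can be sketched in Python (toy logLik values of my own): the textbook BIC, -2*logLik + npar*log(n), is minimized, while the rescaled form is maximized, and the same model wins under both.

```python
import numpy as np

def bic_textbook(loglik, npar, n):
    """Classic BIC: -2*logLik + npar*log(n), smaller is better."""
    return -2 * loglik + npar * np.log(n)

def bic_rescaled(loglik, npar, n):
    """Rescaled form (-1/2 times the classic BIC): larger is better."""
    return loglik - npar * np.log(n) / 2

n = 100
model_a = (-120.0, 5)   # (logLik, npar)
model_b = (-150.0, 4)

# The same model wins under both conventions, in opposite directions
best_textbook = min([model_a, model_b], key=lambda m: bic_textbook(*m, n))
best_rescaled = max([model_a, model_b], key=lambda m: bic_rescaled(*m, n))
```

Since the rescaled score is exactly -1/2 times the classic one, minimizing one is the same as maximizing the other; only the direction of comparison changes.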

IonasPi commented 1 year ago

Ok, thanks. I have another question. The DMMHC algorithm first performs a local search of the structure around each node. This step is done by finding the Markov blanket of each node, which is the group of nodes that makes it conditionally independent from the other nodes. To measure conditional independence, it applies the t-test for Pearson's correlation coefficient to find the set of nodes Z that makes a node X independent from all the nodes Y ∈ Y. The statistical significance is set at p-value <= 0.05. Then the local structures of all the nodes are combined, constraining the scoring step of finding the best general structure. The Bayesian information criterion (BIC) is used to score the resulting structures and measure how well a model fits the data.

Ok, but do you think it is necessary to adjust the statistical significance to p-value <= 0.01 for multiple comparisons when you have many variables, for example 80 as in this case? I was reasoning that there is no need for that adjustment because this is a filter over the general approach, not a frequentist approach. In fact, PSOHO and natPSOHO do not have such a filter and they are methodologically correct. When the statistical significance is set at <= 0.05 you already filter the possibilities, so conceptually it is as if you were adjusting the DMMHC method. What do you think? Thanks

dkesada commented 1 year ago

From a practical point of view, I do not think that changing the significance level from 0.05 to 0.01 is going to have much of an impact on the final result of the algorithm. In my experience, these kinds of tests tend to either be rejected with p-values of magnitude 10^-3 or lower, which will be rejected at both 0.05 and 0.01, or be accepted with p-values in the range of [0.1, 0.9], which will also be accepted no matter whether you choose 0.05 or 0.01. You can still try it, though, and see if it has any effect, but I would guess that it will have none whatsoever.

From a theoretical point of view, I would agree with you that it is not necessary to adjust the significance level in this particular case. It is just a parameter: if an algorithm is methodologically correct, as you say, changing one parameter will not suddenly make it poorly designed. It can make the algorithm give poor results (for example, this could happen if you set a significance level of 0.95 that accepts almost anything), but the algorithm would still be correct from the point of view of its definition. Both the 0.05 and the 0.01 values that you propose are valid and widely accepted, so either of the two will be ok. It remains to be seen whether any difference arises from changing from one value to the other, but in my opinion you could use either of them.

IonasPi commented 4 months ago

> Hi! In the case of the DMMHC algorithm, those results that you see when you inspect net$learning are stored by bnlearn underneath when using the rsmax2 algorithm, which is only an interface to any of the underlying algorithms in bnlearn. In the case of the DMMHC algorithm with no additional parameters, it will use the max-min hill climbing algorithm. As you say, you can check the obtained p-values of each node that resulted from the conditional probability tests that bnlearn performs to learn the local structure of each node in the first step of the max-min hill climbing algorithm.

Hi, I am trying to get the Pearson correlation coefficients and their p-values, but I cannot find a way to do it. I consider those coefficients to be the conditional probabilities. This is my code for structure learning:

```
net <- dbnR::learn_dbn_struc(dt_train, size, method = "dmmhc", intra = T)
f_dt_train_dmmhc <- fold_dt(dt_train, size)
```

Then I got the arc strengths to try to get the p-values for the Pearson correlations, but it did not work the way I thought. If the structure learning filters Pearson correlations at p-value < 0.05, then I thought I should get all p-values/arc strengths for criterion = "cor" with values < 0.05, but that is not the case; some of the results are > 0.05. This is the code I applied:

```
arcst_dmmhc <- arc.strength(net, f_dt_train_dmmhc, criterion = "cor")
```

Anyway, I also tried to apply the function ci.test() to get the coefficients and their p-values, but it does not work either:

```
cor <- ci.test(netdmmhc20in, f_dt_train_dmmhc20_in, test = "cor")
cor <- ci.test(dt_train, netdmmhc20in, test = "cor")
cor <- ci.test(netdmmhc20in, test = "cor")
cor <- ci.test(netdmmhc20in, dt_train, test = "cor")
```

I think I found a "manual" one-to-one way to do it:

```
cor <- ci.test(f_dt_train_dmmhc$y_t_7, f_dt_train_dmmhc$x_t_1, test = "cor")
```

But I am not sure it is correct. If it is, then I should be able to get the coefficient matrix and p-value matrix, for example with:

```
cor <- cor(f_dt_train_dmmhc)
```

If you could tell me the way you get the coefficients and their p-values, it would be great.

All the previous approaches assume that the Pearson coefficients are the values of the conditional probabilities in the dmmhc method. However, I found another interpretation of conditional probabilities based on the webpage https://www.bnlearn.com/examples/fit/, in which you get the conditional probabilities once you apply parameter learning:

```
f_dt_train <- fold_dt(dt_train, size)
fit <- dbnR::fit_dbn_params(net, f_dt_train, method = "mle-g")
# or
fit2 <- bn.fit(net, f_dt_train, method = "mle-g")
```

Which interpretation do you consider is correct? and specifically for the DMMHC method?

Thanks!

dkesada commented 3 months ago

Hi!

Mhm, this is more a question about the inner workings of bnlearn, but I'll try to answer you to the best of my knowledge.

First, if you want to obtain the correlation coefficients and their p-values, that is a different thing from the arc.strength() values. You were correct in your assumption: the ci.test() "manual" one-to-one way, as you call it, is probably what you are looking for. These conditional independence tests are performed to check whether there is an arc or not during the structure learning process, and you can use this function to test for conditional independence of two variables given another set of variables:

```
ci.test(x = "A_t_0", y = "B_t_0", z = c("C_t_0", "D_t_0"), data = f_dt, test = "cor")
```

That will give you Pearson's correlation of A_t_0 and B_t_0 given C_t_0 and D_t_0, with the corresponding p-value. If you call the cor() function instead, it will give you the correlation coefficients of simple independence tests between all pairs of variables. Note that you will not get the p-values, nor will you condition on another set z of variables. If you are interested in the p-values, I would say your best bet is to perform these ci.test() calls individually and store the values in a matrix.
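For illustration, here is a plain Python sketch of what a conditional correlation test of this kind computes (my own implementation, not bnlearn's code): residualize x and y on the conditioning set z, take Pearson's correlation of the residuals, and build the exact t statistic with n - |z| - 2 degrees of freedom.

```python
import numpy as np

def partial_corr_t(x, y, z):
    """Partial correlation of x and y given conditioning columns z,
    plus the exact t statistic used to test it (df = n - |z| - 2)."""
    n = len(x)
    Z = np.column_stack([np.ones(n)] + list(z))
    # Residualize x and y on the conditioning set
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    df = n - len(z) - 2
    t = r * np.sqrt(df / (1 - r ** 2))
    return r, t

rng = np.random.default_rng(4)
n = 200
c = rng.normal(size=n)
x = c + rng.normal(size=n)
y = c + rng.normal(size=n)  # x and y are correlated only through c

r_marginal, t_marginal = partial_corr_t(x, y, [])
r_partial, t_partial = partial_corr_t(x, y, [c])
```

The marginal correlation between x and y is strong, but conditioning on c makes the partial correlation shrink towards zero, which is exactly the kind of relationship the structure learning tests probe.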

Regarding your third point, I'm afraid I don't think I understand correctly what you mean. The conditional probabilities of a network are obtained only after you fit its parameters with either bn.fit() for BNs or fit_dbn_params() for DBNs. What you calculate with the conditional independence tests is whether or not there seems to be a conditional independence relationship between two variables, and so you obtain a p-value that allows you to either reject or accept the hypothesis of there being an arc between the two variables. Only after you have established that there is indeed an arc between these two variables can you calculate the conditional probability that defines the relationship between them. The parameters obtained after fitting the network are the product of a maximum likelihood estimation, which in the case of Gaussian networks is akin to ordinary least squares regression, but these parameters are not the same as Pearson's correlation coefficients.
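To make the distinction concrete, a small Python check (illustrative synthetic data of my own): the fitted regression coefficient of a linear Gaussian CPD equals Pearson's r scaled by the ratio of standard deviations, so the two only coincide for standardized variables.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=500)
y = 3.0 * x + rng.normal(size=500)

r = np.corrcoef(x, y)[0, 1]     # Pearson's correlation, bounded in [-1, 1]
slope = np.polyfit(x, y, 1)[0]  # OLS/MLE coefficient of the linear Gaussian CPD

# The two are related by slope = r * sd(y) / sd(x), so they only
# coincide when the variables are standardized to unit variance
```

Here the fitted coefficient is near 3 while the correlation stays below 1, which is why the CPD parameters cannot be read as correlation coefficients.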

I hope that answers some of your questions.