cmu-phil / tetrad

Repository for the Tetrad Project, www.phil.cmu.edu/tetrad.
GNU General Public License v2.0
404 stars 110 forks source link

BPC is missing a critical check #1162

Closed cg09 closed 1 year ago

cg09 commented 5 years ago

Both BPC and FOFC require a preprocessing step that checks to see whether the marginal graphical model on the measured variables is complete--or nearly so--and only runs the latent variable search of BPC or FOFC if it is complete. I have asked Madelyn to look into this problem, but we may have to hand it off.

I fixed this for FOFC--jdramsey

jdramsey commented 5 years ago

@cg09 @mglymour Was this fixed? If so, in what branch? Thanks.

jdramsey commented 4 years ago

@cg09 @mglymour Asking again, was this fixed? I know it was fixed in Madelyn's code, but was that ever pushed to development?

cg09 commented 4 years ago

Only Madelyn would know.

On Thu, Feb 20, 2020 at 10:33 AM Joseph Ramsey notifications@github.com wrote:

@cg09 https://github.com/cg09 @mglymour https://github.com/mglymour Asking again, was this fixed? I know it was fixed in Madelyn's code, but was that ever pushed to development?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162?email_source=notifications&email_token=AD4Y3OIMDD6ISRQAOSVLAS3RD2PD7A5CNFSM4HYRV46KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMOWYFA#issuecomment-589130772, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OKWK3RKSLKQZCA7FJ3RD2PD7ANCNFSM4HYRV46A .

cg09 commented 4 years ago

No, we have not implemented a satisfactory check, although we have sketched a procedure.

On Thu, Feb 20, 2020 at 10:33 AM Joseph Ramsey notifications@github.com wrote:

@cg09 https://github.com/cg09 @mglymour https://github.com/mglymour Asking again, was this fixed? I know it was fixed in Madelyn's code, but was that ever pushed to development?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162?email_source=notifications&email_token=AD4Y3OIMDD6ISRQAOSVLAS3RD2PD7A5CNFSM4HYRV46KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMOWYFA#issuecomment-589130772, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OKWK3RKSLKQZCA7FJ3RD2PD7ANCNFSM4HYRV46A .

jdramsey commented 4 years ago

OK. I shall wait for truth to be divulged.

jdramsey commented 1 year ago

Making a new branch, check_bpc_and_fofc, to start doing these checks.

jdramsey commented 1 year ago

I added the alpha parameter back in to the interface in this branch.

jdramsey commented 1 year ago

@cg09 You know, there is a much better option available in the FOFC code than checking completeness of PC models for clusters--this was an idea Peter had had, which was to check the significance of the clusters. Let me expose that parameter in the interface for you to play with. I don't know if BPC has that parameter or not, I'll have to check, but it would be feasible to add I think. I think I can just copy the code over from FOFC.

cg09 commented 1 year ago

It might be better if we had a test for sets of tetrad equations. I don't think we do. A test of a single common cause model will depend on linearity. It is not clear that the implication of vanishing tetrads is so restricted.

On Sat, Feb 4, 2023 at 4:15 PM Joseph Ramsey @.***> wrote:

@cg09 https://github.com/cg09 You know, there is a much better option available in the FOFC code than checking completeness of PC models for clusters--this was an idea Peter had had, which was to check the significance of the clusters. Let me expose that parameter in the interface for you to play with. I don't know if BPC has that parameter or not, I'll have to check, but it would be feasible to add I think. I think I can just copy the code over from FOFC.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416851706, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OJJOP6EOARMRJ353NTWV3BFPANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

jdramsey commented 1 year ago

I do fear that the significance of a set of tetrad equations may be very hard to make significant...

jdramsey commented 1 year ago

But I agree with the idea...

cg09 commented 1 year ago

Thanks. Which branch?

On Sat, Feb 4, 2023 at 3:40 PM Joseph Ramsey @.***> wrote:

I added the alpha parameter back in to the interface in this branch.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416845529, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OKQPWHDRUFOCFB4BC3WV25DLANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

cg09 commented 1 year ago

Then Peter's proposal would not work very well, I think.

On Sat, Feb 4, 2023 at 4:36 PM Joseph Ramsey @.***> wrote:

I do fear that the significance of a set of tetrad equations may be very hard to make significant...

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416855533, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3ONM2YUCM3QNP3XKYATWV3DWZANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

jdramsey commented 1 year ago

.... I could have sworn BPC checked significance of clusters somewhere but it does not appear to be a thing... hmm... let me see if I can figure out some place to put such a check...

jdramsey commented 1 year ago

Oh I see, for FOFC, significance of clusters was only being printed out at th end. I added a step to check significance of clusters are they being created.

jdramsey commented 1 year ago

... I can't figure out where to add that check in BPC; the code is so complex. Will ponder further. Anyway it's in FOFC now.

cg09 commented 1 year ago

I complained about that many times to no effect. But there is a simple check I use: run a search algorithm on the measured variables in the cluster and see if the result is a complete graph. If not, there is something wrong with the single common unmeasured cause representation.

On Sat, Feb 4, 2023 at 4:43 PM Joseph Ramsey @.***> wrote:

.... I could have sworn BPC checked significance of clusters somewhere but it does not appear to be a thing... hmm... let me see if I can figure out some place to put such a check...

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416856692, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OK5KSRDQVV5HDUK7WDWV3ERBANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

jdramsey commented 1 year ago

it's the right idea but it's not working with any of the algorithm I'm trying. I never get complete models for the clusters I know should be there, not even close in many cases. Which is why I thought calculating the model significance of the cluster might be a better idea--then you'd get a single p-value that you can compare to alpha.

Maybe I should you what the test of model significance is; I think once you saw what was being done you'd agree to it. The idea is, say you think {x1, x2, x3, x4} is a cluster. So add a latent, L->{x1, x2, x3, x4}, form that SEM model, estimate it, and return its p-value.

I think it's a sensible thing to do in any case.

What I did just now was to factor out all of the cluster-significance-related code from FOFC into a separate class, ClusterSignificance. That way, if I could figure out how to do the checks in BPC I could just call that same code. And if a better way of testing cluster or clustering or model significance occurs, that could be programmed up as well.

jdramsey commented 1 year ago

Of course now that those checks are moved to a separate file, it would be relatively straightforward to do a clique test instead to see if it's better. Off the top of my head it seems not, but maybe it might work, if you give it only the variables in the cluster itself and and not all the variables in the whole model.

jdramsey commented 1 year ago

There is a method in BPC called "filterAndOrderClusterings"--that looks like a likely place to put a significance check...

jdramsey commented 1 year ago

OK, there's a variables called 'baseListOfClusterings' in the 'filterAndOrderClusterings' method, and we look inside of that to get 'baseClustering's. A 'baseClustering' consists of 'cluster's. We go through each of these clusters and see if they pass some tests. What I'm doing is adding a new criterion to this check--in addition to all of Ricardo's checks, I also check the p-value of each cluster, and if the p-value is below significance I skip the cluster and don't add it to any new clustierngs to print out.

I am about 50% sure that this strategy is OK given the way the code is written, which is hard to understand. I tried it in the GUI and it seems to be helping.

OK, now in ClusterSignificance I will try to add a clique test of some sort as an alternative option and see if that helps.

jdramsey commented 1 year ago

Well it's not an awful option. Let me clean up the code a bit and put the option in the interface for both algorithms. I don't know if I can get you a version today, as there as some pull requests in the queue to accept before development is OK, but hopefully soon.

cg09 commented 1 year ago

No. you generally do not get a complete graph, but you do almost always get a graph specifying conditional independence relations that are impossible with a single common cause. That should be a caution about introducing such a common cause.

On Sat, Feb 4, 2023 at 5:37 PM Joseph Ramsey @.***> wrote:

it' The idea is, say you think {x1, x2, x3, x4} is a cluster. So add a latent, L->{x1, x2, x3, x4}, form that SEM model, estimate it, and return its p-value.

And what do you do if there are only 3 variables in a cluster? Or only two?

I think it's a sensible thing to do in any case.

What I did just now was to factor out all of the cluster-significance-related code from FOFC into a separate class, ClusterSignificance. That way, if I could figure out how to do the checks in BPC I could just call that same code. And if a better way of testing cluster or clustering or model significance occurs, that could be programmed up as well.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416865519, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OPW5AMV4KANXDYKJRLWV3K3TANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

cg09 commented 1 year ago

I would like to have the option of testing each of the clusters produced by BPC.

On Sat, Feb 4, 2023 at 6:15 PM Joseph Ramsey @.***> wrote:

OK, there's a variables called 'baseListOfClusterings' in the 'filterAndOrderClusterings' method, and we look inside of that to get 'baseClustering's. A 'baseClustering' consists of 'cluster's. We go through each of these clusters and see if they pass some tests. What I'm doing is adding a new criterion to this check--in addition to all of Ricardo's checks, I also check the p-value of each cluster, and if the p-value is below significance I skip the cluster and don't add it to any new clustierngs to print out.

I am about 50% sure that this strategy is OK given the way the code is written, which is hard to understand. I tried it in the GUI and it seems to be helping.

OK, now in ClusterSignificance I will try to add a clique test of some sort as an alternative option and see if that helps.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416873731, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OLCK4FYUZZWEN5O5XTWV3PKBANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

jdramsey commented 1 year ago

Pull request: https://github.com/cmu-phil/tetrad/pull/1486

jdramsey commented 1 year ago

I suppose I could print those out at the end. I don't want to add new GUI features.

cg09 commented 1 year ago

A clique test is less informative than running, say, a grasp search. A grasp search could help identify a proper subset of a cluster that is a clique.when the whole set is not a clique.

On Sat, Feb 4, 2023 at 5:43 PM Joseph Ramsey @.***> wrote:

Of course now that those checks are moved to a separate file, it would be relatively straightforward to do a clique test instead to see if it's better. Off the top of my head it seems not, but maybe it might work, if you give it only the variables in the cluster itself and and not all the variables in the whole model.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416866938, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OKZ3OE3XVY3ST5FH5TWV3LSXANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

cg09 commented 1 year ago

Would it make sense to add a separate function (a new box, or a new menu item in the Estimate box) for testing linear one factor models? What does the FOFC significance test actually test?

On Sat, Feb 4, 2023 at 5:06 PM Joseph Ramsey @.***> wrote:

... I can't figure out where to add that check in BPC; the code is so complex. Will ponder further. Anyway it's in FOFC now.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416860224, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OK6Q6MCWMH47KYOEW3WV3HENANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

jdramsey commented 1 year ago

I disagree about the GRaSP search over the set of variables; I spent time trying to do that and didn't get any cliques.

jdramsey commented 1 year ago

That is, on my problems I didn't get any cliques for latent clusters I knew were there. It was totally uninformative.

jdramsey commented 1 year ago

I think you missed the one message: I'll repeat:

"Maybe I should you what the test of model significance is; I think once you saw what was being done you'd agree to it. The idea is, say you think {x1, x2, x3, x4} is a cluster. So add a latent, L->{x1, x2, x3, x4}, form that SEM model, estimate it, and return its p-value."

cg09 commented 1 year ago

It is a mathematical necessity that when there is a common unmeasured cause of N variables that there are no conditional independencies among the N variables, and the marginal graph on the N variables is adjacency complete. Chance of course may create some conditional independencies, but we cannot defeat chance. So, if you have a 1 factor measurement model and there are conditional independencies among the variables, then chance is at work. But if you find any conditional independencies and want to infer a common cause model, then the presence of marginal conditional independence presents a choice: Either you have had bad luck or the common cause model is false. Of course, if we had error probabilities for the whole sequence of tetrad tests, and likewise for the whole sequence of partial correlation tests, we could go with the alternative with the smaller total error probability. But we do not have such error probabilities.

On Sat, Feb 4, 2023 at 8:51 PM Joseph Ramsey @.***> wrote:

I disagree about the GRaSP search over the set of variables; I spent time trying to do that and didn't get any cliques.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416897923, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OIDVETH44UMHYLK7X3WV4BTBANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

cg09 commented 1 year ago

Yes, I understood that.

On Sat, Feb 4, 2023 at 9:03 PM Joseph Ramsey @.***> wrote:

I think you missed the one message: I'll repeat:

"Maybe I should you what the test of model significance is; I think once you saw what was being done you'd agree to it. The idea is, say you think {x1, x2, x3, x4} is a cluster. So add a latent, L->{x1, x2, x3, x4}, form that SEM model, estimate it, and return its p-value."

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416900220, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OOQUSUL43VU2VXZKXLWV4C53ANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

jdramsey commented 1 year ago

Well, in any case what I think I've done in my current branch for this is to insist that FAS applied just to the measures you think are in the cluster return a clique. You can play with it, tell me what you think. I suppose I could make you a snapshot build. I want to print out the evaluation of each final cluster before doing so and the model p-value if it's a legit model, so I guess after I do that. That will be printed to the log.

jdramsey commented 1 year ago

I mean, I understand why in principle you need to get a clique for each single-latent cluster, and from d-separation you do of course--but from any of our algorithms (GRaSP included) you don't even come close to that ideal if you put all of the variables into the same analysis. Many edges that must be there go missing.

cg09 commented 1 year ago

Because the parameters for the common cause are too small.

On Sun, Feb 5, 2023 at 1:06 AM Joseph Ramsey @.***> wrote:

I mean, I understand why in principle you need to get a clique for each single-latent cluster, and from d-separation you do of course--but from any of our algorithms (GRaSP included) you don't even come close to that ideal if you put all of the variables into the same analysis. Many edges that must be there go missing.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1416931128, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3ONSFXOLWVUZ47ABF6LWV47MTANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

jdramsey commented 1 year ago

Yeah. BTW I'm starting to think about your idea of adding a menu item to check significance of a single-latent model with the measures you choose in the graph... maybe...

jdramsey commented 1 year ago

@cg09 I'm not sure where to put that functionality yet. The 'isSignificant" method requires a dataset, a cluster, and an alpha. So basically you need as input a dataset/covariance matrix and a set of variables in a variables for that data model, and an alpha. That's it.

Maybe it just does as a menu item in the Data Editor. Not sure.

jdramsey commented 1 year ago

Hmm... I think this is a great idea to put somewhere but I think it's going to take more than haphazard thought, so maybe I'll put it on the wish list and not try to get it into the upcoming release. What I can do for the moment is to print p-values of each cluster in the BPC and FOFC output to the log (and maybe whether they form FAS cliques)..

jdramsey commented 1 year ago

OK, I've added a printout of the p-value of each cluster that is output by BPC or FOFC; this goes to the log, so you'll need to turn logging on to see them. A few other things are printing to the log as well, which are a nuisance; I'll see if I can track down who's printing that stuff. But it is usable.

cg09 commented 1 year ago

Great! Thanks.

On Sun, Feb 5, 2023 at 7:36 PM Joseph Ramsey @.***> wrote:

OK, I've added a printout of the p-value of each cluster that is output by BPC or FOFC; this goes to the log, so you'll need to turn logging on to see them. A few other things are printing to the log as well, which are a nuisance; I'll see if I can track down who's printing that stuff. But it is usable.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1418324134, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OME3UBECJ3K43GFKMTWWBBRDANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

cg09 commented 1 year ago

What test do you use/

On Sun, Feb 5, 2023 at 7:04 PM Joseph Ramsey @.***> wrote:

Hmm... I think this is a great idea to put somewhere but I think it's going to take more than haphazard thought, so maybe I'll put it on the wish list and not try to get it into the upcoming release. What I can do for the moment is to print p-values of each cluster in the BPC and FOFC output to the log (and maybe whether they form FAS cliques)..

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1418304888, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OKNORYD7ES6WDV7OFLWWA52HANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

cg09 commented 1 year ago

In the Estimator Box?

On Sun, Feb 5, 2023 at 6:21 PM Joseph Ramsey @.***> wrote:

@cg09 https://github.com/cg09 I'm not sure where to put that functionality yet. The 'isSignificant" method requires a dataset, a cluster, and an alpha. So basically you need as input a dataset/covariance matrix and a set of variables in a variables for that data model, and an alpha. That's it.

Maybe it just does as a menu item in the Data Editor. Not sure.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1162#issuecomment-1418294481, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OLW63UEWYBSVAREZLLWWAYXZANCNFSM4HYRV46A . You are receiving this because you were mentioned.Message ID: @.***>

jdramsey commented 1 year ago

Same test I described earlier. Form the model L->{x1, x2, x2, x4} and estimate it from the data, then report the model p-value.

I'm thinking the estimator box is not the right place, because that assumes as particular SEM PM, which this function does not.

jdramsey commented 1 year ago

@cg09 Bryan says he'll review this code change so there will be at least one other pair of eyes on it. We do have a few more issues we'd like to get into the next release, but I think it should be ready soon.

jdramsey commented 1 year ago

Fixed

https://github.com/cmu-phil/tetrad/pull/1486

Closing.