cmu-phil / tetrad

Repository for the Tetrad Project, www.phil.cmu.edu/tetrad.
GNU General Public License v2.0

Should continuous variables follow Gaussian distribution in the FGES algorithm? #1280

Closed: gl97at closed this issue 3 years ago

gl97at commented 4 years ago

Hello, I would like to use the FGES algorithm. For a dataset with only continuous variables, I wonder whether these variables must follow a Gaussian distribution. If they do not, can the FGES algorithm still work properly?

Also, for a mixed dataset (one that contains both categorical and continuous numerical variables), should all continuous variables in that dataset follow a Gaussian distribution?

Thank you in advance!

jdramsey commented 4 years ago

If you run the simulations, what becomes clear is that, for the linear, Gaussian BIC score, it's much more important that the relationships between the variables be linear than that the distributions of the variables be strictly Gaussian. Have you looked at the scatterplots? Does it look like you have linear relationships?
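As a rough numerical companion to eyeballing scatterplots, one can compare how well a straight line and a low-degree polynomial fit each pair of variables. The following is only a sketch under stated assumptions: the helper names `r_squared` and `linearity_gap`, the choice of a cubic, and the simulated data are all illustrative and are not part of Tetrad.

```python
import numpy as np

def r_squared(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

def linearity_gap(x, y, degree=3):
    """How much R^2 improves when a cubic replaces a straight line."""
    return r_squared(x, y, degree) - r_squared(x, y, 1)

# Simulated (hypothetical) data: one linear pair, one curved pair.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 2000)
y_linear = 1.5 * x + rng.normal(0, 0.5, 2000)
y_curved = x ** 2 + rng.normal(0, 0.5, 2000)

gap_linear = linearity_gap(x, y_linear)
gap_curved = linearity_gap(x, y_curved)
print(f"linear pair gap: {gap_linear:.3f}")
print(f"curved pair gap: {gap_curved:.3f}")
```

A gap near zero suggests a straight line already captures the relationship; a large gap suggests curvature that a linear score cannot represent. Any cutoff for "large" is a judgment call, not something the thread specifies.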

--

Let me be clearer. What happens with the linear, non-Gaussian case is that adjacency false positives and arrowhead false positives go down, so if your measure of what's "wrong" with the output graph is how much false information it contains, then this appears simply to go down. The graphs get sparser with fewer orientation errors. If that's OK, then generally you're good. With the nonlinear case this is not true; you can get additional false positive adjacencies and orientations, so more false information.
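One way a nonlinear relationship can produce a false positive adjacency can be sketched with partial correlations, the quantity that linear, Gaussian methods effectively rely on. In a chain X -> Y -> Z, X and Z are independent given Y, but a linear adjustment for Y may fail to remove the dependence when the links are nonlinear. The helper `partial_corr` and the simulated chains below are assumptions for illustration only; this is not FGES and not Tetrad code.

```python
import numpy as np

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing each on c."""
    design = np.column_stack([np.ones_like(c), c])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)

# Linear chain X -> Y -> Z: adjusting for Y linearly is enough.
y_lin = x + rng.normal(size=n)
z_lin = y_lin + rng.normal(size=n)

# Nonlinear chain X -> Y -> Z: Z still depends on X only through Y,
# but a linear adjustment for Y leaves X and Z correlated.
y_non = x ** 3 + 0.1 * rng.normal(size=n)
z_non = np.cbrt(y_non) + 0.5 * rng.normal(size=n)

pc_linear = partial_corr(x, z_lin, y_lin)
pc_nonlinear = partial_corr(x, z_non, y_non)
print(f"linear chain:    partial corr(X, Z | Y) = {pc_linear:.3f}")
print(f"nonlinear chain: partial corr(X, Z | Y) = {pc_nonlinear:.3f}")
```

In the nonlinear chain the leftover partial correlation is what a linear method would read as evidence for a direct X - Z edge, even though none exists in the generating model.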

gl97at commented 4 years ago

First of all, thank you very much for your help!

Just to be sure that I have understood correctly: it is much more essential to ensure that the relationships between variables are linear (as we then get less false information, i.e. fewer adjacency and arrowhead false positives) than to ensure that the variables follow a Gaussian distribution, right?

As I am new to Tetrad, I have made the scatterplots, but I get a very confusing result, like the following one: [image: Capture] That means that there is no linear relationship between the two variables, right? So, if I use that dataset as input to FGES while my continuous variables are not linearly associated, I am going to get a lot of false positive adjacencies and orientations.

I should mention that I got a bit confused about the need for linearity and/or Gaussian-distributed variables, because this paper: https://www.ncbi.nlm.nih.gov/pubmed/28393106 states that FGES and the BIC score are used for Gaussian variables, while its simulation section states that the variables are linearly associated with independent, identically distributed Gaussian noise. So I wonder which of the two should follow a Gaussian distribution, the variables or the noise.

Thank you in advance!

jdramsey commented 4 years ago

Well, to be fair, the variables in that paper in fact are linearly related with Gaussian distributions. I take it you're interested in the case where those conditions fail?

At least they're very close, as most fMRI data is. It is very slightly non-Gaussian, but not much. If you want to apply a non-Gaussian algorithm, you need to preprocess it in perhaps a nonstandard way. Actually, maybe this paper will help you:

ramsey2014.pdf

gl97at commented 4 years ago

Exactly, I am interested in the case where linearity and Gaussianity fail. (To be honest, it is linearity that possibly fails.) In that case, is FGES an appropriate algorithm, or should I look for another one?

To understand what an ideal dataset for FGES looks like and how the two conditions apply, are the datasets used in the paper https://www.ncbi.nlm.nih.gov/pubmed/28393106 publicly available? If so, where could I find them?

Thank you very much for the paper!

cg09 commented 4 years ago

The distribution in the Million Variables and More paper is generated from linearly related variables with additive Gaussian noise.
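To make "linearly related variables with additive Gaussian noise" concrete, here is a minimal sketch of sampling from a small linear structural equation model. The three-variable chain, the coefficients, and the function name `simulate_linear_gaussian` are assumptions chosen for illustration; this is not the simulation code used for the paper.

```python
import numpy as np

def simulate_linear_gaussian(B, n, rng, noise_sd=1.0):
    """Sample n rows from x = B x + e with independent Gaussian e.

    B[i, j] is the coefficient of the edge j -> i, so B must
    correspond to an acyclic graph.
    """
    p = B.shape[0]
    E = rng.normal(0.0, noise_sd, size=(n, p))
    # Solving x = B x + e gives x = (I - B)^{-1} e; in row form,
    # X = E (I - B)^{-T}.
    return E @ np.linalg.inv(np.eye(p) - B).T

# Hypothetical chain X1 -> X2 -> X3 with coefficients 0.8 and 0.5.
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.5, 0.0]])

rng = np.random.default_rng(2)
data = simulate_linear_gaussian(B, 5000, rng)

# Regression slopes should roughly recover the edge coefficients.
slope_21 = np.cov(data[:, 0], data[:, 1])[0, 1] / np.var(data[:, 0], ddof=1)
slope_32 = np.cov(data[:, 1], data[:, 2])[0, 1] / np.var(data[:, 1], ddof=1)
print(f"estimated X1 -> X2 coefficient: {slope_21:.2f}")
print(f"estimated X2 -> X3 coefficient: {slope_32:.2f}")
```

Note that in such a model the noise terms are Gaussian, and because the variables are linear combinations of the noise terms, the variables come out Gaussian as well.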


jdramsey commented 4 years ago

Clark's right, the data for that paper is linear, Gaussian data; I believe one of us has it on a hard drive still. I could try to get that. I think I misunderstood your question though. I thought you were aware that FGES with the linear, Gaussian BIC score was intended for linear, Gaussian data in principle but were wondering whether it could still be applied in the nonlinear, non-Gaussian case. That's the question I answered; it handles non-Gaussian data pretty well but not so much nonlinear data. I think that's the answer to your question, though. I guess you could further ask what algorithms are correct for nonlinear, non-Gaussian data. Are you interested in that?

gl97at commented 4 years ago

Hello, I am very sorry for the late reply. Thank you very much for your answers.

You made it very clear that FGES with the linear, Gaussian BIC score is intended for linear, Gaussian data. Exactly, I was wondering if FGES could be applied in the nonlinear, non-Gaussian case, too. But, you gave me the answer to that! Exactly, I am interested in (other) algorithms that are correct for nonlinear, non-Gaussian data. Any idea?

PS: However, it would be very helpful if you could find the data for that paper.

jdramsey commented 4 years ago

I have put in a request to the person who might have the data on a hard drive. :)

There are some algorithms that are correct for the nonlinear, non-Gaussian case, mostly ones that take advantage of kernel math. For PC, there is a conditional independence test called KCI (Kernel Conditional Independence), by Kun Zhang, one of a series of such general tests that have been proposed. We have an implementation in Tetrad. It's extraordinarily slow, but if you have a small dataset, it's worth a shot. Kun and a graduate student also came up with a score for GES, for the general case, which I haven't used, but they've gotten good results on that as well, for small problems. So I recommend a dive into the literature for the search terms "general conditional independence test" or "general score GES" to see what shows up. Like I said, generally they're very slow, so far as I know, unless someone's suggested something new in the last year or so that I haven't seen yet.

gl97at commented 4 years ago

Thank you very much in advance for the dataset you are trying to find! Also, thank you very much for your valuable help. I will definitely start searching for the above-mentioned terms. So, in Tetrad, the only appropriate algorithm for the nonlinear, non-Gaussian case is PC with the KCI independence test? You said that Kun Zhang and a graduate student created a score for GES for the general case. Is that provided in Tetrad as well?

jdramsey commented 4 years ago

It is not, unfortunately, and I suspect it would be nontrivial to implement it in Java (as was KCI, which I did!).

Seriously, if you have ideas to add to the literature in this regard, they would be welcome! I don't think a "simple" general test is possible, but you never know; people can be creative when pressed.

There is another test in Tetrad that is "almost" general: CCI (Conditional Correlation Independence). It's much faster, but it does assume an additive model, so it can't be expected to get non-additive cases. But for additive models it can handle the nonlinear, non-Gaussian case. I don't know what terms you're familiar with, so I'll just say that what I mean by the additive case is the case where the function is y = f(x) + eY. A case like y = f(x) * eY wouldn't be expected to be handled correctly, even in principle, though KCI can handle that case.
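The difference between the additive case y = f(x) + eY and the multiplicative case y = f(x) * eY can be seen in a small simulation. In the additive case, the residuals left after regressing y on x carry no information about x; in the multiplicative case, the spread of the residuals changes with x. The sketch below is an illustration only: f(x) = x^2, the noise settings, the cubic regression, and the helper `residual_dependence` are all assumptions, and this is not how CCI is implemented in Tetrad.

```python
import numpy as np

def residual_dependence(x, y, degree=3):
    """Correlation between x and the squared residuals of a
    polynomial regression of y on x (a crude dependence check)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return np.corrcoef(x, residuals ** 2)[0, 1]

rng = np.random.default_rng(3)
n = 5000
x = rng.uniform(1, 3, n)
f_x = x ** 2  # hypothetical choice of f

# Additive model: y = f(x) + eY
y_additive = f_x + rng.normal(0.0, 1.0, n)

# Multiplicative model: y = f(x) * eY, with eY centered at 1
y_multiplicative = f_x * rng.normal(1.0, 0.3, n)

dep_additive = residual_dependence(x, y_additive)
dep_multiplicative = residual_dependence(x, y_multiplicative)
print(f"additive:       corr(x, residual^2) = {dep_additive:.3f}")
print(f"multiplicative: corr(x, residual^2) = {dep_multiplicative:.3f}")
```

A method that assumes additive noise treats the residual as a stand-in for eY, which only makes sense when the residual is independent of x, as in the first model.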

gl97at commented 4 years ago

Thank you so much for your valuable help! I am going to try the CCI Test on my dataset and I hope that it will work correctly.