A weird result when use boss and grasp.

cmu-phil / tetrad

Repository for the Tetrad Project, www.phil.cmu.edu/tetrad.

GNU General Public License v2.0


A weird result when use boss and grasp. #1703

Closed creamiracle closed 1 year ago

creamiracle commented 1 year ago

Hey, I'm using py-tetrad, and I've found a problem that is not caused by the programming language. I'm using use_sem_bic with BOSS on a dataset with 5 discrete columns and 5 continuous columns. I one-hot encode the discrete columns; one of them, called "trans_type", has 5 values. After one-hot encoding, I use data = data.astype({col: "float64" for col in data.columns}) to convert every column to float. Then I run the BOSS search, but it gets stuck. While looking for the reason, I found that when I remove one of the "trans_type" encoded columns, it works, which is very weird.

Specifically:

When I use ['trans_type_A','trans_type_B','trans_type_C','trans_type_D','trans_type_E','trans_type_F'], it gets stuck.
When I use ['trans_type_A','trans_type_B','trans_type_D','trans_type_E','trans_type_F'] (type C removed), it works.

The distribution of type C is:

0.0    13625
1.0       90

Another type, E, has nearly the same distribution:

0.0    13620
1.0       95

But with type E in the dataset it works, while with type C in the dataset it gets stuck.

My code is like this:

import jpype
import jpype.imports

# Start the JVM with the Tetrad jar on the classpath.
try:
    jpype.startJVM(classpath=["resources/tetrad-current.jar"])
except OSError:
    print("JVM already started")

import pandas as pd
import tools.TetradSearch as ts  # aliased so the module is not shadowed below

data = pd.read_csv("resources/test.csv")
# Cast every column (including the one-hot indicators) to float64.
data = data.astype({col: "float64" for col in data.columns})

search = ts.TetradSearch(data)
search.set_verbose(False)
search.use_sem_bic()
search.run_boss()

# Convert the result to a causal-learn graph and print it.
G = search.get_causal_learn()
print(G.get_nodes())
print(G.get_graph_edges())
print(G)

BTW, all the code has been updated to the newest version, so I don't think it's a version issue.

I'd really like to know why this happens with BOSS and GRaSP.

Any ideas? Thanks.

cg09 commented 1 year ago

Instead of one-hot coding, try digitizing your categorical variables, which you can do in a Tetrad Data box.


creamiracle commented 1 year ago


Thanks for replying, but when I load the data in the Tetrad GUI, I can't find a way to digitize it. Where is it? And what does it use to digitize: something like a label encoder or a target encoder?

cg09 commented 1 year ago

Sorry, I was tired and unclear. In a new Data box, there is a menu item "Convert numerical discrete to continuous". Try that. [image: Screenshot 2023-10-24 at 3.29.05 AM.png]


creamiracle commented 1 year ago


Thanks a lot, I will try that. Another question: is there a way for Tetrad to output an adjacency matrix? I found this function in causal-learn but not in Tetrad. If it exists, please tell me. Thanks again.

cg09 commented 1 year ago

is there a way that tetrad can output a adjacency_matrix ?

Not that I know of.
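For readers with the same question, here is a minimal sketch of building an adjacency matrix by hand. It assumes the edges are available as strings of the form "X --> Y" or "X --- Y"; the helper name `adjacency_from_edges` and the 0/1 convention (row = tail, column = head, undirected edges marked in both directions) are illustrative choices, not part of Tetrad or py-tetrad.

```python
import numpy as np

def adjacency_from_edges(nodes, edges):
    """Build a 0/1 adjacency matrix from edge strings (hypothetical helper).

    adj[i, j] == 1 means there is an edge from nodes[i] into nodes[j];
    an undirected edge sets both adj[i, j] and adj[j, i].
    """
    index = {name: i for i, name in enumerate(nodes)}
    adj = np.zeros((len(nodes), len(nodes)), dtype=int)
    for edge in edges:
        left, mark, right = edge.split()
        i, j = index[left], index[right]
        if mark == "-->":
            adj[i, j] = 1
        elif mark == "<--":
            adj[j, i] = 1
        elif mark == "---":
            adj[i, j] = adj[j, i] = 1
        else:
            # Other edge types (e.g. bidirected) need their own convention.
            raise ValueError(f"unsupported edge type: {mark}")
    return adj

# Illustrative node names and edges, not output from a real search.
nodes = ["X1", "X2", "X3"]
edges = ["X1 --> X2", "X2 --- X3"]
adj = adjacency_from_edges(nodes, edges)
print(adj)
```

A single 0/1 matrix cannot distinguish every edge type a search may return, which is why the sketch rejects anything it does not recognize rather than guessing.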

On Tue, Oct 24, 2023 at 3:47 AM Lin Qi @.***> wrote:

Sorry, I was tired and unclear. In a new Data box, there is a menu item "Convert numerical discrete to continuous". Try that. [image: Screenshot 2023-10-24 at 3.29.05 AM.png] … <#m-1804964085881779636> On Tue, Oct 24, 2023 at 2:38 AM Lin Qi @.*> wrote: Instead of one-hot coding, try digitaling your categorical variables--which you can do in a Tetrad Data box. … <#m7440108050224310007> On Mon, Oct 23, 2023 at 10:34 PM Lin Qi @.> wrote: Hey, I'm using py-tetrad, and I find a problem which is not caused by the programming language. The problem is that, I'm using use_sem_bic with boss, and the dataset is 5 discrete cols and 5 continuous cols. In this situation, I use one-hot encoding to handle discretes, and one of them which called "trans_type" has 5 values. After using one-hot, I use data = data.astype({col: "float64" for col in data.columns}) change every cols into float. Then I use boss to search, but I get stuck as below: [image: image] https://user-images.githubusercontent.com/14272291/277530736-6dda706b-d367-4184-b5ac-1f0fd74df7f9.png https://user-images.githubusercontent.com/14272291/277530736-6dda706b-d367-4184-b5ac-1f0fd74df7f9.png https://user-images.githubusercontent.com/14272291/277530736-6dda706b-d367-4184-b5ac-1f0fd74df7f9.png https://user-images.githubusercontent.com/14272291/277530736-6dda706b-d367-4184-b5ac-1f0fd74df7f9.png I try to find out the reason, then I find that when I remove one the "trans_type" encoding column, it works, very weird. Actually, when - I use ['trans_type_A','trans_type_B','trans_type_C','trans_type_D','trans_type_E,'trans_type_F'], then it stucks. - I use ['trans_type_A','trans_type_B',','trans_type_D','trans_type_E,'trans_type_F'], then it works. (remove type C). The distribution of type C is: 0.0 13625 1.0 90 Let me mention here, another type has a same distribution, which is type E: 0.0 13620 1.0 95 But type E in dataset, it works, type C in dataset, it stucks. 
My code is like this: import jpype.imports try: jpype.startJVM(classpath=[f"resources/tetrad-current.jar"]) except OSError: print("JVM already started") import pandas as pd import tools.TetradSearch as search data = pd.read_csv("resources/test.csv") data = data.astype({col: "float64" for col in data.columns}) search = search.TetradSearch(data) search.set_verbose(False) search.use_sem_bic() search.run_boss() G = search.get_causal_learn() print(G.get_nodes()) print(G.get_graph_edges()) print(G) BTW, All code has been updated to the newest, so I don't think it's a version error. I really wanna know why this happen in Boss and Grasp. Is there any idea about this? Thanks. — Reply to this email directly, view it on GitHub <#1703 https://github.com/cmu-phil/tetrad/issues/1703 <#1703 https://github.com/cmu-phil/tetrad/issues/1703>>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3ON4F7UJHZ7ETRL5BBTYA4SMJAVCNFSM6AAAAAA6M7SLUWVHI2DSMVQWIX3LMV43ASLTON2WKOZRHE2TQMZWGM3TQNA https://github.com/notifications/unsubscribe-auth/AD4Y3ON4F7UJHZ7ETRL5BBTYA4SMJAVCNFSM6AAAAAA6M7SLUWVHI2DSMVQWIX3LMV43ASLTON2WKOZRHE2TQMZWGM3TQNA https://github.com/notifications/unsubscribe-auth/AD4Y3ON4F7UJHZ7ETRL5BBTYA4SMJAVCNFSM6AAAAAA6M7SLUWVHI2DSMVQWIX3LMV43ASLTON2WKOZRHE2TQMZWGM3TQNA https://github.com/notifications/unsubscribe-auth/AD4Y3ON4F7UJHZ7ETRL5BBTYA4SMJAVCNFSM6AAAAAA6M7SLUWVHI2DSMVQWIX3LMV43ASLTON2WKOZRHE2TQMZWGM3TQNA . You are receiving this because you are subscribed to this thread.Message ID: @.> Thanks for replying, but when I load the data in tetrad gui, I don't find a way to digital it, where is it? or what it use for digital? like label encoder or target encoder? 
— Reply to this email directly, view it on GitHub <#1703 (comment) https://github.com/cmu-phil/tetrad/issues/1703#issuecomment-1776614031>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3OJFXVKYZSPDZ7AOZQ3YA5O4RAVCNFSM6AAAAAA6M7SLUWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZWGYYTIMBTGE https://github.com/notifications/unsubscribe-auth/AD4Y3OJFXVKYZSPDZ7AOZQ3YA5O4RAVCNFSM6AAAAAA6M7SLUWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZWGYYTIMBTGE . You are receiving this because you commented.Message ID: @.***>

Really thanks, I will try to do it, and another question, is there a way that tetrad can output a adjacency_matrix ? I find this function in causal-learn but not in tetrad, if it has, please tell me. Thanks again.

— Reply to this email directly, view it on GitHub https://github.com/cmu-phil/tetrad/issues/1703#issuecomment-1776694695, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4Y3ONW2NIKGZ7ZD5KVV4DYA5XB3AVCNFSM6AAAAAA6M7SLUWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZWGY4TINRZGU . You are receiving this because you commented.Message ID: @.***>

jdramsey commented 1 year ago

@jdramsey here, let me give it a shot, and see if I can shed any light. Actually, I wasn't aware that this was called "one-hot encoding," a new term for an old idea. In my world, this is called "making indicator variables." I've been doing this for years, though there is a trick to it if you want to do a causal search treating the system as linear. One issue for scoring or independence testing is that the one-hot encoding (as I understand it from Google) goes too far. For instance (do you agree this is a problem?) if you have

Type:
Cat
Dog
Cat
Dog

What would your one-hot coding do? Would it produce one binary column or two? It could do this:


TypeCat          TypeDog
1                0
0                1
1                0
0                1

This would be a mistake so far as causal search is concerned if you treat the data as linear and continuous, because now TypeDog = 1 - TypeCat, meaning that there will be a singularity in the matrix since you know the columns have to sum to exactly 1 in each row--i.e. there is a linear dependence of one column on the rest. So if you try to invert the covariance matrix over just the variables [TypeDog, TypeCat] a SingularityException will be thrown. (For theory for this, see any decent textbook on linear algebra.)

What you need to do is include just one of these columns, since the other is implied:

TypeCat
1
0
1
0

If your categorical variable has 3 categories, you would include 2 columns, and so on. This is just a general problem where you have columns that induce singularities--you always need to remove enough columns to remove the singularities if you're going to analyze the data as linear, because the algorithms will be trying to invert submatrices of the total covariance matrix, and that will throw SingularityExceptions if you don't.
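The point about redundant indicator columns can be checked numerically. The sketch below uses made-up toy data and pandas' `get_dummies`; it compares the rank of the sample covariance matrix for the full one-hot encoding against the encoding with one indicator dropped (`drop_first=True`). The helper `cov_rank` and the tolerance are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy nominal variable with three categories (illustrative data only).
df = pd.DataFrame({"Type": ["Cat", "Dog", "Bird", "Cat", "Dog", "Bird", "Cat", "Dog"]})

# Full one-hot encoding: one indicator per category, so every row sums to 1.
full = pd.get_dummies(df["Type"], prefix="Type", dtype=float)

# Reduced encoding: drop one indicator so no column is implied by the others.
reduced = pd.get_dummies(df["Type"], prefix="Type", drop_first=True, dtype=float)

def cov_rank(frame, tol=1e-10):
    """Return (number of columns, rank of the sample covariance matrix)."""
    cov = np.cov(frame.to_numpy(), rowvar=False)
    return frame.shape[1], int(np.linalg.matrix_rank(cov, tol=tol))

full_cols, full_rank = cov_rank(full)
reduced_cols, reduced_rank = cov_rank(reduced)
print("full:   ", full_cols, "columns, rank", full_rank)
print("reduced:", reduced_cols, "columns, rank", reduced_rank)
```

When the rank is smaller than the number of columns, the covariance matrix is singular and cannot be inverted in the ordinary way; with one indicator dropped the two numbers agree.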

This is not really just a problem for causal search; it's a problem anywhere matrix inversion is being used to do any type of scoring or testing or whatever, so the two-column approach above will only work in situations where you don't do that. You could do generalized matrix inversion, but this is much slower as it involves more linear algebra. Much better to just remove the columns and do regular matrix inversion.
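To illustrate the generalized-inversion remark, the sketch below takes the covariance of the two redundant indicators from the Cat/Dog example and uses NumPy's Moore-Penrose pseudoinverse (`np.linalg.pinv`). This only shows that a generalized inverse exists where the ordinary one does not; it says nothing about how Tetrad handles the case internally.

```python
import numpy as np

# The redundant indicators [TypeCat, TypeDog] from the example above.
data = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
cov = np.cov(data, rowvar=False)

# The determinant is (numerically) zero, so there is no ordinary inverse.
det = float(np.linalg.det(cov))

# The pseudoinverse is still defined ...
cov_pinv = np.linalg.pinv(cov)

# ... and satisfies the defining identity cov @ pinv @ cov == cov.
identity_holds = bool(np.allclose(cov @ cov_pinv @ cov, cov))
print(det, identity_holds)
```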

Actually, if you just load the data as mixed and then use the mixed-type Degenerate Gaussian score, it will remove the extraneous indicators for you. So that's a strategy. That is, just load the data as mixed-type (i.e., with both continuous and discrete columns, including your categorical variables) and then just do a search using the Degenerate Gaussian score. Another mixed-type score we have in Tetrad is the Conditional Gaussian, which works a different way. You could try both. But for this, you would need to not do the one-hot encoding, load the data as mixed, and then just run the search the usual way.

The fact that the search is taking a long time to fail when the singularity exceptions are being thrown is an issue. I'll look into that. The fact that the search is having trouble, though, is I think for the above reason.
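Until the slow failure is addressed, one possible workaround is to check the data for linearly dependent columns before starting a search. The sketch below is an assumption-laden illustration: `redundant_columns` is a hypothetical helper, not part of py-tetrad, and it greedily flags any column that (after centering) is a linear combination of the columns before it.

```python
import numpy as np
import pandas as pd

def redundant_columns(frame, tol=1e-10):
    """List columns that are linear combinations (plus a constant) of earlier ones.

    Such columns make the covariance matrix singular. Constant columns are
    flagged too, since they have zero variance.
    """
    centered = frame.to_numpy(dtype=float)
    centered = centered - centered.mean(axis=0)
    kept, redundant = [], []
    for i, name in enumerate(frame.columns):
        trial = centered[:, kept + [i]]
        if np.linalg.matrix_rank(trial, tol=tol) == len(kept) + 1:
            kept.append(i)          # column adds new information
        else:
            redundant.append(name)  # column is implied by the kept ones
    return redundant

# Illustrative data: one continuous column plus a full set of indicators.
df = pd.DataFrame({
    "x":         [0.1, 1.2, -0.3, 0.8, 2.0, -1.1],
    "Type_Cat":  [1, 0, 0, 1, 0, 0],
    "Type_Dog":  [0, 1, 0, 0, 1, 0],
    "Type_Bird": [0, 0, 1, 0, 0, 1],
})
flagged = redundant_columns(df)
print(flagged)
```

Which column gets flagged depends on column order; any one indicator from a complete set can be dropped to remove the dependence.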

jdramsey commented 1 year ago

Let me give you the papers for the mixed-type scores we have in Tetrad--you can look at them and see what you think.

Andrews, B., Ramsey, J., & Cooper, G. F. (2018). Scoring Bayesian networks of mixed variables. International Journal of Data Science and Analytics, 6, 3-18.

Andrews, B., Ramsey, J., & Cooper, G. F. (2019, July). Learning high-dimensional directed acyclic graphs with mixed data types. In The 2019 ACM SIGKDD Workshop on Causal Discovery (pp. 4-21). PMLR.

creamiracle commented 1 year ago


Thanks, that really helps a lot; it may be the reason why I can't get a result. But I still have 2 questions about encoding.

1. If I don't use the indicator-variable method and use a label encoder instead, for example [cat, dog, bird] to [1,2,3], will the numbers influence the result? In this situation dog is twice cat (2*cat=dog), just as an instance.

2. If I remove one of the columns after making indicator variables, how do I know the relation between the removed column and the others? For instance, in the example above, if I keep type_cat and type_dog, how do I know the influence of type_bird?

Thanks.

jdramsey commented 1 year ago

Hold on, I got busy for a couple of days. Let me think for a second.

jdramsey commented 1 year ago

But I still have 2 questions about encoding. 1. If I don't use the indicator-variable method and use a label encoder instead, for example [cat, dog, bird] to [1,2,3], will the numbers influence the result? In this situation dog is twice cat (2*cat=dog), just as an instance.

That relationship is not so important; it won't cause singularity exceptions when you invert. Also, if you keep the string categories in Tetrad and use the Conditional Gaussian or Degenerate Gaussian score, all of the details of forming a causal model will be handled for you. The Degenerate Gaussian score, in particular, will make the indicator variables for you, without the redundant ones, and keep track of which columns are related to which variables for you so that you end up with a causal graph over the variables instead of just over the variable categories.
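As a small check of the "won't cause singularity" point, the sketch below label-encodes a nominal column with pandas category codes and inverts the resulting covariance matrix. The data and column names are made up; note that pandas assigns codes in sorted category order (bird=0, cat=1, dog=2), which differs from the [1,2,3] example in the question.

```python
import numpy as np
import pandas as pd

# Illustrative data: one nominal column and one continuous column.
df = pd.DataFrame({
    "animal": ["cat", "dog", "bird", "dog", "cat", "bird"],
    "weight": [4.1, 20.5, 0.3, 18.2, 3.9, 0.4],
})

# Label-encode: a single integer code per category instead of several indicators.
codes = df["animal"].astype("category").cat.codes
encoded = pd.DataFrame({"animal": codes.astype(float), "weight": df["weight"]})

# With one column per variable there is no built-in linear dependence,
# so the covariance matrix can be inverted in the ordinary way.
cov = np.cov(encoded.to_numpy(), rowvar=False)
cov_inv = np.linalg.inv(cov)
print(np.allclose(cov @ cov_inv, np.eye(2)))
```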

2.If I remove one of the columns after making indicator variables, how do I know the relation between this remove col to others, for instance, in the upper example, I keep type_cat and type_dog, then how do I know the type_bird influence?

I think you can break down the problem into two steps. The first step is to estimate the graph of causal relationships over the variables, and the second step is to "assign blame" to particular categories. Let me think about that problem for a bit.

jdramsey commented 1 year ago

By the way, if you have binary data, Clark's solution of treating the data as 0/1 for linear Gaussian scores/tests will always work. Also, if you have ordinal data, converting it to numbers indicating the order will work as well. The whole issue for the above has mainly to do with nominal discrete variables with more than 2 categories.
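The binary and ordinal cases can be sketched in pandas as follows. The level names and data are hypothetical; the only point is that binary values map to 0/1 and ordinal values map to integers that respect the stated order.

```python
import pandas as pd

# Binary variable: code it as 0/1.
answers = pd.Series(["yes", "no", "yes"])
binary = (answers == "yes").astype(float)

# Ordinal variable: state the order explicitly, then take the integer codes.
levels = ["low", "medium", "high"]  # hypothetical ordered levels
ratings = pd.Series(["medium", "low", "high", "low", "high"])
ordinal = pd.Categorical(ratings, categories=levels, ordered=True).codes.astype(float)

print(binary.tolist())
print(ordinal.tolist())
```

Passing `categories=levels` matters here: without it, pandas would sort the labels alphabetically and the codes would no longer follow the intended order.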