bd2kccd / py-causal

Different causal graphs for the same algorithm parameters #91

Closed bhavyaghai closed 4 years ago

bhavyaghai commented 4 years ago

I am trying to infer a causal graph with the PC All algorithm using the following command:

tetrad.run(algoId = 'pc-all', dfs = df, testId = 'cg-lr-test', alpha = 0.01, dataType = 'mixed', numCategoriesToDiscretize = 7, discretize = False, concurrentFAS = True, maxPathLength = 0, conflictRule = 1, colliderDiscoveryRule = 1)

Executing the above command yields a different causal graph each time. Is there some randomness involved? If so, is there a seed value we can set so that we get the same result each time?

Furthermore, is there any documentation to read more about different types of parameters that can be passed and what they mean?

jdramsey commented 4 years ago

OK, you're making me think. You're using conditional Gaussian; I looked at the code there, and there is no randomness involved. PC is not random. PC in my branch has some randomness to it, but the version in the development branch doesn't. There is one possibility that I can think of, which is that the adjacency step of PC is not order-independent; if you change the order of the variables, it can output different results. You can tell if this is the problem by switching to PC-Stable. Now mind you, PC-stable still does not eliminate all variation in orientation, so first check to see if the adjacencies stabilize.

This paper is the reference: Colombo, Diego, and Marloes H. Maathuis. "Order-independent constraint-based causal structure learning." The Journal of Machine Learning Research 15.1 (2014): 3741-3782.

Internally, there is a parameter called "randomize columns"; if you can get at that and set it to false, you should be able to eliminate any variation from this cause.

That's my first guess.
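
To check whether the adjacencies stabilize or only the orientations vary, the edge lists from two runs can be compared pair by pair. This is a minimal sketch, not part of py-causal or Tetrad: `canonical` and `compare_graphs` are hypothetical helper names, and it assumes edges arrive as strings such as 'A --> B' (the form py-causal's getEdges() appears to return) with node names that contain no spaces.

```python
def canonical(edge):
    """Return ((node1, node2), mark) with the node pair in sorted order.

    Assumes an edge string of the form 'A --> B'; any trailing
    annotations after the third token are ignored.
    """
    a, mark, b = edge.split()[:3]
    if a > b:
        # Swap the endpoints and mirror the mark, e.g. 'B <-- A' becomes 'A --> B'.
        a, b = b, a
        mark = mark[::-1].translate(str.maketrans("<>", "><"))
    return (a, b), mark


def compare_graphs(edges_a, edges_b):
    """Separate adjacency differences from orientation differences."""
    graph_a = dict(canonical(e) for e in edges_a)
    graph_b = dict(canonical(e) for e in edges_b)
    shared = set(graph_a) & set(graph_b)
    return {
        # Node pairs adjacent in only one of the two graphs.
        "only_in_a": sorted(set(graph_a) - set(graph_b)),
        "only_in_b": sorted(set(graph_b) - set(graph_a)),
        # Node pairs adjacent in both graphs but with different edge marks.
        "reoriented": sorted(p for p in shared if graph_a[p] != graph_b[p]),
    }
```

If `only_in_a` and `only_in_b` come back empty while `reoriented` does not, the adjacencies are stable and only the orientation step differs between runs.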

chirayukong commented 4 years ago

You can look up the parameters of PC-Stable at http://cmu-phil.github.io/tetrad/manual/#search_box. The same search parameters are shared across all Tetrad applications. You can also call PC-Stable directly from the Tetrad jar file. An example is here: https://github.com/bd2kccd/py-causal/blob/development/example/javabridge/Calling%20Directly%20Py-Causal%20PC-Stable%20Example.ipynb

bhavyaghai commented 4 years ago

@jdramsey @chirayukong I went through the Tetrad manual and tried playing with the following parameters: stableFAS, fasRule (for PC Stable), randomizeColumns, colliderDiscoveryRule, etc. I tried setting stableFAS=true and fasRule=2 or 3, which should result in the 'PC Stable' algorithm. Furthermore, I also set randomizeColumns=False.

So far, I am still getting randomness in the causal graphs. What could the other reasons be?

For my use case, getting consistent results is pretty important. Please guide me on what I should do next.

jdramsey commented 4 years ago

Wait, you have randomize columns = false and you still get variation from algorithms that don't introduce randomness? How??

Is the syntax right? Sorry, I'm not familiar with that way of running tetrad--maybe I can get advice from someone.

bhavyaghai commented 4 years ago

I am also surprised as to why it's happening. To give more insight, I am using the following command:-

tetrad.run(algoId = 'pc-all', dfs = df, testId = 'cg-lr-test', alpha = 0.05, dataType = 'mixed', numCategoriesToDiscretize = 7, discretize = False, conflictRule = 2, colliderDiscoveryRule = 2, depth=-1, randomizeColumns=False, stableFAS= True, fasRule=3, concurrentFAS = True)

Attached are two versions of the causal graph generated from the above command.

[Image: Screenshot from 2019-12-24 17-29-14]
[Image: Screenshot from 2019-12-24 17-28-21]
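
Rather than comparing screenshots, the variation can be quantified by repeating the search and tallying the distinct edge sets. This is a rough sketch under stated assumptions: `count_distinct_graphs` and `run_search` are hypothetical names, and the commented wiring assumes the edge list is available from tetrad.getEdges() after tetrad.run(...).

```python
from collections import Counter


def count_distinct_graphs(run_search, n_runs=10, normalize=None):
    """Call run_search() n_runs times and tally the distinct edge sets.

    run_search: zero-argument callable returning a list of edge strings.
    normalize:  optional callable applied to each edge before comparison,
                so equivalent spellings of an edge are counted as one.
    """
    tally = Counter()
    for _ in range(n_runs):
        edges = run_search()
        if normalize is not None:
            edges = [normalize(e) for e in edges]
        tally[frozenset(edges)] += 1
    return tally


# Hypothetical wiring for py-causal (assumption, not executed here):
#
# def run_search():
#     tetrad.run(algoId='pc-all', dfs=df, testId='cg-lr-test', alpha=0.05,
#                dataType='mixed', numCategoriesToDiscretize=7)
#     return tetrad.getEdges()
#
# tally = count_distinct_graphs(run_search, n_runs=20)
# print(len(tally), "distinct graphs in 20 runs")
```

A tally with a single key means every run produced the same edge set; more than one key confirms the nondeterminism and shows how often each variant occurs.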

jdramsey commented 4 years ago

I'm kind of busy until Thursday morning, but I can try downloading this version and see if I can't get this to happen for me. This is R-causal? Sorry, I usually work in Java.

Maybe I can start there, see if I can get it to happen in Java with the algorithms and tests you're using. I take it this is mixed data. That would be a start--see if it happens in the underlying Java, try to localize the problem.

bhavyaghai commented 4 years ago

This is Py-causal. Looking forward to your reply.

jdramsey commented 4 years ago

Could you maybe send me a sample data set that's causing the problem? It might be something about the data. To here: jdramsey@andrew.cmu.edu.

Happy holiday.

Joe

bhavyaghai commented 4 years ago

@jdramsey Sent. Please check your email

Happy Holiday

chirayukong commented 4 years ago

@bhavyaghai could you re-pull py-causal? I updated its library, and the problem might be related to the causal-cmd issue https://github.com/bd2kccd/causal-cmd/issues/48.

bhavyaghai commented 4 years ago

@jdramsey I installed the latest py-causal using pip install git+git://github.com/bd2kccd/py-causal but the issue still exists.

jdramsey commented 4 years ago

I loaded your data into the Tetrad interface (in my work branch), loaded it using the mixed data loader with max discrete categories = 2, and ran PC-stable. I then did it again and got the same result. I then closed Tetrad and re-launched it and did it again and got the same result. So I don't think the problem is with the algorithm.

So I tested to see whether it was the data loader. I loaded the data again, in the same way, and ran PC-stable again. I got the same result. So I don't think it's the data loader.

I was wondering if it was a problem with a random seed. So I closed Tetrad again and re-opened it, which has the effect of resetting the random seed. I then loaded the data in the same way and ran PC-stable and got the same result. So I don't think it's a problem with a random seed.

Basically, I'm not getting any of this kind of variation in the Tetrad interface. So now I'm wondering if Kong fixed it by resetting the Tetrad library for R-python?

jdramsey commented 4 years ago

By the way, I realized there should be no need to set the randomize columns flag as I had suggested. You're loading the data in from a file.

@chirayukong

bhavyaghai commented 4 years ago

@jdramsey I agree with you that the tetrad interface generates the same causal graph each time. It seems the problem is with the python api. @chirayukong I pulled the latest py-causal library but I am still getting different causal graphs. Can you please look into it?

chirayukong commented 4 years ago

@bhavyaghai have you tried running PC-stable directly from the Tetrad-lib library? Here is an example of how to do it: https://github.com/bd2kccd/py-causal/blob/development/example/javabridge/Calling%20Directly%20Py-Causal%20PC-Stable%20Example.ipynb

py-causal calls the algorithm through the Tetradrunner interface from the causal-cmd package. @kvb2univpitt could you check causal-cmd for this issue?

bhavyaghai commented 4 years ago

@chirayukong I tried running PC-Stable directly as shown in the notebook you referred to. It gives consistent results on every run, which is exactly what I want. But using the Tetrad-lib library is not as straightforward as py-causal. I don't know how to set different hyperparameters, specify column types, choose independence tests, etc. Furthermore, I don't want to deal with Java directly. So can you please look into py-causal and fix it so that I don't have to use Tetrad-lib directly?

Thanks!

bhavyaghai commented 4 years ago

@chirayukong Running tetrad via javabridge is significantly slower than py-causal, especially the data loading part. I think it's also more memory intensive. For my current research problem, it would be great if I could use py-causal without the randomness issue. Is it possible for you to give a timeframe for when this bug will be fixed?

chirayukong commented 4 years ago

@bhavyaghai so sorry for the delay, but I have been working on another grant at the moment. I can point out where you can use the built-in data loader of py-causal, which doesn't require much memory, but you need to load the data from a text file. Here it is:

    import javabridge  # requires an already-running JVM

    # temp_data_path and numCategoriesToDiscretize must be defined by the caller.
    # Read the tab-delimited data file with the mixed-data reader
    f = javabridge.JClassWrapper('java.io.File')(temp_data_path)
    path = f.toPath()
    delimiter = javabridge.get_static_field('edu/pitt/dbmi/data/reader/Delimiter','TAB','Ledu/pitt/dbmi/data/reader/Delimiter;')
    dataReader = javabridge.JClassWrapper('edu.pitt.dbmi.data.reader.tabular.MixedTabularDatasetFileReader')(path,delimiter,numCategoriesToDiscretize)
    tetradData = dataReader.readInData()
    # Convert the reader's output into a Tetrad DataModel
    tetradData = javabridge.static_call('edu/cmu/tetrad/util/DataConvertUtils','toDataModel','(Ledu/pitt/dbmi/data/reader/Data;)Ledu/cmu/tetrad/data/DataModel;', tetradData)

https://github.com/bd2kccd/py-causal/blob/6201db7d21aa453a8b66bf632db930417209a430/src/pycausal/pycausal.py#L68
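
The loader above uses the TAB delimiter, so the data has to exist as a tab-delimited text file first. Here is a minimal standard-library sketch of that step; `write_tab_delimited` and its arguments are hypothetical, and with a pandas DataFrame something like `df.to_csv(path, sep='\t', index=False)` should serve the same purpose.

```python
import csv
import os
import tempfile


def write_tab_delimited(columns, rows, path=None):
    """Write a header row plus data rows as tab-delimited text.

    Returns the path of the file written. When no path is given,
    a temporary file is created and the caller should delete it.
    """
    if path is None:
        fd, path = tempfile.mkstemp(suffix=".txt")
        os.close(fd)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(columns)
        writer.writerows(rows)
    return path
```

The returned path could then be passed as temp_data_path to the reader code above.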