bd2kccd / causal-cmd


Markov Checker on Causal CMD CLI #100

Open carlosparadis opened 9 months ago

carlosparadis commented 9 months ago

@jdramsey thank you for the follow-up by e-mail! I will post it here so it is easy for everyone to reference. Here's my original question:

I’ve been able to use BOSS via CLI on Causal CMD and the GUI, but I would like to also use the Markov Checker via CLI (taking as input the data + the boss search output). I know it can be used via Py Tetrad (https://github.com/cmu-phil/py-tetrad/blob/d3b73520cb3d27a3211157252129f9b3491ae66f/pytetrad/run_markov_checker.py#L10). Am I missing a flag on Causal CMD CLI or is it not available?

I understand it may be a while until this is possible, as it would be a different interface for the CLI.
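For readers following along, the BOSS step mentioned above can be sketched as a small Python helper that only assembles the command line. This is a sketch under assumptions, not the project's documented interface: the jar file name is hypothetical, and the flag names (`--algorithm`, `--data-type`, `--dataset`, `--delimiter`, `--score`, `--json-graph`) are written from memory of the causal-cmd documentation and should be checked against `--help` for the installed version. No Markov Checker flag is shown because, per this thread, none is known to exist.

```python
from typing import List


def build_causal_cmd_args(
    jar_path: str,
    dataset: str,
    algorithm: str = "boss",
    score: str = "sem-bic-score",
    data_type: str = "continuous",
    delimiter: str = "comma",
) -> List[str]:
    """Assemble (but do not run) a causal-cmd invocation as an argv list.

    Flag names are assumptions based on the causal-cmd docs; verify them
    with `java -jar <jar> --help` before relying on this.
    """
    return [
        "java", "-jar", jar_path,
        "--algorithm", algorithm,
        "--data-type", data_type,
        "--dataset", dataset,
        "--delimiter", delimiter,
        "--score", score,
        "--json-graph",  # ask for a JSON graph alongside the text output
    ]


# Hypothetical jar name and dataset path, for illustration only.
args = build_causal_cmd_args("causal-cmd.jar", "data.csv")
print(" ".join(args))
```

The list can be passed to `subprocess.run(args, check=True)` once the flags have been confirmed. Keeping the assembly separate from the execution makes the call easy to inspect and to mirror from R.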


I also have one new question related to my interest in the CLI, though perhaps it is better raised as an issue on rpy-tetrad; let me know. I am primarily an R user, and I noticed that r-causal, which interfaced with Causal Command directly via rJava, was replaced by rpy-tetrad, which refers to the Python package, which in turn refers to Causal Command. I began wrapping R code around the Causal Command CLI to stay close to Causal Command updates, but in hindsight, it may have been better for me to stay close to its Java API via rpy-tetrad (given this particular issue).

My question is: was the choice to pursue rpy-tetrad, rather than continuing to use rJava directly in r-causal, due to rJava overhead or limitations compared to the alternative?

Thank you!

jdramsey commented 9 months ago

Actually, that's a good question. It would be good to answer that clearly; others may also want to see the answer and comment.

One issue is that with the old r-causal and py-causal packages, two separate projects had to be kept in sync with changing Tetrad versions, and the person who knew how to do that moved on to another job. By making rpy-tetrad dependent on py-tetrad, only one project needs to be updated with changing Tetrad versions. Since our team is small, that's an advantage.

Another issue is that a number of bugs were reported with rJava that we avoided with the reticulate R package combined with the JPype Python package. So far, we have had no technical issues at all with the latter combination. The problem isn't causal-cmd, which works perfectly; it's the rJava connection. The technical issues reported had to do with transferring large datasets from R to Java, which could become slow and take a lot of memory. This is a non-issue with rpy-tetrad.

One issue with rpy-tetrad and py-tetrad is they are still a bit of a bother to install. At some point, we hope that will simplify.

Using py-tetrad instead of py-causal is strongly motivated. The Python-Java connection in py-causal was super-hard (or impossible) to install on Macs or even Windows; it was primarily designed for Linux. We wanted to target Mac and Windows especially, so that made it difficult. py-tetrad using JPype works perfectly on all platforms.

jdramsey commented 9 months ago

By the way, there's no reason the Markov Checker can't be made available in rpy-tetrad. I started to do that the other day but got distracted. Let me know if it would be helpful, and I'll do it.

Basically, it just needs to be wrapped inside some Python code, and then you'll be able to source it in R and use it.

carlosparadis commented 9 months ago

Hi Joe,

Thank you again for the insight on the design decisions and for offering to add the Markov Checker interface via rpy-tetrad. I suspected rJava and the overhead of maintaining two tools were the reasons, and it makes a lot of sense. To answer your question on the rpy-tetrad Markov Checker, I need to ask you one more question: would you say that, in general, most of the functionality of Tetrad can only be accessed via py-tetrad or rpy-tetrad rather than Causal Command, or do you have a preference for one over the other as far as making more Tetrad GUI features available?

I believe waiting on a Causal Command interface would be preferable for me, as it is "closer to the source". That is, it avoids the risk that the entire pipeline I build would break if either Python's JPype or R's reticulate ran into issues. In addition, I started this work using Causal Command, but depending on which interface will more closely follow updates to the Tetrad GUI, I may change direction. To give you a bit more context on why I am leaning towards Causal Command, in addition to dependency risk:

I've been working with Mike Konrad on performing Causal Search on Software Engineering data. I have an R package I wrote (https://github.com/sailuh/kaiaulu/) that prepares the data Mike uses, but for "automated reproducibility" in our work, I also started writing a very lightweight R interface to Causal Command for our needs (e.g. http://itm0.shidler.hawaii.edu/kumu/articles/random_causality_threshold.html#causal-algorithm). The main goal is to keep everything in R, so the full analysis can be included in a single R Notebook, say, in a paper's supplemental material.

My lightweight R interface also does a bit more than call (a very tiny fraction of) Causal Command. For example, the multi time series time lag transformation we discussed recently via e-mail is done here, along with some of the Null Variables approach Mike uses. I also have a parser for the output graph.json and use another visualization to display the causal graph (end of the Notebook). I found it very helpful that Causal Command's graph.json and knowledge table are interchangeable with the Tetrad GUI, which makes it easier to compare work with Mike back and forth. Finally, you will notice that my Causal Command R functions mostly mimic the way the Tetrad GUI presents parameters to users (since that makes it easier for me to compare against Mike's Tetrad GUI pipeline). So, these are the reasons I have been leaning towards Causal Command. I am not sure if and how I could do the same via rpy-tetrad. Again, I am not sure this approach makes the most sense. I'd appreciate your thoughts on it.
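As an illustration of the time lag transformation mentioned above, here is a minimal Python sketch. It assumes the transformation means: lag each time series separately, then stack the results so that no lagged value crosses a series boundary. The function names and the `name:lag` column naming are hypothetical choices for this example, not taken from the R interface or from Tetrad.

```python
from typing import Dict, List

Row = Dict[str, float]


def lag_series(rows: List[Row], num_lags: int) -> List[Row]:
    """Add the previous `num_lags` values of every variable to each row.

    The first `num_lags` rows are dropped because they lack a full history.
    """
    lagged = []
    for t in range(num_lags, len(rows)):
        out: Row = {}
        for lag in range(num_lags + 1):
            for name, value in rows[t - lag].items():
                # Lag 0 keeps the original name; older values get a suffix.
                key = name if lag == 0 else f"{name}:{lag}"
                out[key] = value
        lagged.append(out)
    return lagged


def lag_multiple_series(series: List[List[Row]], num_lags: int) -> List[Row]:
    """Lag each series on its own, then stack the lagged rows."""
    stacked: List[Row] = []
    for rows in series:
        stacked.extend(lag_series(rows, num_lags))
    return stacked


# Two short made-up series with variables x and y.
series_a = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}]
series_b = [{"x": 7, "y": 70}, {"x": 8, "y": 80}]
result = lag_multiple_series([series_a, series_b], num_lags=1)
for row in result:
    print(row)
```

Lagging per series before stacking is the design choice that matters here: lagging the concatenated data instead would pair the first row of one series with the last row of the previous one.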

On a note related to this issue's motivation, I noticed there is also a tool called Causal Compare (https://github.com/bd2kccd/causal-compare). I am not sure whether adding the Markov Checker would be easier there (seeing it reminded me of the Compare Box in the Tetrad GUI, where the Markov Checker currently resides). However, its input format is a graph.txt. If an extension there made more sense, being able to input a graph.json would be great.