bd2kccd / causal-cmd

Large calculation time difference between r-causal and causal-cmd v1.5.0 #83

Closed yasu-sh closed 1 year ago

yasu-sh commented 1 year ago

I am wondering why the calculation time differs so much between the 2019 version and the 2023 version. Could you tell me why there is such a big difference? Are the algorithms different? I profiled causal-cmd v1.5.0 with the IntelliJ IDEA profiling tool; the process looks normal, but the test spends a lot of time in log-gamma calculations.

r-causal based on causal-cmd ver. 1.2.0-SNAPSHOT

[1] "INFO No prior knowledge for TETRAD."
[1] "r-causal used Parameters: numberResampling = 0 / samplePrior = 10 / structurePrior = 1 / faithfulnessAssumed = false / verbose = false / maxDegree = 6 / symmetricFirstStep = true / printStream = "
[1] "INFO structure learning completed. user/sys/elapsed duraton: 3.03 / 0.01 / 3.33"

Actual turn around time: 3.33 secs.

causal-cmd based on @kvb2univpitt's release

[1] "INFO Causal structure learning started. Recipe: 5 | Algorithm: fges"
[1] "INFO structure learning with causal-cmd started."
[1] "INFO No prior knowledge for TETRAD."
[1] "javaw -Xmx3G -Xss4096k -jar causal-cmd-1.5.0-jar-with-dependencies.jar --data-type discrete --delimiter comma --score bdeu-score --json-graph --maxDegree 6 --algorithm fges --dataset dt.tetrad.csv --prefix c-tetrad --knowledge prior.txt --priorEquivalentSampleSize 10 --numberResampling 0 --parallelized --symmetricFirstStep"
[1] "INFO structure learning completed. user/sys/elapsed duraton: 0.19 / 0.64 / 100.25"

Actual turn around time: 100 secs.

No prior knowledge (prior.txt contains no information). Dataset: hailfinder (https://github.com/bd2kccd/causal-cmd/files/10555256/dt.tetrad.csv)

The conditions, including the dataset, are the same as in: https://github.com/bd2kccd/causal-cmd/issues/80#issuecomment-1411716409

I aborted the profiled process because profiling made the run take much longer.
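
One way to narrow down where the extra time goes, without a profiler, is to time the same command with individual flags toggled. The following is a minimal Python sketch under stated assumptions: the flag values are copied from the command line above, the helper names `build_causal_cmd_args` and `time_command` are hypothetical, `java` is used in place of `javaw`, and nothing here implies that `--parallelized` is the actual cause of the slowdown.

```python
import subprocess
import time


def build_causal_cmd_args(jar, dataset, prefix, knowledge=None,
                          parallelized=True, java_exe="java"):
    """Assemble a causal-cmd argument list mirroring the command in the report.

    Hypothetical helper: only the flag names and values come from the report.
    """
    args = [
        java_exe, "-Xmx3G", "-Xss4096k", "-jar", jar,
        "--data-type", "discrete",
        "--delimiter", "comma",
        "--score", "bdeu-score",
        "--json-graph",
        "--maxDegree", "6",
        "--algorithm", "fges",
        "--dataset", dataset,
        "--prefix", prefix,
        "--priorEquivalentSampleSize", "10",
        "--numberResampling", "0",
        "--symmetricFirstStep",
    ]
    if knowledge:
        args += ["--knowledge", knowledge]
    if parallelized:
        args.append("--parallelized")
    return args


def time_command(args):
    """Run a command and return (exit code, wall-clock seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, time.perf_counter() - start
```

Calling `time_command(build_causal_cmd_args("causal-cmd-1.5.0-jar-with-dependencies.jar", "dt.tetrad.csv", "c-tetrad", parallelized=False))` and comparing the result with the `parallelized=True` run would show whether that one flag matters on a given machine; the same pattern applies to any other flag.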

jdramsey commented 1 year ago

One thing that may be nice from your point of view for R is that we wrote a method to save out graphs in the "endpoint matrix" format of PCALG--it's returned as a numpy array, though I may change that to a pandas data frame so the variable names can be returned as well; you can always get a numpy array from the data frame if you don't need the variable names. But that might be the beginning of a way to interface Java with R in a more up-to-date way.

jdramsey commented 1 year ago

Anyway, first things first; we'd like to make it very easy to use Tetrad from Python, much easier than currently, and py-tetrad seems to promise that.

yasu-sh commented 1 year ago

@jdramsey I have confirmed that py-tetrad works on my Windows machine. It seems to work well as long as it is used from a Python environment only.

  1. JVM: version 11, OpenJDK Runtime Environment Temurin-11.0.18+10 (build 11.0.18+10)
  2. JAVA_HOME is set to the JDK above: JAVA_HOME=C:\Program Files\Eclipse Adoptium\jdk-11.0.18.10-hotspot\
  3. Installed Python 3.10: Python 3.10.10 (tags/v3.10.10:aad5f6a, Feb 7 2023, 17:20:36) [MSC v.1929 64 bit (AMD64)] on win32
  4. Installed causal-learn (0.1.3.3) with pip: pip install causal-learn
  5. Installed JPype (1.4.1) with pip: pip install JPype1
  6. Cloned the py-tetrad GitHub repository

The location given by __file__ is here:
E:\PyProjects\py-tetrad\examples..
E:\PyProjects\py-tetrad\examples\run_searches_on_discrete_data.py

jdramsey commented 1 year ago

Huh, interesting! OK, I learned something. Feel free to let me know if you find any valuable tweaks. But the Java version and the Python version are good, based on what I was expecting.

I think I know what you mean by "python environment only"--it would be nice if I could figure out how to publish the project on PyPI. I'll have to figure out how to fix the paths, but I can look at other projects.

And I promise I will start thinking about R again.

yasu-sh commented 1 year ago

My comment on "python environment" crossed with yours... Sorry. (This was added later.)

-- I notice that there are different ways to call py-tetrad from my existing R code. Which do you recommend? I think the simplest way is the same as for causal-cmd, i.e. option 1.

Possible ways from R:

  1. Call py-tetrad from the command prompt, like causal-cmd
  2. Call it through the reticulate interface (https://rstudio.github.io/reticulate/). This results in an error because __file__ resolves differently.
 __file__ returns:
> case1: virtualenv - the current virtualenv's activate_this.py.
> case2: plain python.exe - not defined.

Questions:

  1. How can I provide prior knowledge information to py-tetrad?
  2. How can I output a json-graph file after a search?

yasu-sh commented 1 year ago

py-tetrad ran with PyCharm CE on Win10 with a Japanese locale! @jdramsey Thank you for your help!

jdramsey commented 1 year ago

That's great! :-D I'm glad it worked!

It's good to know there are some options for R; I'll try them out when I can. Our old project, https://github.com/bd2kccd/r-causal, used the command line option, but I still need to look closely at it. (Unfortunately, the guy who was maintaining that code left us, and we haven't had a chance to work through what he did, but I will.) The current problem is that it uses an old version of Tetrad, and a lot of the code needs to be updated; in that sense accessing Tetrad through Python would be a great option if it could work, though that does involve coordinating three languages.

Also, I am curious to know whether your problems are Windows-specific. Maybe I could get it to work on my Mac.

jdramsey commented 1 year ago

Looking briefly at the reticulate docs, it seems I could solve the file path problem by figuring out how to publish py-tetrad to PyPI. I've never done that before, but maybe I can figure it out. Then you can just

import pytetrad

and then continue using the functionality in your project from there. I'm still learning how to do these sorts of things in Python.

yasu-sh commented 1 year ago

@jdramsey Thank you for investigating and checking the documentation. I feel that going through the reticulate package could take more time than a simple call from the console when it comes to breaking down issues.

For example, the latest version of reticulate (v1.28, compiled on my PC) crashes with an error when calling Python, but the stable version (v1.24, CRAN binary) works fine when calling Python on my PC.

For example: https://github.com/rstudio/reticulate/issues/1258 I guess there are many similar incompatibilities. My case, for example:

R console
> sys$executable
[1] "C:\\Program Files\\RStudio\\resources\\app\\bin\\rsession.exe"

IPython
>>>  sys.executable
'C:\\Python\\venv\\pytetrad\\Scripts\\python.exe'

command prompt at R console

> library(reticulate)
> setwd("E:/PyProjects/py-tetrad")
> py_run_file("./examples/run_searches_on_discrete_data.py")
Error in py_run_file_impl(file, local, convert) : 
  FileNotFoundError: [Errno 2] No such file or directory: 'C:/Python/venv/pytetrad/Scripts\\../examples/resources/bridges.data.version211_rev.txt'

Detailed traceback:
  File "<string>", line 30, in <module>
  File "C:\Python\venv\pytetrad\lib\site-packages\pandas\io

This is the return value of __file__:

C:/Python/venv/pytetrad/Scripts..
C:/Python/venv/pytetrad/Scripts/activate_this.py
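
The traceback above shows the data path being built relative to the virtualenv's Scripts directory, which is where __file__ points under reticulate. A minimal Python sketch of one way around this, assuming the script is free to resolve its resources against an explicit base directory instead of __file__: the function name `resolve_resource` and the environment variable `PYTETRAD_EXAMPLES_DIR` are hypothetical and not part of py-tetrad.

```python
import os
from pathlib import Path


def resolve_resource(relative_path, base_dir=None,
                     env_var="PYTETRAD_EXAMPLES_DIR"):
    """Resolve a data file without consulting __file__.

    Lookup order: the explicit base_dir argument, then the (hypothetical)
    environment variable, then the current working directory.
    """
    if base_dir is None:
        base_dir = os.environ.get(env_var) or os.getcwd()
    return (Path(base_dir) / relative_path).resolve()
```

With this approach the caller decides where the project lives, so the result no longer depends on how the interpreter was started. For example, `resolve_resource("resources/bridges.data.version211_rev.txt", base_dir="E:/PyProjects/py-tetrad/examples")` names the intended file regardless of what __file__ holds.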

R environment information:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Japanese_Japan.932  LC_CTYPE=Japanese_Japan.932    LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C                   LC_TIME=Japanese_Japan.932    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reticulate_1.24

loaded via a namespace (and not attached):
[1] compiler_4.0.2  Matrix_1.2-18   tools_4.0.2     Rcpp_1.0.7      grid_4.0.2      jsonlite_1.7.2  png_0.1-7       lattice_0.20-41

Python environment information

> py_config()
python:         C:/Python/venv/pytetrad/Scripts/python.exe
libpython:      E:/Program Files/Python310/python310.dll
pythonhome:     C:/Python/venv/pytetrad
virtualenv:     C:/Python/venv/pytetrad/Scripts/activate_this.py
version:        3.10.10 (tags/v3.10.10:aad5f6a, Feb  7 2023, 17:20:36) [MSC v.1929 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Python/venv/pytetrad/Lib/site-packages/numpy
numpy_version:  1.24.2

NOTE: Python version was forced by RETICULATE_PYTHON

yasu-sh commented 1 year ago

@jdramsey This is my understanding of the several ways to use Tetrad from the console or from other packages in Python/R. This information may or may not be useful to you, but I thought it could save you some time.

| Project | Calling place | jar file | I/F Package | I/F Library | Concerns / Notices |
|---|---|---|---|---|---|
| r-causal | R | causal-cmd | rJava | JNI | Reflection needs to be avoided in for-loops. |
| py-tetrad | R | tetrad | reticulate/JPype | JNI | Need to take care of reticulate's incompatibilities |
| py-tetrad | Python | tetrad | JPype | JNI | NA |
| py-causal | Python | causal-cmd | Javabridge | JNA | NA |
| causal-cmd | Console | causal-cmd | None | None | Needs file access / for general usage |

r-causal: Reflection needs to be avoided (it is over 100x slower). Tetrad's Java developers could choose to call methods without the $ accessors. https://www.rforge.net/rJava/#:~:text=You%20simply%20use,issues%20with%20it.

As I understand it, rJava is the current gold standard interface between Java and R.

jdramsey commented 1 year ago

@yasu-sh This is wonderful--thank you so much for doing this! Yes, this will most certainly help me. I will spend some time in the next few days studying these options. I am interested in your observation, first of all, that causal-cmd needs file access; I hadn't thought of that as a limitation for some people, but it might be; you're right. And the remark on r-causal is beneficial as well. Could reticulate's incompatibility be overcome? Anyway, this is food for thought.

Today I also started another project, which I plan to proceed slowly with. We have a command-line tool for making algorithm comparisons that is quite nice, algcomparison. The procedure is based in Tetrad and has been heavily used, so it is well-tested. Still, Kevin Bui added a facility to allow external search results to be added to the comparison tables from other platforms like Python, R, or Matlab. I decided to code up a Python wrapper for it to do Tetrad and causal-learn algorithms, but I'd like to extend that to R algorithms in, say, PCALG or bnlearn. So at some point, I need to reverse what you're suggesting, i.e., run R algorithms from Python, not just Python algorithms from R. I'll have to think of how to do that as well. But this could be a real contribution to the literature, showing how algorithms from different projects compare to one another on standard statistics so that you can pick the best algorithms (across platforms) for a task you have in mind. I do have to solve that technical challenge, though.

jdramsey commented 1 year ago

@yasu-sh I wonder if the slowness of reflection in for loops might be because the java jar needs to be loaded each time? This was one of the clever things about JPype, I thought--their insistence that the Java jar is loaded only once per session. It doesn't seem to me that reflection itself should take very much time.

yasu-sh commented 1 year ago

@jdramsey Thanks for your quick reply.

file access on causal-cmd

> I am interested in your observation, first of all, that causal-cmd needs file access; I hadn't thought of that as a limitation for some people, but it might be; you're right.

Regarding file access: it is not a limitation. I just thought it could become one if you want to hand over big data (>2G). causal-cmd from the console is basic and reliable. I would like to keep using it as long as there are no performance issues. (So far I only use small data.)

Reflection on R Java

No, it is not about the Java jar being loaded multiple times. The Java Reflection API needs to find the correct method among many candidates. This is the r-causal code I improved last year.

cate_list$add(as.character(cate[j])) is much slower than .jcall(cate_list, "Z", "add", .jcast(.jnew("java/lang/String", as.character(cate[j])), "java/lang/Object")). There are two nested for-loops over all rows and all columns, so the time the code needs scales as: reflection × columns × rows. I guess this must also be avoided in the JPype case, as long as dynamic method access cannot be suppressed. There are definitely more efficient ways to do this; at the time I tried to keep my changes to a minimum.

https://github.com/bd2kccd/r-causal/blob/fc370245e938d5a2cb6e6abd4548bf7107fdd1dc/R/tetrad_utils.R#L326

loadDiscreteData <- function(df){
    # (cut)
    for (i in 1:length(node_names)){
        # (cut)
        for(j in 1:length(cate)){
            # Slow reflective call, replaced by the explicit .jcall below:
            # cate_list$add(as.character(cate[j]))
            .jcall(cate_list, "Z", "add", .jcast(.jnew("java/lang/String", as.character(cate[j])), "java/lang/Object"))
        }
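
The scaling argument can be illustrated with a pure-Python analogy. This is only a sketch of the general idea (resolving a method by name inside the inner loop versus resolving it once outside); it does not reproduce rJava's reflection mechanism, and the function names are hypothetical. It counts lookups rather than measuring time.

```python
def add_with_lookup_each_time(categories_per_column):
    """Resolve the method by name on every inner iteration."""
    lookups = 0
    result = []
    for categories in categories_per_column:
        cate_list = []
        for value in categories:
            append = getattr(cate_list, "append")  # name lookup per element
            lookups += 1
            append(str(value))
        result.append(cate_list)
    return result, lookups


def add_with_hoisted_lookup(categories_per_column):
    """Resolve the method once per column, outside the inner loop."""
    lookups = 0
    result = []
    for categories in categories_per_column:
        cate_list = []
        append = cate_list.append  # resolved once per column
        lookups += 1
        for value in categories:
            append(str(value))
        result.append(cate_list)
    return result, lookups
```

Both functions build the same lists, but the first performs one lookup per element while the second performs one per column. When each lookup is as expensive as a reflective method search, that difference dominates the run time.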

R from Python

I have heard that a colleague uses rpy2. It seems to work without fatal issues. https://rpy2.github.io/

jdramsey commented 1 year ago

@yasu-sh On reflection--I see. That is a real limitation. We tested the data translation methods in JPype with some rather large datasets, and they didn't slow down like that. I need to think about it.

I did notice that the Tetrad data loading routine implemented by Kevin was much faster than the one in Python for continuous data; I should test that with discrete data as well.

Let me add a discrete simulation to py-tetrad, do a save and load in Python, and transfer it back to Java, and see where there is a slowdown.

jdramsey commented 1 year ago

@yasu-sh It's fast enough in py-tetrad; I made a 500-variable dataset with N = 500 using this method:

https://github.com/cmu-phil/py-tetrad/blob/main/examples/simulating_data_discrete.py

Then I converted it from Tetrad to pandas in Python and back again and added print statements to see how long each step took. Loading the JVM took a few seconds, and the simulation itself also took a few seconds, but the conversion to pandas and the conversion from pandas to Tetrad each took about one second, which I thought was OK.
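
The step-by-step timing described here can be sketched as follows. This is a stand-in using only the Python standard library: the simulation and the two conversions are placeholder functions (rows to columns and back), not the actual Tetrad or pandas calls, and the helper name `timed` is hypothetical. Only the 500 x 500 size comes from the comment above.

```python
import random
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def simulate_discrete(n_rows, n_cols, n_categories=3, seed=0):
    """Placeholder simulation: random category indices, row-major."""
    rng = random.Random(seed)
    return [[rng.randrange(n_categories) for _ in range(n_cols)]
            for _ in range(n_rows)]


def rows_to_columns(rows):
    """Placeholder for the first conversion step."""
    return [list(column) for column in zip(*rows)]


def columns_to_rows(columns):
    """Placeholder for the conversion back."""
    return [list(row) for row in zip(*columns)]


timings = {}
with timed("simulate", timings):
    rows = simulate_discrete(500, 500)
with timed("rows -> columns", timings):
    columns = rows_to_columns(rows)
with timed("columns -> rows", timings):
    round_trip = columns_to_rows(columns)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Swapping the placeholder functions for the real conversion calls keeps the same per-step report, which makes it easy to see which stage, if any, slows down as the dataset grows.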

So the question is whether there's a method to transfer a dataset to R that's about the same speed.

jdramsey commented 1 year ago

@yasu-sh I spent some time today turning py-tetrad into a Python package. This may solve the file path problem. It's not done by any means, but it's going in the right direction. All of the hard-coded paths are gone.

It's not much of a package yet, just two files plus several examples, but it will grow. Also, it must be installed by checking it out from GitHub and then using pip to install the package, so the instructions have changed. But perhaps now in R you can import the package and run the examples? I'll have to try it.

yasu-sh commented 1 year ago

@jdramsey The package runs successfully in Python.

1. (pytetrad) PS E:\PyProjects> $env:JAVA_HOME
C:\Program Files\Eclipse Adoptium\jdk-11.0.18.10-hotspot\

2. Install via pip
Successfully built py-tetrad
Installing collected packages: py-tetrad
Successfully installed py-tetrad-0.1

3.
(pytetrad) PS E:\PyProjects> cd py-tetrad/examples                    

4.
(pytetrad) PS E:\PyProjects\py-tetrad\examples> python run_searches_on_continuous_data.py > output.txt

 #### inside output.txt ####
Elapsed initializeForwardEdgesFromEmptyGraph = 0 ms
1. INSERT Attack --> Displacement [] 623.1587215188538 degree = 1 indegree = 1 cond = 1
2. INSERT Chord --> Attack [Displacement] 238.25923654533153 degree = 2 indegree = 1 cond = 2
--- Directing Displacement --> Attack
3. INSERT Frequency --> Pressure [] 117.16912428882142 degree = 2 indegree = 2 cond = 1
4. INSERT Displacement --> Pressure [Frequency] 162.33668553267944 degree = 2 indegree = 2 cond = 2
--- Directing Frequency --> Pressure
5. INSERT Chord --> Pressure [] 134.30338539976947 degree = 3 indegree = 3 cond = 3
6. INSERT Attack --> Frequency [] 50.785978197867735 degree = 3 indegree = 3 cond = 1
7. INSERT Velocity --> Pressure [] 45.056623591369316 degree = 4 indegree = 4 cond = 4
8. INSERT Attack --> Pressure [] 49.445992425935856 degree = 5 indegree = 5 cond = 5
9. INSERT Displacement --> Chord [] 30.227695613430114 degree = 5 indegree = 5 cond = 1
10. INSERT Velocity --> Frequency [Attack] 11.166001339552167 degree = 5 indegree = 5 cond = 2
--- Directing Attack --> Frequency
11. INSERT Chord --> Frequency [] 17.04597303328046 degree = 5 indegree = 5 cond = 3
12. INSERT Velocity --> Attack [Chord, Displacement] 2.2857250149374977 degree = 5 indegree = 5 cond = 3
--- Directing Chord --> Attack
--- Directing Displacement --> Attack
Elapsed time = 0.125 s

FGES

Graph Nodes:
Frequency;Attack;Chord;Velocity;Displacement;Pressure
...

I'll check it in R next.

yasu-sh commented 1 year ago

@jdramsey By the way, shall we move this to a py-tetrad issue? I think this issue can be closed, since the matter now belongs to py-tetrad.

yasu-sh commented 1 year ago

https://github.com/cmu-phil/py-tetrad/issues/1#issue-1614507475

jdramsey commented 1 year ago

Sounds good.