yasu-sh closed this issue 1 year ago.
One thing that may be nice from your point of view for R is that we wrote a method to save out graphs in the "endpoint matrix" format of PCALG. It's returned as a numpy array, though I may change that to a pandas data frame so the variable names can be returned as well; you can always get a numpy array from the data frame if you don't need the variable names. That might be the beginning of a more up-to-date way to interface Java with R.
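To illustrate the idea, here is a minimal sketch of wrapping such an endpoint matrix in a pandas data frame. The endpoint codes and variable names below are invented for illustration only; they are not taken from Tetrad's actual output.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-variable endpoint matrix in a PCALG-style coding
# (the specific codes here are made up for the example).
endpoints = np.array([
    [0, 2, 0],
    [3, 0, 2],
    [0, 3, 0],
])
names = ["X1", "X2", "X3"]

# Wrapping the array in a DataFrame keeps the variable names attached.
df = pd.DataFrame(endpoints, index=names, columns=names)

# The bare numpy array stays one call away when names aren't needed.
arr = df.to_numpy()
```

From R via reticulate, the data frame would arrive with its row and column names intact, which a bare numpy array cannot carry.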
Anyway, first things first; we'd like to make it very easy to use Tetrad from Python, much easier than currently, and py-tetrad seems to promise that.
@jdramsey I have confirmed that py-tetrad works on my Windows machine. It looks to work well, as long as it stays within a Python-only environment.
OpenJDK Runtime Environment Temurin-11.0.18+10 (build 11.0.18+10)
JAVA_HOME=C:\Program Files\Eclipse Adoptium\jdk-11.0.18.10-hotspot\
Python 3.10.10 (tags/v3.10.10:aad5f6a, Feb 7 2023, 17:20:36) [MSC v.1929 64 bit (AMD64)] on win32
pip install JPype1
The value of `__file__` here is: E:\PyProjects\py-tetrad\examples.. E:\PyProjects\py-tetrad\examples\run_searches_on_discrete_data.py
Huh, interesting! OK, I learned something. Feel free to let me know if you find any valuable tweaks. But the Java version and the Python version are good, based on what I was expecting.
I think I know what you mean by "python environment only"--it would be nice if I could figure out how to publish the project on PyPI. I'll have to figure out how to fix the paths, but I can look at other projects.
And I promise I will start thinking about R again.
This was a crossed comment about the "python environment"... sorry. (This was added later.)
-- I notice that there are different ways to call py-tetrad from my existing R code. Which do you recommend? I think the simplest way is the same as causal-cmd, i.e., option 1.
Possible ways from R:
`__file__` returns:
> case 1: virtualenv - the current virtualenv's `activate_this.py`.
> case 2: plain python.exe - not defined.
Questions:
py-tetrad ran with PyCharm CE on Win10 at Japanese locale! @jdramsey Thank you for your help!
That's great! :-D I'm glad it worked!
It's good to know there are some options for R; I'll try them out when I can. Our old project, https://github.com/bd2kccd/r-causal, used the command line option, but I still need to look closely at it. (Unfortunately, the guy who was maintaining that code left us, and we haven't had a chance to work through what he did, but I will.) The current problem is that it uses an old version of Tetrad, and a lot of the code needs to be updated; in that sense accessing Tetrad through Python would be a great option if it could work, though that does involve coordinating three languages.
Also, I am curious to know whether your problems are Windows-specific. Maybe I could get it to work on my Mac.
Looking briefly at the Reticulate docs, it seems I could solve the file path problem by figuring out how to publish py-tetrad to PyPI. I've never done that before, but maybe I can figure it out. Then you could just write
import pytetrad
and continue using the functionality in your project from there. I'm still learning how to do these sorts of things in Python.
@jdramsey Thank you for your investigation and document check. I feel that the reticulate package could require more time to break down issues than a simple call from the console.
For example, the latest version of reticulate (v1.28, compiled on my PC) throws a crashing error when calling Python, but the stable version (v1.24, CRAN binary) works fine when calling Python on my PC.
e.g. https://github.com/rstudio/reticulate/issues/1258 I guess there are many similar incompatibilities. e.g. my case:
R console
> sys$executable
[1] "C:\\Program Files\\RStudio\\resources\\app\\bin\\rsession.exe"
IPython
>>> sys.executable
'C:\\Python\\venv\\pytetrad\\Scripts\\python.exe'
> library(reticulate)
> setwd("E:/PyProjects/py-tetrad")
> py_run_file("./examples/run_searches_on_discrete_data.py")
Error in py_run_file_impl(file, local, convert) :
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Python/venv/pytetrad/Scripts\\../examples/resources/bridges.data.version211_rev.txt'
Detailed traceback:
File "<string>", line 30, in <module>
File "C:\Python\venv\pytetrad\lib\site-packages\pandas\io
This is the return value of `__file__`:
C:/Python/venv/pytetrad/Scripts.. C:/Python/venv/pytetrad/Scripts/activate_this.py
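For what it's worth, a common guard against this kind of path problem is to resolve data files against the script's own directory when `__file__` is defined, and fall back to the working directory otherwise (covering the "plain python.exe - not defined" case noted earlier). This is only a hypothetical sketch; the helper name is made up, and it does not fix the virtualenv case where reticulate points `__file__` at `activate_this.py`:

```python
import os

def resource_path(relative):
    """Resolve `relative` against this script's directory when __file__
    is defined; otherwise fall back to the current working directory
    (e.g. when the source is exec'd and __file__ is absent)."""
    try:
        base = os.path.dirname(os.path.abspath(__file__))
    except NameError:  # plain interactive session or exec'd code
        base = os.getcwd()
    return os.path.normpath(os.path.join(base, relative))
```

With something like this, a relative resource path stays correct regardless of the caller's working directory, as long as the data sits next to the script.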
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932 LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C LC_TIME=Japanese_Japan.932
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reticulate_1.24
loaded via a namespace (and not attached):
[1] compiler_4.0.2 Matrix_1.2-18 tools_4.0.2 Rcpp_1.0.7 grid_4.0.2 jsonlite_1.7.2 png_0.1-7 lattice_0.20-41
> py_config()
python: C:/Python/venv/pytetrad/Scripts/python.exe
libpython: E:/Program Files/Python310/python310.dll
pythonhome: C:/Python/venv/pytetrad
virtualenv: C:/Python/venv/pytetrad/Scripts/activate_this.py
version: 3.10.10 (tags/v3.10.10:aad5f6a, Feb 7 2023, 17:20:36) [MSC v.1929 64 bit (AMD64)]
Architecture: 64bit
numpy: C:/Python/venv/pytetrad/Lib/site-packages/numpy
numpy_version: 1.24.2
NOTE: Python version was forced by RETICULATE_PYTHON
@jdramsey This is my understanding of the several ways to use Tetrad from the console or from other packages in Python/R. This information may or may not be useful to you, but I thought it might save you some time.
Project | Calling place | jar file | I/F Package | I/F Library | Concerns / Notices |
---|---|---|---|---|---|
r-causal | R | causal-cmd | rJava | JNI | Reflection needs to be avoided in for-loops. |
py-tetrad | R | tetrad | reticulate/JPype | JNI | Needs care regarding reticulate's incompatibilities |
py-tetrad | Python | tetrad | JPype | JNI | NA |
py-causal | Python | causal-cmd | Javabridge | JNA | NA |
causal-cmd | Console | causal-cmd | None | None | Need file access / for general usage |
r-causal: reflection needs to be avoided (over 100x slower); it is possible for Tetrad's Java developers to choose methods without the `$` accessors. https://www.rforge.net/rJava/#:~:text=You%20simply%20use,issues%20with%20it.
To my understanding, rJava is currently the gold-standard interface between Java and R.
@yasu-sh This is wonderful--thank you so much for doing this! Yes, this will most certainly help me. I will spend some time in the next few days studying these options. I am interested in your observation, first of all, that causal-cmd needs file access; I hadn't thought of that as a limitation for some people, but it might be; you're right. And the remark on r-causal is beneficial as well. Could reticulate's incompatibility be overcome? Anyway, this is food for thought.
Today I also started another project, which I plan to proceed slowly with. We have a command-line tool for making algorithm comparisons that is quite nice, algcomparison. The procedure is based in Tetrad and has been heavily used, so it is well-tested. Still, Kevin Bui added a facility to allow external search results to be added to the comparison tables from other platforms like Python, R, or Matlab. I decided to code up a Python wrapper for it to do Tetrad and causal-learn algorithms, but I'd like to extend that to R algorithms in, say, PCALG or bnlearn. So at some point, I need to reverse what you're suggesting, i.e., run R algorithms from Python, not just Python algorithms from R. I'll have to think of how to do that as well. But this could be a real contribution to the literature, showing how algorithms from different projects compare to one another on standard statistics so that you can pick the best algorithms (across platforms) for a task you have in mind. I do have to solve that technical challenge, though.
@yasu-sh I wonder if the slowness of reflection in for loops might be because the java jar needs to be loaded each time? This was one of the clever things about JPype, I thought--their insistence that the Java jar is loaded only once per session. It doesn't seem to me that reflection itself should take very much time.
@jdramsey Thanks for your quick reply.
I am interested in your observation, first of all, that causal-cmd needs file access; I hadn't thought of that as a limitation for some people, but it might be; you're right.
Regarding file access: file access is not itself a limitation. I just thought it could become one if you want to hand over big data (>2 GB). causal-cmd from the console is basic and reliable, and I like to use it when there are no performance issues. (So far I use small data.)
No, it does not mean the Java jar is loaded multiple times. The Java Reflection API needs to find the correct method among many possibilities. This is the r-causal code I improved last year.
`cate_list$add(as.character(cate[j]))` is much slower than `.jcall(cate_list, "Z", "add", .jcast(.jnew("java/lang/String", as.character(cate[j])), "java/lang/Object"))`. There are two nested for-loops, over all rows and all columns, so the cost is roughly (reflection lookup) x columns x rows. I guess this must be avoided in the JPype case as well, as long as dynamic method lookup is not suppressed. Definitely more efficient ways exist to do this; at the time I tried to make only the minimum changes.
    loadDiscreteData <- function(df){
      (cut)
      for (i in 1:length(node_names)){
        (cut)
        for(j in 1:length(cate)){
          # cate_list$add(as.character(cate[j]))   # slow: resolved via reflection
          .jcall(cate_list, "Z", "add", .jcast(.jnew("java/lang/String", as.character(cate[j])), "java/lang/Object"))
        }
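The same batching idea applies on the Python side. A hypothetical pandas sketch (the data are invented): instead of crossing the language boundary once per cell, encode a whole discrete column as integer category codes in one vectorized step, so only one integer array plus one short label list would need to be handed across per column.

```python
import pandas as pd

# Invented example column of discrete values.
df = pd.DataFrame({"weather": ["rain", "sun", "rain", "snow"]})

# One vectorized pass: integer codes for every row, labels stored once.
col = df["weather"].astype("category")
codes = col.cat.codes.to_numpy()        # one int array for the whole column
categories = list(col.cat.categories)   # the distinct labels, once
```

Whether this maps cleanly onto the r-causal/JPype data loaders is an open question, but it avoids the per-element method dispatch that made the reflection loop slow.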
I have heard that a colleague uses RPY2. It seems to work without fatal issues. https://rpy2.github.io/
@yasu-sh On reflection--I see. That is a real limitation. We tested the data translation methods in JPype with some rather large datasets, and they didn't slow down like that. I need to think about it.
I did notice that the Tetrad data loading routine implemented by Kevin was much faster than the one in Python for continuous data; I should test that with discrete data as well.
Let me add a discrete simulation to py_tetrad, do a save and load in Python, and transfer it back to Java, and see where there is a slowdown.
@yasu-sh It's fast enough in py-tetrad; I made a 500-variable dataset with N = 500 using this method:
https://github.com/cmu-phil/py-tetrad/blob/main/examples/simulating_data_discrete.py
Then I converted it from Tetrad to pandas in Python and back again and added print statements to see how long each step took. Loading the JVM took a few seconds, and the simulation itself also took a few seconds, but the conversion to pandas and the conversion from pandas to Tetrad each took about one second, which I thought was OK.
So the question is whether there's a method to transfer a dataset to R that's about the same speed.
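As a rough, Tetrad-free stand-in for that experiment, the timing pattern can be sketched with plain pandas/numpy at the same 500 x 500 scale. This does not measure the Java crossing that py-tetrad actually performs; it only shows how the per-step timings were taken:

```python
import time
import numpy as np
import pandas as pd

# Simulate a 500-variable discrete dataset with N = 500 (values 0..2).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(500, 500)),
                  columns=[f"X{i}" for i in range(500)])

t0 = time.perf_counter()
arr = df.to_numpy()                           # frame -> array
t1 = time.perf_counter()
df2 = pd.DataFrame(arr, columns=df.columns)   # array -> frame
t2 = time.perf_counter()

print(f"to array: {t1 - t0:.4f} s, back to frame: {t2 - t1:.4f} s")
```

An R-side transfer through reticulate could be timed the same way to see whether it stays in the same ballpark.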
@yasu-sh I spent some time today turning py-tetrad into a Python package. This may solve the file path problem. It's not done by any means, but it's going in the right direction. All of the hard-coded paths are gone.
It's not much of a package yet, just two files plus several examples, but it will grow. Also, it must be installed by checking it out from GitHub and then using pip to install the package, so the instructions have changed. But perhaps now in R you can import the package and run the examples? I'll have to try it.
@jdramsey The package runs successfully in Python:
1. (pytetrad) PS E:\PyProjects> $env:JAVA_HOME
C:\Program Files\Eclipse Adoptium\jdk-11.0.18.10-hotspot\
2. Install via pip
Successfully built py-tetrad
Installing collected packages: py-tetrad
Successfully installed py-tetrad-0.1
3.
(pytetrad) PS E:\PyProjects> cd py-tetrad/examples
4.
(pytetrad) PS E:\PyProjects\py-tetrad\examples> python run_searches_on_continuous_data.py > output.txt
#### inside output.txt ####
Elapsed initializeForwardEdgesFromEmptyGraph = 0 ms
1. INSERT Attack --> Displacement [] 623.1587215188538 degree = 1 indegree = 1 cond = 1
2. INSERT Chord --> Attack [Displacement] 238.25923654533153 degree = 2 indegree = 1 cond = 2
--- Directing Displacement --> Attack
3. INSERT Frequency --> Pressure [] 117.16912428882142 degree = 2 indegree = 2 cond = 1
4. INSERT Displacement --> Pressure [Frequency] 162.33668553267944 degree = 2 indegree = 2 cond = 2
--- Directing Frequency --> Pressure
5. INSERT Chord --> Pressure [] 134.30338539976947 degree = 3 indegree = 3 cond = 3
6. INSERT Attack --> Frequency [] 50.785978197867735 degree = 3 indegree = 3 cond = 1
7. INSERT Velocity --> Pressure [] 45.056623591369316 degree = 4 indegree = 4 cond = 4
8. INSERT Attack --> Pressure [] 49.445992425935856 degree = 5 indegree = 5 cond = 5
9. INSERT Displacement --> Chord [] 30.227695613430114 degree = 5 indegree = 5 cond = 1
10. INSERT Velocity --> Frequency [Attack] 11.166001339552167 degree = 5 indegree = 5 cond = 2
--- Directing Attack --> Frequency
11. INSERT Chord --> Frequency [] 17.04597303328046 degree = 5 indegree = 5 cond = 3
12. INSERT Velocity --> Attack [Chord, Displacement] 2.2857250149374977 degree = 5 indegree = 5 cond = 3
--- Directing Chord --> Attack
--- Directing Displacement --> Attack
Elapsed time = 0.125 s
FGES
Graph Nodes:
Frequency;Attack;Chord;Velocity;Displacement;Pressure
...
I'll check it in R next.
@jdramsey By the way, shall we move to a py-tetrad issue from here? I think this issue can be closed; the matter concerns py-tetrad now.
Sounds good.
I am wondering why the calculation time differs so much between 2019 and 2023. Could you tell me why this big difference happens? Are the algorithms different? I have used the profiling tool in IntelliJ IDEA on causal-cmd v1.5.0, but the process looks normal, and the test spends a lot of time on log-gamma calculations.
- r-causal is based on causal-cmd ver. 1.2.0-SNAPSHOT
- causal-cmd is based on @kvb2univpitt's release
- no prior (no information in prior.txt); dataset: hailfinder (https://github.com/bd2kccd/causal-cmd/files/10555256/dt.tetrad.csv)
- the conditions are the same as below, including the dataset: https://github.com/bd2kccd/causal-cmd/issues/80#issuecomment-1411716409

I aborted the process, since profiling makes it take much longer.