cmu-phil / py-tetrad

Makes algorithms/code in Tetrad available in Python via JPype
MIT License

Speed difference b/w TetradSearch class and using jpype directly #35

Open samblechman opened 2 weeks ago

samblechman commented 2 weeks ago

To get familiar with using Py-Tetrad, I wanted to run GRaSP-FCI on simulated data. Stealing bits of code from jpype_example.py, I did something like:

    D, G = simulateLeeHastie(num_meas=10, samp_size=500)
    import edu.cmu.tetrad.search as ts_
    test = ts_.test.IndTestConditionalGaussianLrt(D, 0.01, True)
    score = ts_.score.DegenerateGaussianScore(D, True)
    fci = ts_.GraspFci(test, score)
    G_ = fci.search()
    # then compare G_ to G...

This runs extremely quickly (< 1 second). However, when I use the TetradSearch class (TetradSearch.py), the computation time increases substantially using the same data:

    D, G = simulateLeeHastie(num_meas=10, samp_size=200)
    df = tr.tetrad_data_to_pandas(D)
    import tools.TetradSearch as ts
    search = ts.TetradSearch(df)
    search.use_conditional_gaussian_test(alpha=0.01)
    search.use_degenerate_gaussian_score(penalty_discount=1)
    G_ = search.run_grasp_fci()

I believe this is conceptually equivalent to the previous example, but it takes 2-3 orders of magnitude longer. If I use a discrete dataset, it runs quickly, in about the same time as using tetrad.search directly.

I am looking for help in understanding why this difference may arise. Additionally, are there functionalities present when using the TetradSearch class that are not available in tetrad.search, or vice versa?

Thank you.

jdramsey commented 2 weeks ago

Well for one thing, in the first case you're converting the data from Tetrad to Python and back again... I wonder if that could account for it.
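One way to check whether the conversion accounts for the gap is to time the conversion step and the search step separately. The helper below is only a sketch using the standard library; `time_call` is a hypothetical name, not part of py-tetrad, and the commented usage lines assume the objects from the snippets earlier in this thread.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times.

    Returns (best wall-clock seconds, result of the last call).
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without Tetrad installed.
    seconds, total = time_call(sum, range(100_000))
    print(f"sum took {seconds:.6f} s, result {total}")

    # With the objects from this thread (assumed to exist already):
    # t_convert, df = time_call(tr.tetrad_data_to_pandas, D, repeats=1)
    # t_search, G_ = time_call(search.run_grasp_fci, repeats=1)
```

If the conversion time is small next to the search time, the round trip is probably not the main cause.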

bja43 commented 1 week ago

I see that you are using a mixed data-type simulation. Is it possible that the datatypes of your variables are getting messed up? Maybe in the second case some of the continuous variables are being treated as discrete?

samblechman commented 1 week ago

@jdramsey Just to clarify, is using jpype directly what you mean by "converting data from Tetrad to Python and back again", or is that what using the TetradSearch.py class amounts to?

@bja43 Interesting thought. The slowdown when using the TetradSearch.py class doesn't occur with purely discrete data, but it does in the mixed case. If mixed data were being treated as discrete, it would be faster, right?

bja43 commented 1 week ago

@samblechman I would expect a slowdown if one or more continuous variables were being treated as discrete. For instance, if a continuous column with 500 instances is treated as a discrete variable with 500 unique categories, that would probably be much slower.

bja43 commented 1 week ago

To be clear, I'm not sure if this is the issue, just something to consider!
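
A quick way to test the "continuous column treated as discrete" idea is to count distinct values per column before handing the data to the search. The sketch below uses plain Python only; `cardinality_report` and the `max_categories` cutoff are hypothetical and say nothing about how TetradSearch itself decides which columns are discrete.

```python
def cardinality_report(columns, max_categories=20):
    """Summarize each column by row count and number of distinct values.

    `columns` maps a column name to a sequence of values.
    `max_categories` is an arbitrary cutoff chosen for this sketch.
    """
    report = {}
    for name, values in columns.items():
        values = list(values)
        n_distinct = len(set(values))
        report[name] = {
            "n_rows": len(values),
            "n_distinct": n_distinct,
            # Many distinct values would be costly if handled as categories.
            "high_cardinality": n_distinct > max_categories,
        }
    return report


if __name__ == "__main__":
    # Toy mixed data: one continuous-looking column, one 3-level column.
    toy = {
        "x_continuous": [i + 0.5 for i in range(500)],
        "d_discrete": [i % 3 for i in range(500)],
    }
    for name, info in cardinality_report(toy).items():
        print(name, info)
```

For a pandas DataFrame `df`, the same check can be run as `cardinality_report({c: df[c].tolist() for c in df.columns})`. A column flagged as high cardinality that ends up as a discrete variable on the Tetrad side would be a candidate explanation for the slowdown.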