cmu-phil / py-tetrad

Makes algorithms/code in Tetrad available in Python via JPype
MIT License

Getting java.lang.IllegalArgumentException: matrix too large #17

Closed phyk closed 6 months ago

phyk commented 8 months ago

I am trying out tetrad through the py-tetrad library. The basic examples work fine, now I want to use a large dataset and run a DAG search algorithm for mixed data. The data has a shape of (1597997, 60), so I have a large number of samples. Is the dataset simply too large? What is the boundary?

Thanks for your help

jdramsey commented 8 months ago

Is that 1597997 variables with 60 cases? Or is it transposed the other way? Either way, you're right; it is a bit large. Sixty variables are not problematic for py-tetrad. Here's a recent paper of ours that takes causal search accurately out to 1000 variables or more using the BOSS algorithm:

https://arxiv.org/abs/2310.17679

We are trying to think of ways to extend that (accurately). But 1597997 variables can't be represented; I don't know of any causal search algorithm with solid statistics that will handle that many variables in any case. Usually, what people do is pick a target variable (or a set of target variables) and restrict the analysis to a subset of variables that are correlated with one of those targets.
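The target-based variable selection described above could be sketched like this. This is just an illustration, not anything from Tetrad itself; the column names, the correlation threshold, and the helper `select_correlated` are all hypothetical:

```python
import numpy as np
import pandas as pd

def select_correlated(df: pd.DataFrame, target: str, threshold: float = 0.1) -> pd.DataFrame:
    """Keep the target plus columns whose absolute Pearson correlation
    with the target exceeds the threshold (hypothetical helper)."""
    corr = df.corr()[target].abs()
    keep = corr[corr > threshold].index.tolist()
    if target not in keep:
        keep.append(target)
    return df[keep]

# Toy data: y depends on x1 but not on noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
noise = rng.normal(size=1000)
y = 2 * x1 + rng.normal(size=1000)
df = pd.DataFrame({"x1": x1, "noise": noise, "y": y})

reduced = select_correlated(df, "y", threshold=0.5)
# 'x1' survives the cut; 'noise' does not.
```

The reduced frame could then be handed to a search algorithm in place of the full 60-column dataset.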

If it's 1597997 samples, I calculate that it will take about 800 MB to store, so you may need to increase the heap size allocated to Java. (I can look up how to do that in py-tetrad.) In general, though, you don't need that many samples for tests and scores to converge; you could take a random subsample of, say, 5000 and increase the subsample size until you hit the limit.
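The subsampling suggestion above is a one-liner with pandas. A minimal sketch, using a smaller random stand-in for the real (1597997 x 60) dataset:

```python
import numpy as np
import pandas as pd

# Stand-in for the real dataset; shape reduced here so the sketch runs quickly.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(50_000, 10)))

# Draw a random subsample of 5,000 rows; increase n step by step
# until memory or runtime becomes the bottleneck.
sub = df.sample(n=5_000, random_state=0)

# If the full sample really is needed, the JVM heap can be raised by
# passing a JVM flag (e.g. "-Xmx8g") to jpype.startJVM() before any
# py-tetrad imports; the exact value to use is workload-dependent.
```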

jdramsey commented 8 months ago

There's a problem for mixed data as well (since you mentioned that). Take the case of fully multinomial data. In fact, assume for the nonce that it is all binary. If you use a test or score that builds multinomial tables, those tables could conceivably be very large. Say your conditioning set includes 10 variables; together with the variable being tested, that's 11 binary variables, so you'd need a table with 2^11 = 2048 rows. This is doable, but you can imagine how the problem could get out of hand. Also, you would need to do the actual counting for each such table--i.e., counting how many data points fall into each cell. This can take a very long time.
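The two costs described above--exponential table size and per-sample counting--can be made concrete with a small sketch (the encoding scheme here is illustrative, not Tetrad's actual implementation):

```python
import numpy as np

# Cells in a full contingency table over k binary conditioning
# variables plus the tested variable: 2 ** (k + 1).
# k = 2  ->     8 cells
# k = 5  ->    64 cells
# k = 10 ->  2048 cells
table_size = 2 ** (10 + 1)

# Counting is the other cost: every sample must be binned into a cell.
rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(100_000, 11))  # 11 binary variables

# Encode each row as a single integer cell index, then count.
weights = 2 ** np.arange(11)
cells = data @ weights
counts = np.bincount(cells, minlength=table_size)
```

With 1.5 million samples and larger conditioning sets, both the table allocation and the binning pass grow quickly, which is the scaling problem described above.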

This problem exists for mixed data as well in Tetrad if you use the conditional Gaussian score, since for the discrete variables you essentially need to construct tables like this. A possible workaround is to use the Degenerate Gaussian score, which introduces indicator variables for all but one category of each discrete variable and treats the problem as linear, automatically converting the result back into a graph over the variables in the dataset. This may help in your case; it is in any case more scalable in terms of sample size.
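The "all but one category" indicator construction mentioned above corresponds to drop-first dummy coding. A sketch with pandas on hypothetical data (this mirrors the encoding idea only; it is not the Tetrad implementation of the Degenerate Gaussian score):

```python
import pandas as pd

# Hypothetical mixed dataset: one continuous variable, one
# three-category discrete variable.
df = pd.DataFrame({
    "x": [0.1, 1.2, -0.3, 0.8],
    "c": pd.Categorical(["a", "b", "c", "a"]),
})

# Indicator variables for all but one category per discrete column:
# 'c' (categories a, b, c) becomes two 0/1 columns, c_b and c_c,
# with 'a' as the dropped reference category.
encoded = pd.get_dummies(df, columns=["c"], drop_first=True)
```

The encoded frame is fully numeric, so a linear (Gaussian-style) score can be applied to it, which is what makes the approach scale better in the number of samples.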

jdramsey commented 8 months ago

Oh, here's the paper on the Degenerate Gaussian score--you can see how it works.

Andrews, B., Ramsey, J., & Cooper, G. F. (2019, July). Learning high-dimensional directed acyclic graphs with mixed data-types. In The 2019 ACM SIGKDD Workshop on Causal Discovery (pp. 4-21). PMLR.

phyk commented 8 months ago

It is 1.5 million samples, not variables. To be precise, the error arose only for the MGM algorithm, just so you know.

Thanks for the insights; I will experiment with subsamples and tests.

jdramsey commented 8 months ago

@phyk Did you have any success? I meant to ask...

jdramsey commented 8 months ago

Looks like this issue is dead--I'll close it. If you have further questions, please open a new issue. :-) We're publishing a new version, 7.6.2, hopefully within a week or so (though don't hold me to that date; it's not just up to me).

phyk commented 8 months ago

Sorry that I did not respond. I am playing around with it and will share my results in this issue as soon as I get some, to document for future users.

jdramsey commented 8 months ago

Oh, thanks!!!!! :-D

phyk commented 8 months ago

First update: the setup I used is a dataframe with 1597997 rows and 60 columns. It consists of mixed data, and I wanted to try the MGM algorithm.

The initial try failed with the "java.lang.IllegalArgumentException: matrix too large" exception. I then tried reducing the number of rows (i.e., the number of samples) until the exception no longer occurred. For both 100_000 and 50_000 samples the exception was raised again; with 20_000 samples it has now been running for a while.

jdramsey commented 6 months ago

Sounds like this is in progress--I'll close it for now but feel free to re-open or make another issue.