Closed phyk closed 8 months ago
Is that 1597997 variables with 60 cases? Or is it transposed the other way? Either way, you're right; it is a bit large. Sixty variables are not problematic for py-tetrad. Here's a recent paper of ours that takes causal search accurately out to 1000 variables or more using the BOSS algorithm:
https://arxiv.org/abs/2310.17679
We are trying to think of ways to extend that (accurately). But 1597997 variables can't be represented; I don't know of any causal search algorithm with super-good stats that will handle that many variables in any case. Usually, what people do is pick a target variable or set of target variables and keep only the subset of variables that are correlated with one of those.
If it's 1597997 samples, I calculate that will take about 800 MB to store, so you may need to increase the heap size allocated to Java. (I can look up how to do that in py-tetrad.) In general, though, you don't need that many samples for tests and scores to converge; you could take a random subsample of, say, 5000 from that and increase the random subsample size until you hit the limit.
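As a rough check on the storage figure and the subsampling idea, here is a minimal sketch using only the Python standard library. It assumes every value is stored as a 64-bit double; the function names and the fixed seed are made up for illustration and are not part of py-tetrad.

```python
import random

N_ROWS, N_COLS = 1_597_997, 60   # shape reported in this issue
BYTES_PER_DOUBLE = 8             # assumption: every value stored as a 64-bit double


def dense_size_mb(n_rows, n_cols, bytes_per_value=BYTES_PER_DOUBLE):
    """Raw size in MB (10**6 bytes) of a dense n_rows x n_cols table."""
    return n_rows * n_cols * bytes_per_value / 1e6


def subsample_indices(n_rows, k, seed=0):
    """Sorted row indices of a random subsample of size k, drawn without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), min(k, n_rows)))


full_mb = dense_size_mb(N_ROWS, N_COLS)
idx = subsample_indices(N_ROWS, 5_000)
sub_mb = dense_size_mb(len(idx), N_COLS)
print(f"full data: ~{full_mb:.0f} MB; subsample: {len(idx)} rows, ~{sub_mb:.1f} MB")
```

The raw total comes out at about 767 MB, in line with the rough 800 MB estimate. With a pandas dataframe, `df.iloc[idx]` or `df.sample(n=5000, random_state=0)` would pick out the same kind of subsample before handing the data to py-tetrad.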
There's a problem for mixed data as well (since you mentioned that). Take the case of just fully multinomial data. In fact, just assume for the nonce that it is all binary. If you use a test or score that builds multinomial tables, those tables could conceivably be very large. Say your conditioning set includes 10 variables; then you'd need a table with 2^11 = 2048 rows. This is do-able, but you can imagine how the problem could get out of hand. Also, you would need to do the actual counting for each such table--i.e., counting how many data points go into each cell in the table. This can take a very long time.
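To make the table-size argument concrete, here is a small sketch of the arithmetic and of the counting pass. It is not Tetrad's implementation; the helper names and the toy data are hypothetical.

```python
from collections import Counter
from math import prod


def table_cells(cardinalities):
    """Number of cells in a contingency table over variables with these category counts."""
    return prod(cardinalities)


def count_cells(rows, columns):
    """One pass over the data: count how many rows fall into each occupied cell."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


# 10 binary conditioning variables plus one more binary variable, as in the text
cells_binary = table_cells([2] * 11)
# the same 11 variables with 4 categories each
cells_four = table_cells([4] * 11)

# tiny hypothetical dataset: 4 rows over 3 binary variables
toy_rows = [(0, 1, 0), (0, 1, 0), (1, 1, 0), (1, 0, 1)]
toy_counts = count_cells(toy_rows, columns=[0, 1, 2])
print(cells_binary, cells_four, dict(toy_counts))
```

The binary case gives 2048 cells, while four categories per variable already gives 4^11 = 4,194,304. Each table also costs a full pass over the rows, which is where the running time goes when there are 1.5 million of them.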
This problem exists for mixed data as well in Tetrad if you use the Conditional Gaussian score since, for the discrete variables, you essentially need to construct tables like this. A possible workaround is to use the Degenerate Gaussian score, which creates indicator variables for all but one category of each discrete variable and treats the problem as linear, automatically converting the result back into a graph over the variables in the dataset. This may help in your case; it is in any case more scalable in terms of sample size.
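The indicator-variable step can be sketched as follows. This only illustrates the embedding (one 0/1 column per category, with one category dropped); it is not the Degenerate Gaussian score itself and not Tetrad's code, and all names and the toy data are hypothetical.

```python
def indicator_columns(values):
    """Expand one discrete column into 0/1 indicator columns, dropping the first category."""
    cats = sorted(set(values))
    kept = cats[1:]  # the dropped category is implied when all indicators are 0
    return {f"is_{c}": [1.0 if v == c else 0.0 for v in values] for c in kept}


def embed(columns, discrete):
    """Return a purely numeric table: continuous columns unchanged, discrete ones expanded."""
    out = {}
    for name, values in columns.items():
        if name in discrete:
            for suffix, col in indicator_columns(values).items():
                out[f"{name}_{suffix}"] = col
        else:
            out[name] = [float(v) for v in values]
    return out


# toy mixed dataset: one continuous column, one discrete column with 3 categories
toy = {"age": [31, 45, 27, 52], "color": ["red", "green", "blue", "green"]}
embedded = embed(toy, discrete={"color"})
print(list(embedded))
```

A discrete variable with k categories becomes k - 1 numeric columns, so the width of the data grows with the number of categories rather than the table size growing exponentially with the size of the conditioning set.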
Oh, here's the paper on the Degenerate Gaussian score--you can see how it works.
Andrews, B., Ramsey, J., & Cooper, G. F. (2019, July). Learning high-dimensional directed acyclic graphs with mixed data-types. In The 2019 ACM SIGKDD Workshop on Causal Discovery (pp. 4-21). PMLR.
It is 1.5 million samples, not variables. To be precise, the error arose only for the MGM algorithm, just so you know.
Thanks for the insights. I will experiment with subsamples and tests.
@phyk Did you have any success? I meant to ask...
Looks like this issue is dead--I'll close it. If you have further questions, please open a new issue. :-) We're publishing a new version, 7.6.2, hopefully within a week or so (though don't hold me to that date; it's not just up to me).
Sorry that I did not respond. I am playing around with it and will share my results in this issue as soon as I get some, to document for future users.
Oh, thanks!!!!! :-D
First update: The setup I used is a dataframe with 1597997 rows and 60 columns. It consists of mixed data, and I wanted to try the MGM algorithm.
The initial try failed with the "java.lang.IllegalArgumentException: matrix too large" exception. Then I tried reducing the number of rows (i.e. the number of samples) until the exception no longer occurred. For both 100_000 and 50_000 samples the exception was raised again; with 20_000 samples it has now been running for a while.
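One possible reading of these numbers, offered purely as an unverified guess: if the exception is raised when a matrix would need more elements than a Java int can index, and if some intermediate matrix grows with the square of the sample size, then the observed pattern falls out of simple arithmetic. Nothing in this thread confirms what MGM actually allocates, so the sketch below only checks the arithmetic.

```python
from math import isqrt

JAVA_INT_MAX = 2**31 - 1  # largest value of a Java int: 2,147,483,647


def fits_in_int(rows, cols):
    """True if a rows x cols matrix has at most JAVA_INT_MAX elements."""
    return rows * cols <= JAVA_INT_MAX


N_VARS = 60
for n in (20_000, 50_000, 100_000, 1_597_997):
    # n x 60 is the data matrix itself; n x n is the hypothetical sample-by-sample matrix
    print(f"n={n:>9,}  n x {N_VARS}: {fits_in_int(n, N_VARS)}  n x n: {fits_in_int(n, n)}")

# largest n for which an n x n matrix stays within the int limit
largest_square = isqrt(JAVA_INT_MAX)
print(largest_square)
```

Under that assumption, the data matrix fits at every sample size tried, while an n x n matrix fits for 20_000 samples but not for 50_000 or 100_000, with the cutoff at 46,340 rows. Trying a subsample just below and just above that size would be a cheap way to test the guess.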
Sounds like this is in progress--I'll close it for now but feel free to re-open or make another issue.
I am trying out Tetrad through the py-tetrad library. The basic examples work fine; now I want to use a large dataset and run a DAG search algorithm for mixed data. The data has a shape of (1597997, 60), so I have a large number of samples. Is the dataset simply too large? What is the limit?
Thanks for your help