spficklin closed this issue 4 years ago
@spficklin I took a quick look at the corrpower analytic and I think the poor performance has to do with how the analytic is iterating through the input data. It is iterating through every pairwise index the same way as similarity, and trying to read a gene pair for each index. This results in a lot of unnecessary work because the correlation matrix is very sparse.
If you look at other analytics such as export correlation matrix, extract, or conditional test, you will see that they iterate directly through the gene pairs in the input data instead of testing every possible pairwise index. Of course, this makes it harder to split the work into uniform chunks, but it seems you figured out how to do that with conditional test, so I would refer to that analytic.
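To make the difference concrete, here is a minimal sketch in Python (not the project's actual code). It assumes the sparse correlation matrix can be modeled as a dictionary keyed by gene index pairs; the names `visit_all_indices` and `visit_stored_pairs` are hypothetical and exist only for illustration.

```python
from itertools import combinations


def visit_all_indices(n_genes, pairs):
    """Dense strategy: probe every pairwise index, most of which are empty."""
    probes, found = 0, []
    for i, j in combinations(range(n_genes), 2):
        probes += 1
        if (i, j) in pairs:
            found.append(((i, j), pairs[(i, j)]))
    return probes, found


def visit_stored_pairs(pairs):
    """Sparse strategy: walk only the pairs that are actually stored."""
    probes, found = 0, []
    for key in sorted(pairs):
        probes += 1
        found.append((key, pairs[key]))
    return probes, found


# Hypothetical toy input: 10 genes, 3 stored correlations.
sparse_pairs = {(0, 3): 0.91, (2, 7): -0.88, (5, 9): 0.95}

dense_probes, dense_found = visit_all_indices(10, sparse_pairs)
sparse_probes, sparse_found = visit_stored_pairs(sparse_pairs)
print(dense_probes, sparse_probes)
```

Both strategies return the same pairs, but the dense one performs a probe for every possible index, so its cost grows with the square of the gene count rather than with the number of stored pairs.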
Thanks @bentsherman, I'll try to adjust it and see if that fixes things.
I just pushed some changes that should resolve the issues you've been having with cond-test and corrpower. If you look at the code you will see that cond-test, corrpower, and similarity follow a similar pattern in terms of work blocks and result blocks.
Additionally, while similarity does iterate through every pairwise combination, cond-test and corrpower need only iterate through the pairs stored in the sparse CCM/CMX files. In this respect you will see that they follow the same pattern as import-cmx, export-cmx, and extract.
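As a rough illustration of how stored pairs might be divided into work blocks, the Python sketch below splits a pair count into contiguous ranges. It is an assumption about the general idea, not the code that was pushed; `make_work_blocks` and `block_size` are hypothetical names.

```python
def make_work_blocks(n_pairs, block_size):
    """Split n_pairs stored pairs into (start, size) blocks of at most block_size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    start = 0
    while start < n_pairs:
        size = min(block_size, n_pairs - start)
        blocks.append((start, size))
        start += size
    return blocks


# Hypothetical example: 10 stored pairs handed out in blocks of 4.
blocks = make_work_blocks(10, 4)
print(blocks)
```

Because the blocks are defined over the stored pairs rather than over all pairwise indices, every block carries roughly the same amount of real work, which is what makes the split suitable for distributing across workers.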
So now both cond-test and corrpower should work with MPI, and corrpower should be much faster. However, I don't have the necessary input data to test them thoroughly, so could you please have someone test them (including with MPI) and respond here with any issues they find? If everything works, I'll close out these issues.
Okay. I'll test it. Thanks.
This problem was fixed by the code that @4ctrl-alt-del added to PR #132, but that PR got closed. I'm assuming this fix was moved into the adjustments made by @bentsherman.
The corrpower filter, which removes edges that don't have sufficient power, is too slow. It's much slower than conditional filtering. Perhaps one way to speed it up is to keep a lookup table of the parameters, so that when an exact combination has already been computed, the result is read from the table rather than recomputed.
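The lookup-table idea could look something like the Python sketch below. It is only an illustration: the power formula is a stand-in based on the Fisher z approximation and is not claimed to be the calculation corrpower actually uses, and the names `correlation_power` and `filter_edges`, the edge tuple layout, and the default thresholds are all assumptions.

```python
from functools import lru_cache
from math import atanh, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


@lru_cache(maxsize=None)  # the lookup table: one entry per exact (r, n, alpha)
def correlation_power(r, n_samples, alpha=0.001):
    """Approximate power of a two-sided correlation test (Fisher z stand-in)."""
    if n_samples <= 3:
        return 0.0  # too few samples for the approximation
    if abs(r) >= 1.0:
        return 1.0  # atanh is undefined at +/-1; treat as fully powered
    z_crit = _NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _NORMAL.cdf(abs(atanh(r)) * sqrt(n_samples - 3) - z_crit)


def filter_edges(edges, min_power=0.8, alpha=0.001):
    """Keep edges (pair, r, n_samples) whose power meets the threshold."""
    return [
        (pair, r, n)
        for pair, r, n in edges
        if correlation_power(r, n, alpha) >= min_power
    ]


# Hypothetical edges: two share the same (r, n) and so reuse one table entry.
edges = [((0, 1), 0.9, 30), ((0, 2), 0.9, 30), ((1, 2), 0.1, 10)]
kept = filter_edges(edges)
print(kept, correlation_power.cache_info())
```

An exact-match table only helps when the same correlation value and sample count recur. If they rarely repeat exactly, rounding the correlation before the lookup would raise the hit rate at the cost of a small approximation error.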