Improve LD shrinkage efficiency

biona001 commented 3 months ago

Motivation

When a LD matrix is too large (e.g. >5000 variables), the function find_optimal_shrinkage becomes too slow.

Proposal

Rather than computing shrinkage on the entire correlation matrix, we can do so on just the group representative variables, since these variables are primary drivers for the dependencies across groups. This often reduce the LD matrix size by a factor of 10.

Potential issue

Often there are few representatives in a block (e.g. < 10). This can lead to large shrinkage values (e.g. >0.5) even though the LD matrix matches the Z-score exactly.

this PR

Now we compute shrinkage on the original matrix if the size is manageable (<1000 variables), otherwise we change to computing shrinkage value on the representatives.

Before merging

To check this approach is reasonable, we should re-run the Alzheimer's disease study to see if the shrinkage value is comparable to previously (which I think we around 0.01 to 0.02).

codecov-commenter commented 3 months ago

Codecov Report

Attention: Patch coverage is 62.96296% with 10 lines in your changes missing coverage. Please review.

Project coverage is 64.73%. Comparing base (6d00378) to head (1390806).

Files	Patch %	Lines
src/ghostbasil_parallel.jl	61.53%	10 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #35 +/- ## ========================================== - Coverage 65.09% 64.73% -0.37% ========================================== Files 5 5 Lines 679 689 +10 ========================================== + Hits 442 446 +4 - Misses 237 243 +6 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

biona001 commented 3 months ago

I tested this on the Alzheimer's disease study and the mean shrinkage value is 0.01582485377913878 after this PR (before it was 0.015295867015486209). This difference is small because only 5/1703 regions actually trigger the new code. Even so, 4/5 regions returned similar answers, so I'll go ahead and merge anyways

Before this PR:

region 693 / 1703 (f = LD_start29737971_end30798167.h5): chr 6, nz beta = 19, nsnps = 1454, shrinkage = 0.0529
region 694 / 1703 (f = LD_start30798168_end31571217.h5): chr 6, nz beta = 15, nsnps = 1991, shrinkage = 0.0345
region 695 / 1703 (f = LD_start31571218_end32682663.h5): chr 6, nz beta = 20, nsnps = 1732, shrinkage = 0.1506
region 696 / 1703 (f = LD_start32682664_end33236496.h5): chr 6, nz beta = 16, nsnps = 1028, shrinkage = 0.0294
region 1521 / 1703 (f = LD_start69387817_end72672202.h5): chr 17, nz beta = 14, nsnps = 1278, shrinkage = 0.004

After this PR:

region 693 / 1703 (f = LD_start29737971_end30798167.h5): chr 6, nz beta = 26, nsnps = 1454, shrinkage = 0.0849
region 694 / 1703 (f = LD_start30798168_end31571217.h5): chr 6, nz beta = 17, nsnps = 1991, shrinkage = 0.0927
region 695 / 1703 (f = LD_start31571218_end32682663.h5): chr 6, nz beta = 18, nsnps = 1732, shrinkage = 0.1565
region 696 / 1703 (f = LD_start32682664_end33236496.h5): chr 6, nz beta = 13, nsnps = 1028, shrinkage = 0.8383
region 1521 / 1703 (f = LD_start69387817_end72672202.h5): chr 17, nz beta = 8, nsnps = 1278, shrinkage = 0.0

biona001 / GhostKnockoffGWAS