Closed biona001 closed 3 months ago
Attention: Patch coverage is 62.96296%
with 10 lines
in your changes missing coverage. Please review.
Project coverage is 64.73%. Comparing base (
6d00378
) to head (1390806
).
Files | Patch % | Lines |
---|---|---|
src/ghostbasil_parallel.jl | 61.53% | 10 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I tested this on the Alzheimer's disease study and the mean shrinkage value is 0.01582485377913878
after this PR (before it was 0.015295867015486209
). This difference is small because only 5/1703 regions actually trigger the new code. Even so, 4/5 regions returned similar answers, so I'll go ahead and merge anyways
Before this PR:
region 693 / 1703 (f = LD_start29737971_end30798167.h5): chr 6, nz beta = 19, nsnps = 1454, shrinkage = 0.0529
region 694 / 1703 (f = LD_start30798168_end31571217.h5): chr 6, nz beta = 15, nsnps = 1991, shrinkage = 0.0345
region 695 / 1703 (f = LD_start31571218_end32682663.h5): chr 6, nz beta = 20, nsnps = 1732, shrinkage = 0.1506
region 696 / 1703 (f = LD_start32682664_end33236496.h5): chr 6, nz beta = 16, nsnps = 1028, shrinkage = 0.0294
region 1521 / 1703 (f = LD_start69387817_end72672202.h5): chr 17, nz beta = 14, nsnps = 1278, shrinkage = 0.004
After this PR:
region 693 / 1703 (f = LD_start29737971_end30798167.h5): chr 6, nz beta = 26, nsnps = 1454, shrinkage = 0.0849
region 694 / 1703 (f = LD_start30798168_end31571217.h5): chr 6, nz beta = 17, nsnps = 1991, shrinkage = 0.0927
region 695 / 1703 (f = LD_start31571218_end32682663.h5): chr 6, nz beta = 18, nsnps = 1732, shrinkage = 0.1565
region 696 / 1703 (f = LD_start32682664_end33236496.h5): chr 6, nz beta = 13, nsnps = 1028, shrinkage = 0.8383
region 1521 / 1703 (f = LD_start69387817_end72672202.h5): chr 17, nz beta = 8, nsnps = 1278, shrinkage = 0.0
Motivation
When a LD matrix is too large (e.g. >5000 variables), the function
find_optimal_shrinkage
becomes too slow.Proposal
Rather than computing shrinkage on the entire correlation matrix, we can do so on just the group representative variables, since these variables are primary drivers for the dependencies across groups. This often reduce the LD matrix size by a factor of 10.
Potential issue
Often there are few representatives in a block (e.g. < 10). This can lead to large shrinkage values (e.g. >0.5) even though the LD matrix matches the Z-score exactly.
this PR
Now we compute shrinkage on the original matrix if the size is manageable (<1000 variables), otherwise we change to computing shrinkage value on the representatives.
Before merging
To check this approach is reasonable, we should re-run the Alzheimer's disease study to see if the shrinkage value is comparable to previously (which I think we around 0.01 to 0.02).