FINNGEN / autoreporting

MIT License

add dynamic R2 threshold for LD clumping. #170

Closed Fedja closed 3 years ago

Fedja commented 3 years ago

The threshold should satisfy chisq(top variant) * r2 = 5, i.e. r2 = 5 / chisq(top variant) for each peak.

In Python you can get the chi-square statistic from a p-value with the inverse survival function:

from scipy.stats import chi2
chi2.isf(5e-8, df=1)  # about 29.7
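Putting the two together, a minimal sketch of the per-peak threshold could look like the following. The function name `dynamic_r2_threshold`, the `target_chisq` parameter, and the clamp to 1.0 are assumptions made for illustration, not something that exists in the repo.

```python
from scipy.stats import chi2


def dynamic_r2_threshold(lead_pval, target_chisq=5.0):
    """r2 cutoff such that chisq(lead variant) * r2 == target_chisq."""
    # Chi-square statistic (1 degree of freedom) from the lead variant's p-value.
    lead_chisq = chi2.isf(lead_pval, df=1)
    # Assumption: clamp to 1.0, since a weak lead (chisq < target) would give r2 > 1.
    return min(1.0, target_chisq / lead_chisq)


print(dynamic_r2_threshold(5e-8))   # roughly 0.17
print(dynamic_r2_threshold(1e-30))  # stronger peak, so a lower threshold
```

The stronger the lead variant, the lower the r2 cutoff, so strong peaks pull in more weakly linked variants.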

Lipastomies commented 3 years ago

This might need some changes to how the LD information is acquired. Currently, there are two ways to get it: using the imputation panel and PLINK, or using the LD server. Generally both give the same results, but their performance differs. PLINK has very high latency per run but high throughput, so it is best to bundle all queries into a single PLINK execution per chromosome. The server has low latency, but its throughput (how many queries it can answer over time) is relatively low.

Calculating LD clumping using the LD server is currently a bit unwieldy, because we fetch the LD information up front for every variant that could be a lead variant. That means a huge number of requests to the server, which takes time, and most of the information is never used, since many of the lead SNP candidates end up added to nearby groups. The same approach suits PLINK well, because its high latency is offset by the amount of information requested in one run. When we have a custom r^2 threshold per peak, we can approach this in two ways:

I would like us to be able to use the latter option, but then the PLINK LD calculation becomes very slow. We could choose the approach based on which LD API is in use, but that pretty much defeats the purpose of having a DAO for the LD API, since the implementation details would leak.

We could also add a "priming" step to the LD DAO, where you supply all of the candidate regions you're interested in. The server DAO would do nothing, but the PLINK DAO would preload all of the LD data (like it does now). Querying the primed LD DAO would then return the data relatively quickly: for PLINK because it is already computed and easily accessible, and for the server because it is quick to fetch. But it's quite clunky.
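To make the priming idea concrete, here is a rough sketch of what such an interface might look like. All class and method names (`LDAccess`, `ServerLD`, `BatchLD`, `prime`, `get_ld`) are hypothetical and do not reflect the actual DAO in the repo; the two callables stand in for an HTTP request and a PLINK run.

```python
from abc import ABC, abstractmethod


class LDAccess(ABC):
    """Hypothetical LD DAO interface with an optional priming step."""

    def prime(self, regions):
        """Hint which regions will be queried. Default: do nothing."""

    @abstractmethod
    def get_ld(self, variant, r2_min):
        """Return {other_variant: r2} for partners with r2 >= r2_min."""


class ServerLD(LDAccess):
    """Low-latency backend: prime() is inherited as a no-op."""

    def __init__(self, fetch):
        self._fetch = fetch  # stand-in for one server request per variant

    def get_ld(self, variant, r2_min):
        return {v: r for v, r in self._fetch(variant).items() if r >= r2_min}


class BatchLD(LDAccess):
    """High-latency backend (PLINK-like): compute everything once in prime()."""

    def __init__(self, compute_batch):
        self._compute_batch = compute_batch  # stand-in for one batched run
        self._cache = {}

    def prime(self, regions):
        self._cache = self._compute_batch(regions)  # {variant: {other: r2}}

    def get_ld(self, variant, r2_min):
        return {v: r for v, r in self._cache.get(variant, {}).items() if r >= r2_min}


# Toy data; the caller uses the same call sequence for both backends.
toy = {"v1": {"v2": 0.8, "v3": 0.05}}
server = ServerLD(lambda variant: toy.get(variant, {}))
batch = BatchLD(lambda regions: toy)
for dao in (server, batch):
    dao.prime([("1", 1_000_000, 2_000_000)])
    print(dao.get_ld("v1", r2_min=0.1))  # {'v2': 0.8}
```

The caller never needs to know which backend it has, which keeps the implementation details behind the DAO, at the cost of an extra call that does nothing for one of the backends.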

The quick implementation for now would be to not think about that and pick the first option, i.e. set the r^2 threshold to 0, get everything, and then filter when clumping.
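As a rough illustration of "get it all, then filter when clumping", the sketch below runs a greedy clumping pass over LD data that was fetched once without any r^2 cutoff. The function names, the dict-based data layout, and the toy variants are assumptions for the example; a real implementation would also restrict clumps by distance, which is left out here.

```python
from scipy.stats import chi2


def dynamic_r2_threshold(lead_pval, target_chisq=5.0):
    """r2 cutoff such that chisq(lead) * r2 == target_chisq, clamped to 1.0."""
    return min(1.0, target_chisq / chi2.isf(lead_pval, df=1))


def clump(pvals, ld, r2_threshold_for):
    """Greedy LD clumping with a per-lead r2 threshold.

    pvals: {variant: p-value}
    ld: {(variant_a, variant_b): r2}, fetched once with r2 threshold 0
    r2_threshold_for: function mapping a lead p-value to its r2 cutoff
    """
    def r2(a, b):
        # Pairs may be stored in either order; missing pairs count as r2 = 0.
        return ld.get((a, b), ld.get((b, a), 0.0))

    remaining = sorted(pvals, key=pvals.get)  # most significant first
    groups = {}
    while remaining:
        lead = remaining.pop(0)
        cutoff = r2_threshold_for(pvals[lead])  # filter at clumping time
        members = [v for v in remaining if r2(lead, v) >= cutoff]
        groups[lead] = members
        remaining = [v for v in remaining if v not in members]
    return groups


pvals = {"v1": 1e-30, "v2": 1e-9, "v3": 1e-8, "v4": 1e-12}
ld = {("v1", "v2"): 0.10, ("v1", "v3"): 0.02, ("v1", "v4"): 0.01, ("v4", "v3"): 0.5}
groups = clump(pvals, ld, dynamic_r2_threshold)
print(groups)  # {'v1': ['v2'], 'v4': ['v3']}
```

Because the threshold is applied only inside `clump`, the LD fetch itself does not need to know about peaks at all.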