cggh / scikit-allel

A Python package for exploring and analysing genetic variation data

Dask implementation of locate_unlinked() #44

Open · alimanfoo opened this issue 8 years ago

alimanfoo commented 8 years ago

The locate_unlinked() function is a bottleneck for various analyses. It should be possible to write a multi-threaded implementation using dask.
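As a starting point, here is a minimal sketch of what a coarse-grained dask version could look like (the function name locate_unlinked_blockwise is hypothetical; it simply applies the existing allel.locate_unlinked to each chunk of variants, so linkage spanning chunk boundaries is ignored):

```python
import allel
import dask.array as da

def locate_unlinked_blockwise(gn, size=100, step=20, threshold=0.1):
    """Apply allel.locate_unlinked independently to each chunk of variants.

    gn: dask array of alternate allele counts, shape (n_variants, n_samples),
    chunked along the first (variants) axis only. Returns a 1-D boolean
    dask array marking variants to keep.
    """
    gn = da.asarray(gn)
    return gn.map_blocks(
        allel.locate_unlinked,
        size=size, step=step, threshold=threshold,
        drop_axis=1, dtype=bool,
    )
```

Each chunk would then be pruned in parallel by the dask scheduler, at the cost of missing any linkage that straddles a chunk boundary; overlapping chunks by roughly `size` variants could soften that.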

eric-czech commented 4 years ago

Hey @alimanfoo, I've been working out some of the details for doing this on a CPU and GPU and I was wondering if you had any thoughts on these:

Basically, I'd like to nail down what the crucial features of this are. A recent experiment with a GPU-based implementation (ld_prune/tsgpu_backend.py, mentioned in https://github.com/related-sciences/gwas-analysis/issues/26, which builds on some of your skallel-stats cuda.jit examples) showed great improvements over attempts to numba.jit a CPU implementation of locate_unlinked on chunked arrays. I'd love to get your thoughts/preferences on what an ideal version of this should do.

alimanfoo commented 4 years ago

Hey @eric-czech, forgive me if I haven't fully grokked everything here, I might need to ask you to walk me through this next time we're on a call (we should have one soon).

FWIW I've always thought of this as two problems: (1) having an algorithm based on sequentially scanning through a genome - the results of computation on one window depend on results from the previous window - makes coarse-grained (chunk-based) parallelism impossible, and (2) ~~needing to compute r2 within each window means that you have a computation that scales with the square of the number of samples, which is obviously a problem for very large cohorts, even if you can run on a GPU~~.

But the PLINK-style algorithm is very heuristic anyway; the goal is just to reduce the impact of linkage, and you'll never eliminate it completely. So some approximations or alternative heuristics ought to be OK IMHO.

FWIW I think it would be reasonable to conceive of an LD pruning algorithm that computed results for each window independently, rather than scanning along the genome in overlapping windows. It wouldn't remove linkage as thoroughly, but maybe that's acceptable for some applications at least.

If you have no dependency between windows, then the computation within a window becomes simpler. It would mean that the order in which you choose to perform comparisons between variants doesn't matter. Choosing to keep the variant with the lower genome position as in your implementation is maybe fine.
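A minimal sketch of what such a window-independent heuristic could look like (the function name locate_unlinked_independent_windows is hypothetical; it leans on allel.rogers_huff_r and scipy's squareform, and drops the higher-position variant of every linked pair):

```python
import numpy as np
import allel
from scipy.spatial.distance import squareform

def locate_unlinked_independent_windows(gn, size=100, threshold=0.1):
    """Prune within non-overlapping windows, ignoring linkage across windows.

    gn: array of alternate allele counts, shape (n_variants, n_samples).
    Returns a boolean array marking variants to keep.
    """
    gn = np.asarray(gn)
    n_variants = gn.shape[0]
    loc = np.ones(n_variants, dtype=bool)
    for start in range(0, n_variants, size):
        stop = min(start + size, n_variants)
        if stop - start < 2:
            continue
        # pairwise r within the window, expanded to a square matrix of r**2
        r = np.atleast_1d(allel.rogers_huff_r(gn[start:stop]))
        r2 = squareform(r) ** 2
        for i in range(stop - start):
            for j in range(i + 1, stop - start):
                if r2[i, j] > threshold:
                    loc[start + j] = False  # always drop the later variant
    return loc
```

Because each window is processed independently, the windows could be farmed out to dask workers or a GPU kernel without any sequential dependency.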

Although it might be worth working through what happens when, say, you have three variants A, B and C in genome order, with r2(A, B) = 0.3, r2(B, C) = 0.3, r2(A, C) = 0.1. If you set your threshold to 0.2, then the original scanning algorithm compares (A, B), eliminates B, but keeps C. Your algorithm (I think) would eliminate B and C. Again maybe that's fine, but maybe worth understanding.
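Working through that example in code (pure illustration, using the r2 values and the 0.2 threshold above):

```python
# r**2 values for the three variants A, B, C in genome order
r2 = {("A", "B"): 0.3, ("B", "C"): 0.3, ("A", "C"): 0.1}
variants = ["A", "B", "C"]
threshold = 0.2

# Sequential scanning: a variant is only compared against variants kept so far.
kept = []
for v in variants:
    if all(r2[tuple(sorted((u, v)))] <= threshold for u in kept):
        kept.append(v)
print(kept)  # ['A', 'C'] -- B eliminated, C retained

# Independent pairwise elimination: drop the later variant of every linked pair.
eliminated = {b for (a, b), value in r2.items() if value > threshold}
print([v for v in variants if v not in eliminated])  # ['A'] -- B and C eliminated
```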

~~Also FWIW I think it would be reasonable to randomly downsample individuals, to avoid trying to compute directly on massive cohorts.~~

Re bp-windows versus variant windows, I suspect either approach is OK. What you ideally want is windows of equal sized genetic distance (i.e., recombination length), but that's more hassle than it's worth here IMHO. Again this is just a heuristic.

alimanfoo commented 4 years ago

Edited above comment, wasn't thinking straight re how computation within a window scales, apologies.

alimanfoo commented 4 years ago

> What you ideally want is windows of equal sized genetic distance (i.e., recombination length), but that's more hassle than it's worth here IMHO. Again this is just a heuristic.

Also wasn't thinking straight here: if you had a recombination map then you could probably do away with LD pruning altogether. Since recombination maps are built from LD information, you could just downsample variants to get roughly equal genetic distance between them.
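A rough sketch of that idea, assuming you already have the cumulative genetic position (in cM) of each variant from a genetic map (the function name and the step_cm default are illustrative only):

```python
import numpy as np

def downsample_by_genetic_distance(genetic_pos, step_cm=0.05):
    """Select roughly one variant per step_cm of genetic distance.

    genetic_pos: sorted 1-D array of cumulative genetic positions in centimorgans.
    Returns a boolean array marking the variants to keep.
    """
    genetic_pos = np.asarray(genetic_pos)
    loc = np.zeros(genetic_pos.shape[0], dtype=bool)
    next_pos = genetic_pos[0]
    for i, pos in enumerate(genetic_pos):
        if pos >= next_pos:
            loc[i] = True
            next_pos = pos + step_cm
    return loc
```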

eric-czech commented 4 years ago

Sounds great, I'll talk to Jeff and try to get something on our schedules. Thanks for taking a little time to think about this with me. To your points:

> how computation within a window scales

For posterity, and because I was thinking about it when doing the thread indexing, I believe my version, PLINK, and the current locate_unlinked would all have worst-case characteristics like:

m = num variants
n = num samples
w = window size
s = step size
intsum(x) = x * (x + 1) / 2

then for all w >= s,

runtime complexity = (intsum(w) - intsum(w - s)) * 2n * (m / s) = O(wnm)
* worst case is when s = 1

I'll add that to the docs somewhere and hopefully that squares with what you're thinking.
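To make that count concrete, here's a small restatement of the formula as code (approx_work is just an illustrative helper, not a proposed API):

```python
def intsum(x):
    # integer sum: 1 + 2 + ... + x
    return x * (x + 1) // 2

def approx_work(m, n, w, s):
    # new pairwise comparisons introduced per step of the sliding window,
    # each touching ~2n genotype values, repeated ~m/s times along the genome
    comparisons_per_step = intsum(w) - intsum(w - s)
    return comparisons_per_step * 2 * n * (m // s)

# worst case s=1: each variant is compared with the next w variants,
# so total comparisons ~= w * m and total work is O(w * n * m)
print(approx_work(m=100_000, n=1_000, w=100, s=1))
```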

> say, you have three variants A, B and C in genome order, with r2(A, B) = 0.3, r2(B, C) = 0.3, r2(A, C) = 0.1. If you set your threshold to 0.2, then the original scanning algorithm compares (A, B), eliminates B, but keeps C. Your algorithm (I think) would eliminate B and C.

Yep, assuming MAF was not provided, that's exactly what it would do -- so it is more aggressive and would generally eliminate a larger number of variants. It's good to know you think of these approaches as fairly flexible heuristics, because that runs counter to the more dogmatic views I've seen on LD pruning. Personally, I have yet to see any empirical study on why something like identifying maximal independent sets within networks of correlated variants improves PCA or kinship estimation (or whatever else is downstream), but I like the idea of having a toolkit where fast, simple methods exist alongside more sophisticated, slower ones, so that it's easy to do a sensitivity analysis on data subsets and then decide whether to scale out with the fast approach or spend $ on the slow one.
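For reference, a tiny sketch of what the "maximal independent set" style of pruning could look like (hypothetical helper built on networkx; it assumes you already have a square matrix of pairwise r**2 for one window):

```python
import networkx as nx
import numpy as np

def prune_by_maximal_independent_set(r2, threshold=0.1, seed=0):
    """Keep a maximal set of mutually unlinked variants within one window.

    r2: square matrix of pairwise r**2 values.
    Returns a boolean array marking the variants to keep.
    """
    n = r2.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    # add an edge for every pair of variants exceeding the r**2 threshold
    i, j = np.triu_indices(n, k=1)
    linked = r2[i, j] > threshold
    g.add_edges_from(zip(i[linked], j[linked]))
    keep = nx.maximal_independent_set(g, seed=seed)
    loc = np.zeros(n, dtype=bool)
    loc[list(keep)] = True
    return loc
```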

My tentative plan is to have a complementary CPU backend for this operation that lives on the other end of the spectrum, with more features and probably higher space and time complexity (especially if it tries to match Hail). I'd love to get your thoughts on this.

> Also FWIW I think it would be reasonable to randomly downsample individuals, to avoid trying to compute directly on massive cohorts.

I meant to ask you if that's something you do, because it seems reasonable to me and I have a hard time convincing myself whether or not it makes sense to support chunked arrays in both dimensions vs compressing information along the sample dimension. What made you cross it out?

alimanfoo commented 4 years ago

> For posterity, and because I was thinking about it when doing the thread indexing, I believe my version, PLINK, and the current locate_unlinked would all have worst-case characteristics like:
>
> m = num variants
> n = num samples
> w = window size
> s = step size
> intsum(x) = x * (x + 1) / 2
>
> then for all w >= s,
>
> runtime complexity = (intsum(w) - intsum(w - s)) * 2n * (m / s) = O(wnm)
> * worst case is when s = 1

If within each window you are computing the correlation between all pairs of variants, then doesn't complexity scale with the square of the window size?

> say, you have three variants A, B and C in genome order, with r2(A, B) = 0.3, r2(B, C) = 0.3, r2(A, C) = 0.1. If you set your threshold to 0.2, then the original scanning algorithm compares (A, B), eliminates B, but keeps C. Your algorithm (I think) would eliminate B and C.
>
> Yep, assuming MAF was not provided, that's exactly what it would do -- so it is more aggressive and would generally eliminate a larger number of variants. It's good to know you think of these approaches as fairly flexible heuristics, because that runs counter to the more dogmatic views I've seen on LD pruning. Personally, I have yet to see any empirical study on why something like identifying maximal independent sets within networks of correlated variants improves PCA or kinship estimation (or whatever else is downstream), but I like the idea of having a toolkit where fast, simple methods exist alongside more sophisticated, slower ones, so that it's easy to do a sensitivity analysis on data subsets and then decide whether to scale out with the fast approach or spend $ on the slow one.

FWIW that sounds good to me. In our mosquito data we have much bigger issues that can affect analysis of population structure (e.g., Figure 1b), like large chromosomal inversions, or qualitative differences in genealogies within pericentromeric regions, or genome regions under strong recent selection, or long-range LD in populations that have experienced bottlenecks compared with others that have expanded. Also, recombination rate is much more even across the genome; we don't have recombination hotspots and large LD blocks like in humans. So LD pruning to deal with variations in recombination rate over the genome tends to be the least of our worries :-) Hence I personally would be happy to use simpler heuristics and gain scalability. But I am not a human geneticist and have been thinking about these issues for a fraction of the time that many human genetics folks have, so I totally respect other views.

> My tentative plan is to have a complementary CPU backend for this operation that lives on the other end of the spectrum, with more features and probably higher space and time complexity (especially if it tries to match Hail). I'd love to get your thoughts on this.

That sounds very reasonable FWIW.

> Also FWIW I think it would be reasonable to randomly downsample individuals, to avoid trying to compute directly on massive cohorts.
>
> I meant to ask you if that's something you do, because it seems reasonable to me and I have a hard time convincing myself whether or not it makes sense to support chunked arrays in both dimensions vs compressing information along the sample dimension. What made you cross it out?

I only crossed it out because I got the complexity wrong; it's less of an issue if complexity scales linearly with the number of samples, but it's still worth talking about.

One question I wonder about is whether you might need to downsample individuals in a way that is aware of population structure or not. LD patterns might be different between different populations, and so the effect of LD pruning might be different depending on which populations you have represented and how many samples you have from each. Population structure itself also creates apparent LD, and so that could affect results.

Again, if all you need is a simple heuristic, and you're working with organisms where recombination rate is fairly even over the genome, then this may be less of an issue. I.e., just randomly downsampling your cohort - whatever it contains - might be OK. But it would be interesting to discuss with folks who've thought more deeply.

eric-czech commented 4 years ago

> If within each window you are computing the correlation between all pairs of variants, then doesn't complexity scale with the square of the window size?

Not as long as the code makes sure to skip comparisons in overlapping windows. I saw locate_unlinked has that, assuming I read it correctly, and mine does too (albeit in very different ways). The simplest way I could think of it: when step=1, each variant is compared with the next w variants, for all m variants. When the step is > 1 there's always less work to do.
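A tiny way to convince yourself of this (pure illustration, nothing from the actual implementations): enumerate every pair that any overlapping window would generate and compare it with the 'each variant vs. the next w variants' rule.

```python
def naive_pairs(m, w):
    # every pair within every overlapping window (step=1), de-duplicated
    pairs = set()
    for start in range(m):
        window = range(start, min(start + w, m))
        pairs.update((i, j) for i in window for j in window if i < j)
    return pairs

def skip_overlap_pairs(m, w):
    # each variant compared only with the next w - 1 variants
    return {(i, j) for i in range(m) for j in range(i + 1, min(i + w, m))}

# the two sets are identical, and their size grows as ~w*m, not ~w**2 * m
assert naive_pairs(200, 25) == skip_overlap_pairs(200, 25)
print(len(skip_overlap_pairs(200, 25)))
```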

> Also, recombination rate is much more even across the genome; we don't have recombination hotspots and large LD blocks like in humans. So LD pruning to deal with variations in recombination rate over the genome tends to be the least of our worries.

I see. Forgive me for not having read more of them yet, but do any of your papers draw comparisons like this to other organisms (outside of humans and mosquitos)? I'm wondering how many non-human geneticists are in a similar situation.

> One question I wonder about is whether you might need to downsample individuals in a way that is aware of population structure or not

Makes sense, perhaps there is some precedent for stratified sampling on self-reported ancestry, or a way to do it with fast genealogy estimation methods (e.g. tsinfer). I'll keep an eye out for related work and maybe start a discourse post if anything interesting turns up.
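If it helps the discussion, here is a minimal sketch of the stratified flavour (assuming a per-sample population or ancestry label is available; the function name and defaults are made up):

```python
import numpy as np

def stratified_downsample(pop_labels, k, seed=42):
    """Return sorted indices selecting at most k samples per population."""
    rng = np.random.default_rng(seed)
    pop_labels = np.asarray(pop_labels)
    keep = []
    for pop in np.unique(pop_labels):
        idx = np.flatnonzero(pop_labels == pop)
        if idx.size > k:
            idx = rng.choice(idx, size=k, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))
```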

alimanfoo commented 4 years ago

> If within each window you are computing the correlation between all pairs of variants, then doesn't complexity scale with the square of the window size?
>
> Not as long as the code makes sure to skip comparisons in overlapping windows. I saw locate_unlinked has that, assuming I read it correctly, and mine does too (albeit in very different ways). The simplest way I could think of it: when step=1, each variant is compared with the next w variants, for all m variants. When the step is > 1 there's always less work to do.

Ah yes, I see, thanks. I forgot you could skip the comparisons in overlapping windows; yes, locate_unlinked does that. Been a while since I looked at that code.

> Also, recombination rate is much more even across the genome; we don't have recombination hotspots and large LD blocks like in humans. So LD pruning to deal with variations in recombination rate over the genome tends to be the least of our worries.
>
> I see. Forgive me for not having read more of them yet, but do any of your papers draw comparisons like this to other organisms (outside of humans and mosquitos)? I'm wondering how many non-human geneticists are in a similar situation.

No, sorry, we've been pretty mosquito-centric here. I know Drosophila is fairly similar: no major evidence for recombination hotspots, at least on the scale seen in humans. Plasmodium parasites also have very even recombination rates through the genome. I suspect you'll see all kinds of crazy stuff in different parts of the tree of life :)

> One question I wonder about is whether you might need to downsample individuals in a way that is aware of population structure or not
>
> Makes sense, perhaps there is some precedent for stratified sampling on self-reported ancestry, or a way to do it with fast genealogy estimation methods (e.g. tsinfer). I'll keep an eye out for related work and maybe start a discourse post if anything interesting turns up.

Sounds good, thanks.

eric-czech commented 4 years ago

> Been a while since I looked at that code

Impressive that you remembered this much about a 5-year-old issue! Thanks for revisiting it with me.