Allow linkage disequilibrium computations to be done from phased input data

nabil161289 commented 10 months ago

A clear description of your feature request

Calculation of linkage disequilibrium via phased data and comparing phased and unphased LD via a Sign test. Also for 2-locus haplotypes, calculation of LD statistics via phased data if unphased data is being currently used.

Describe the solution you'd like

The estimates of LD (D, D prime, Wn, and the 2-locus pairwise LD statistics D and D prime) to be calculated via phased data, together with the correlation coefficient (r^2).

Describe alternatives you've considered

Using Pould package in R

Additional context

No response

alexlancaster commented 10 months ago

Thanks again for the report, @nabil161289 . I think the long-term plan is to deprecate [Emhaplofreq] in favour of [Haplostats]. Haplostats is a wrapper around the R package haplo.stats. There is already an alpha version of Haplostats (undocumented) already contained within PyPop, but it is not yet ready for production release.

At a glance, the upstream R package does mention phased genotypes, but I'm not whether we're currently including functionality, and if it's easy or even possible to do that. @rsingle may know more about that.

alexlancaster commented 10 months ago

I think that computing the LD measures directly from the phased input will have to wait until we finish off the Haplostats module (which includes the wrapper around haplo.stats). The idea is that haplo.stats C code will be used solely do the EM haplotype estimation, and the rest of the LD computations currently done inside the emhaplofreq program will be moved into Python.

Once that is done, it will be easier to have the LD computations work off phased input directly from the input data, or input out of haplo.stats, by using the same Python to do both. Haplostats is partly implemented, but it's not completely there, and this would be another enhancement on top of what we already planned. All of this is fairly non-trivial, so it may be a while.

sjmack commented 10 months ago

In theory, since phased LD calculations rely only on counts of phased variants, which in PyPop TSV format would be identified as (for two alleles of a 3 locus haplotype for two subjects):

Row X (locus1_1*allele & locus2_1*allele & locus3_1*allele) and (locus1_2*allele & locus_2*allele & locus3_2*allele)
Row Y (locus1_1*allele & locus2_1*allele & locus3_1*allele) and (locus1_2*allele & locus_2*allele & locus3_2*allele)

and using GL String format would already be formatted as (for two subjects):

> Subject X locus1_1*allele~locus2_1*allele~locus3_1*allele+locus1_2*allele~locus_2*allele~locus3_2*allele
> Subject Y locus1_1*allele~locus2_1*allele~locus3_1*allele+locus1_2*allele~locus_2*allele~locus3_2*allele

all that would be needed (using the Haplotype Specific Homozygosity approach to calculating ALD) would be to count the variants that have already been entered into PyPop, because no EM is needed to generate the counts/frequencies. So phased LD should be easier to calculate than unphased LD.

alexlancaster / pypop