fgvieira / ngsDist

Estimation of pairwise distances under a probabilistic framework
GNU General Public License v3.0
10 stars 7 forks source link

More information about how pairwise deletion of missing data works? #15

Open TeresaPegan opened 2 years ago

TeresaPegan commented 2 years ago

Hello! I am interested in learning more about this parameter: --pairwise_del: pairwise deletion of missing data. I have noticed that using it has a very large influence on my results, but I can't find any more information about what it actually does. I assume it ignores sites where one or both samples have missing data? (Not just sites where both have missing data?)

How does it determine which data are "missing," within the genotype likelihood framework where all sites have likelihoods?

Thanks! -Teresa

fgvieira commented 2 years ago

By default ngsDist uses all sites.; that option ignores comparisons where one or both samples have missing data. It basically checks if the three likelihoods are the same (within a given margin).

TeresaPegan commented 2 years ago

Thanks! That answers my question nicely. I actually have a follow up question though: I was wondering whether, when using the pairwise_del option, it is possible to find out how many sites ngsDist discarded because there was missing data? (Or conversely, how many sites got used?)

The reason I ask is that I had been trying to run ngsDist on different chunks of the genome at the same time, to cut down on the time and memory requirements of any given job. In other words, I split my input beagle into many chunks and ran a job on each one independently. My logic was that if I run ngsDist on beagles with exactly equal numbers of rows (SNPs), then I could take the matrix produced by each job and average them to get the genome-wide genetic distance.

This does not work with the pairwise_del option, though, because different pairs and different chunks of the genome will all end up with different numbers of SNPs, so if I average a bunch of matrices, they are all improperly weighted and give incorrect results (the results are heavily correlated with how many sites are shared between each pair). If I could figure out how many sites were actually used by each pair in each job, that would give me the ability to apply the weights properly when combining the matrices.

I'll try to see if I can get the program to run on my entire dataset at once in a reasonable amount of time, which is obviously the simplest solution. At the same time, do you have any thoughts about whether it would be feasible to get information about how many sites get used when the --pairwise_del option is used?

Thanks! -Teresa

TeresaPegan commented 2 years ago

Actually, I was able to get ngsDist to run without breaking it into chunks, but I am still seeing an extremely tight correlation between the number of sites shared in common between two individuals and the resulting genetic distance. I am attaching a plot. Do you know what to make of this?

It seems like there is some kind of bias that gives lower genetic distance to pairs of individuals that had more sites covered. I don't really think there are any biologically plausible explanations for this...all of my data come from one very well-mixed species. I aligned them all to the same reference genome, which is a congener species (i.e. the reference genome is not much more closely related to some individuals than others). I don't really think I can rely on ngsDist results that appear to be so influenced by what sites each member of a pair have covered....

I cannot simply filter to include only sites that are covered by all individuals -- I am using low-coverage shotgun data and that would leave me with very few sites, most of which would probably be influenced by mapping errors (repetitive regions increasing the likelihood that the site is covered for literally every individual). I have already filtered my input data to include only SNP loci that are covered in at least 50% of individuals and loci where the minimum MAF is 5%.

image

(In the attached image, the x axis actually corresponds to the size (in mb) of a file that lists all of the sites the two species have in common, not an absolute measure of how many sites they share -- just a quick way for me to get a sense of that without having to wc -l about 800 pairwise site coverage files.)

Command to ngsDist:

$NGSDIST/ngsDist \
--geno $IN \
--probs \
--n_ind $NIND \
--n_threads 10 \
--n_sites $NSITES \
--pairwise_del \
--out $DISTDIR/$SP"_"$SELECT"_ngsDist_pairdel_fullgenome"
fgvieira commented 2 years ago

That is indeed strange. How did you generate the beagle file? Do you also see this trend without --pairwise_del?

TeresaPegan commented 2 years ago

Hmm, now I see that this pattern exists without --pairwise_del as well. image

I generated the beagle file through a multi-step process. I originally created a beagle file of my entire genome in ANGSD like this: $ANGSDIR/angsd -b $BAMS -out $OUT -doMaf 1 -doGlf 2 -doMajorMinor 1 -anc $ANC \ -minMapQ 30 -minQ 30 -doPost 2 -GL 2 -dosnpstat 1 -nThreads 4 -doCounts 1 -doDepth 1 -dumpCounts 2 \ -doHWE 1 -maxHetFreq 1.0 -SNP_pval 1.0

This provides me with information I need to filter the beagle file without actually discarding any sites, so that I don't have to re-do the lengthy ANGSD GL calculation every time I want to tweak a filter. I used awk to create a filtered beagle file that only includes SNP loci, ran that dataset through ngsParalog, and then used awk to filter the SNP beagle to get rid of SNP sites that ngsParalog suggests are mismapped. All of these filters are influencing which loci get included in the beagle file, but all individuals end up with the same number of sites. (Except that some have missing data, of course, when they have no reads at certain sites). I used the resulting beagle file with ngsDist.

This leads to another question: is it inappropriate to run ngsDist on only variant (SNP) sites, as I am doing here? I know if affects the interpretation of what the raw genetic distance number means, but should the relative values be trustworthy otherwise? The reason I have been using only SNP sites is simply that it's computationally difficult to run the program on the entire 1 GB genome. But maybe this is problematic.

I will try running ngsDist on one of my chromosomes including all sites (variant and invariant) and see what happens!

Thanks so much for your help! -Teresa

TeresaPegan commented 2 years ago

It looks like the problem is very much reduced in this example run I did with a single chromosome, not filtering to include only SNP loci first. (Although it appears there is still something of a negative trend). It seems like I should not be running the program on data I have filtered to include only SNP sites.

image

TeresaPegan commented 2 years ago

Another update: here is a comparison between pairwise deletion and no pairwise deletion for a much larger chromosome than the one I showed in my previous plot. In this case I also did no filtering of the chromosome before running the program, it includes invariant sites.

As you can see, the correlation between genetic distance and cogenotyping rate is still very strong when I don't use the pairwise delete option. There is also a negative trend when I do use pairwise delete, but less extreme.

I will definitely stick with using all sites (not just SNPs) and using pairwise_del from now on!

Someone recently pointed me toward this paper about a program designed to help alleviate biases in estimates of pi introduced by missing data, such as those described in the paper's figure 1. I was wondering if these biases might be the reason why this strong pattern between genetic distance and cogenotyping exists when pairwise_del is not used. Does ngsDist (as a program designed for genotype likelihoods) account for these biases that come from missing data? Perhaps it only does so when the pairwise_del function is used?

Thanks! -Teresa

image

fgvieira commented 2 years ago

Hi Teresa,

first of all thanks for your in depth analysis and by sharing it here. 😄

Do your samples have similar coverages, or do you have samples with higher coverage than others? What is your samples average coverage?

From what you are saying, and just to sum up, it seems that: pairwise del all sites snps only
no strong negative correlat strong negative correlat
yes slight neg correl strong negative correlat

Where negative correlation means: more sites in common (prob due to higher coverage) -> lower genetic distance

The distance between two positions where at least one is missing data will tend to be higher than zero. If so, with no pairwise deletion, samples with few sites in common (high missing data) will have a higher fraction of these sites, leading to an overestimation of the genetic distance, specially with all sites (where most sites are most likely not variable).

With pairwise deletion, the only thing I can think of, is that samples with few sites in common (high missing data) will have their distance based on considerably fewer sites, specially with snps only (where there are already fewer sites to start with).

TeresaPegan commented 2 years ago

Hi, thanks for looking into this! My samples do have coverage variation, but it is not very extreme. Most of the samples are sequenced at approximately 3-5x, though I do have outliers that sequenced at more like 2x, and up to about 10x.

The table you included above to summarize my situation is accurate!

Given that it seems important not to restrict the analysis to SNP sites, I will probably make use of the advice you mentioned here (https://github.com/fgvieira/ngsDist/issues/8 and #11) for breaking up the job to reduce memory requirements per-job.

Thanks! -Teresa

TeresaPegan commented 2 years ago

Ah, but actually I cannot use the --tot_sites option in combination with pairwise delete, which I guess makes sense:

cannot specify total number of sites (--tot_sites) with pairwise deletion (--pairwise_del)!

This strongly limits the size of the dataset I can use to get calculate genetic distance. I recently tried on a 70mb chromosome and it was failing out of memory after over an hour when I gave it 150GB of RAM. I will probably realistically be restricted to something like a 30-50 mb chunk.

In most cases ~30-50mb is probably enough to get some kind of sense of genetic distance between individuals, but it's also an extremely tiny fraction of the data I have. If there are any other possibilities for ways to get a mean from the output across different runs of ngsDist, please let me know! Would it be possible/feasible to add an option where ngsDist prints out 1) a matrix how many sites --pairwise_del retained, and 2) a matrix of the total differences between individuals? The genetic distance calculated here is simply (differences between individuals) / (total sites), right? If so, I think those two matrices would be sufficient to allow calculation of genetic distance across multiple jobs. You would take the sum of the matrix of differences (as you suggested in #8), and you could use the matrix of sites retained to calculate genetic distance for each pair.

I wish I knew C++ to be able to suggest some actual changes in the code! (It seems like a useful language to learn for bioinformatics, I am planning to try to learn it).

Thanks for considering! -Teresa

fgvieira commented 2 years ago

I can try changing the code so that the distance matrix has the distance in the lower diagonal and the number of sites in the upper diagonal (or the other way around), but not sure when I'll be able to do it.

In the meantime, you can also give it a try with distAngsd and see if you see the same biases.