brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
262 stars 35 forks source link

ENH: add read_haps algorithm to estimate contamination #65

Open brentp opened 3 years ago

brentp commented 3 years ago

https://www.biorxiv.org/content/10.1101/2020.02.11.941773v2

for each of the ~17K snps, we can look for 3 haplotypes (when input is bam/cram). maybe @hannespetur could weigh in on whether this is likely to be enough sites.

hannespetur commented 3 years ago

Hey, you would need select the ~17k SNP a bit differently such that you have more SNPs in close proximity (less than the insert size) to each other because read_haps algorithm estimates the contamination from pairs of heterozygous SNPs seen on the same read/mate read. I did a quick check with awk on sites.hg38.vcf.gz and I found only 2 pairs of SNPs within 500 bp from each other so that's no good (assuming I didn't mess up my script).

Pinging @bjarnivh in case he has anything more to add.

Best, Hannes

bjarnivh commented 3 years ago

It is the number of pairs of SNPs that share a read pair that is the important factor to get read_haps to work. We give a few million SNPs with read_haps. This is probably way too much in practice and you could probably subsample tha but ensuring that you are subsampling the SNPs in fairly dense clusters.

My guess is that 17k would work if you were to select ~1700 clusters with ~10SNPs each, keeping the cluster distance < 1kb. You should be able to subsample these from the dataset that we provide. You should then make sure to have the clusters fairly well separated so they are not influenced by a single SV.

brentp commented 3 years ago

Thanks for commenting! To clarify, the 17K snps are chosen to be distant, but for each SNP, I can look in the vicinity for any other SNP/variant. So, I was thinking I could see what percent of these 17K sites are within ~400bp (or typical fragment length) of another common snp. If there are a sufficient number, then I could make sure to also assay those neighboring sites for each sample. Does that seem reasonable?

(I could also try to adjust the site-selection to maximize SNPs with another common SNP within $fragment_length).

bjarnivh commented 3 years ago

Yes, that would work. You wouldn't need to do it for all the 17k SNPs and it is better to have a cluster of few SNPs together. You also want to be absolutely sure that the SNPs are high quality SNPs, so I would recommend using our SNPs set or using similarly strict filtering criteria for selecting the SNPs used.

brentp commented 3 years ago

ok. I will carve out some time to give it a try. If I can choose sites that are within $read_length of each SNP, then this could be done with hardly any performance penalty. I'll start with your SNP set.