brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
255 stars 35 forks source link

cross individual contamination estimation #8

Open mflevine opened 5 years ago

mflevine commented 5 years ago

Hi,

I have been using Conpair in the past for concordance and cross individual contamination for tumor-normal pairs. Since you are keeping track of the genotype and allele frequency you should be able to do a similar calculation for the contamination. The Conpair method uses the normal as a prior when calculation contamination in the tumor (order of the pairs matters), so that might complicate things. Maybe only do when groups are provided.

Conpair Paper

brentp commented 5 years ago

hmm. this might be do-able. I will have a look.

vladsavelyev commented 5 years ago

It would be fantastic to have Conpair functionality in somalier (contamination, handling copy number variation). Conpair is great in what it's doing, but the codebase looks abandoned and one wouldn't want to stick to it long term.

brentp commented 5 years ago

Conpair doesn't find contamination where I expect it so I am having trouble validating my results. Does anyone have a T-N pair they can share that has contamination? I also can't get results from verifyBamID and I'm now trying verifyBamID2

mflevine commented 5 years ago

You can add a known amount of reads subsampled from a different individual to your tumor bam to simulate the cross individual contamination.

brentp commented 5 years ago

yes, I did that, but conpair does not find it!

mflevine commented 5 years ago

Huh and the reads cover the markers?

danielecook commented 4 years ago

Came here to suggest the same thing - it would be really cool if there could be a contamination estimation. @brentp - happy to help with this.

I'm not sure why a mixture of reads isn't working. I can test with my data (unfortunately, it would be difficult to share).

brentp commented 4 years ago

i didn't advertise this, but the current release will print pairs of samples where it finds contamination to stderr. i'd still need to do more work on it to make sure all is well, but I think it's fairly sane implementation.

danielecook commented 4 years ago

I can have a look and compare with conpair results if you like.

Is the stderr output when running relate?

I also spotted a typo in the instructions:

"set uknown genotypes"

brentp commented 4 years ago

hmm. actually looks like i commented it out. here's a binary with it back in and that typo fixed. somalier.gz

i'd be happy to hear about your experience with that.

brentp commented 4 years ago

and yes the output is when running relate

danielecook commented 4 years ago

Thanks I'll give it a go.

danielecook commented 4 years ago

Unfortunately I don't see any output.

I'll double check I'm using the right binary - but here is what I ran:

Version 0.2.2

somalier version: 0.2.2
Commands:
  extract      :   extract genotype-like information for a single sample from VCF/BAM/CRAM.
  relate       :   aggregate `extract`ed information and calculate relatedness among samples.
  depthview    :   plot per-chromosome depth for each sample for quick quality-control
  find-sites   :   create a new sites.vcf.gz file from a population VCF (this is rarely needed).

Output:

somalier version: 0.2.2
[somalier] collected sites from all 3280 samples
[somalier] time to calculate relatedness on 36988 usable sites: 4021.74
[somalier] wrote interactive HTML output to: somalier.html
[somalier] wrote groups to: somalier.groups.tsv
[somalier] wrote samples to: somalier.samples.tsv
[somalier] wrote pair-wise relatedness metrics to: somalier.pairs.tsv

Any idea why the usable sites is a float?

I'll have a look at your code and see if I can figure out what its doing.

Regardless - another phenomenal tool!

brentp commented 4 years ago

the message that you're referring to is?

[somalier] time to calculate relatedness on 36988 usable sites: 4021.74

this means it took 4021.74 seconds on 36988 sites (integer), so I don't see the problem.

Are you using a custom sites file? the recent ones have ~20K variants.

The code will only output a message if it finds pairs of samples that appear to have contamination, so it did not find any in your set.

danielecook commented 4 years ago

Sorry I misread the time message! Was a bit tired.

I am using a custom file. I'll have a look and see if I can figure out how you are estimating contamination. I imagine you are calculating some sort of contamination metric that I could output and compare with conpair?

brentp commented 4 years ago

here is the code: https://github.com/brentp/somalier/blob/master/src/somalierpkg/estimate_contamination.nim

danielecook commented 4 years ago

Haven't gotten a chance to take a look at this - but one other general thought...

The presence of even a small number of population SNPs as somatic variants is indicative of contamination be it from another sample OR from the lab. I'll see if there is a way this can be integrated unless you have already thought about this.

brentp commented 4 years ago

currently, somalier tries to use every other sample as a prior for contamination for a given sample so if we have e.g.:

#sample  allele-balance
A              0.0
B              0.5
C              0.02

and we are trying to determine who might have contaminated C, then we know that it cannot be A. and it could be B. Then there's a simple way to get the maximum likelihood estimate of contamination across a number of sites. For this site, we'd estimate e.g. 4% contamination of C from B.

We can also use the population allele frequencies of the sites vcf as a prior instead of or in addition to testing against all other samples. so if the site allele frequency for the case above was 0.2, then we'd treat that sorta like a sample with an allele balance of 0.2 (this is done by conpair or some other tool, I've now forgotten).

Anyway, this is the basis of the logic that's currently in somalier except that it does not yet do the population stuff using the AF from the sites vcf.

mflevine commented 4 years ago

I believe the Conpair does not try to estimate which samples contaminated your sample as you may not have the data from the contaminating samples. I believe it instead uses the deviation from expected allele balance for homozygous snps.

brentp commented 4 years ago

one of the tools I was reading about does it for normal only -- e.g. checking for normal contamination in tumor. and then a tool also does the population-based approach (maybe the same one). somalier has the advantage that it has all the genotypes for all samples so it can check all vs all.

danielecook commented 4 years ago

@brentp thanks for your response. I’ll probably get a chance to drill in a bit more early September.

I’ve identified cross contamination but somalier did not find it using the method I did. I can follow up next week.

I guess the major point I was making is you have the data to identify lab contamination if no cross contaminant is found, If you observe excess somatic vars that are pop snps that don’t match another sample.

As an aside, I’ve also looked at het/hom-alt ratios and this has been fruitful in identifying issues as well (germline / Tumor swaps of the same sample and lab contamination).

Thanks!

brentp commented 4 years ago

yes, I agree. with @danielecook and @mflevine , in peddy, we often think of the idr_baf (basically the width of the allele balance distribution) as a proxy for contamination. the number of het calls also works. i'm not sure what's the best approach in somalier. but I am open to suggestions. there are deficiencies in what's currently in somalier (which is why it was commented out). but yes, I very much welcome the help. i think you are onto something re pop snps. maybe somalier can check for contamination against any sample with a relatedness > 0.5 and against the population allele frequencies.

what was the level of contamination that somalier was missing?

danielecook commented 4 years ago

@brentp I don’t have exact numbers but it was on the lighter side. I actually looked at it using ExAC variants (-TCGA variants) on my own. I counted the number of non-exac to exac variants identified in tumors. When plotting exac v non-exac, uncontaminated samples will fit on a line corresponding to the rate at which mutation produces SNVs that match Exac variants by chance. Contaminated samples appear as outliers with a high number of exac matches. This was my initial simplistic way of looking at this and it works pretty well imo. As you suggested, integrating pop AF would also probably improve this idea further.

The “exac rate” (#exac/total) that I observed ranged from 2-20% for what I considered contaminated with most being toward the bottom of that range. I don’t recall the VAFs at the moment.

Will have a look with my datasets and hopefully provide some feedback in the next couple weeks.

brentp commented 4 years ago

@danielecook since you have so many samples, would you mind giving the dev version a try on your 3280 samples? the extract step is unchanged, but the relate step should be hundreds of times faster for nearly identical results.

let's also revisit this simplified contamination idea. I can implement and test.

Here is the new binary (I know you can build, but always nice to have a simple download).

(removed)

brentp commented 4 years ago

... here's the correct binary. somalier.gz

danielecook commented 4 years ago

@brentp - Definitely. Is the extracted file format identical? I just need to run relate? I can benchmark both in terms of time. I'll see if I can carve out some time tomorrow to look at contamination.

danielecook commented 4 years ago

Oh looks like we got a benchmark for the old version above nice!

brentp commented 4 years ago

yes, just run relate. extract is same. use the 2nd one. I'll remove the incorrect link.

danielecook commented 4 years ago

Got this run a lot quicker than I thought I would. There are an extra 8 samples but it was a lot faster.

somalier version: 0.2.3
[somalier] time to read files and get per-sample stats for 3288 samples: 5.14
[somalier] time to get expected relatedness from pedigree graph: 0.03
[somalier] time to calculate all vs all relatedness for all 5403828 combinations: 34.31
[somalier] wrote interactive HTML output to: somalier.html
[somalier] wrote groups to: somalier.groups.tsv
[somalier] wrote samples to: somalier.samples.tsv
[somalier] wrote pair-wise relatedness metrics to: somalier.pairs.tsv

real    1m35.053s
user    0m38.242s
sys     0m5.006s

I'll run with 2.2 to get the exact comparison.

brentp commented 4 years ago

looks like ~120X faster than your previous run. so as long as the output looks same as before, then it's a good change.

danielecook commented 4 years ago
somalier version: 0.2.2
[somalier] collected sites from all 3288 samples
[somalier] time to calculate relatedness on 36988 usable sites: 936.06
[somalier] wrote interactive HTML output to: somalier.html
[somalier] wrote groups to: somalier.groups.tsv
[somalier] wrote samples to: somalier.samples.tsv
[somalier] wrote pair-wise relatedness metrics to: somalier.pairs.tsv

real    16m13.543s
user    16m7.631s
sys     0m4.289s

Here it is with version 2.2. Looks more like 16x speedup. The difference may be because I built the old version without -d:release or possibly because I was running on a crowded node. These two benchmarks were run on exclusive nodes.

brentp commented 4 years ago

hmm. ok. 27X for the actual relatedness calculations. less than expected, but still nice. with recent nim, you'll need to compile with -d:danger as well.

I am going to try to implement the contamination checking you mentioned. I wonder if the p_middling_ab matches (some scale) your estimate of contamination. it calculates the proportion of sites that have an allele balance between 2%-20% and 80%-98%. I think using something that, but adjust for population allele frequency could be sufficient.

Your exact method:

counted the number of non-exac to exac variants identified in tumors

(if I understand correctly) can't work for the approach in somalier because all variants checked by somalier are in exac.

danielecook commented 4 years ago

@brentp -

Ah good point the exac method wouldn't work.

With regard to your proposed method - it seems to me you can use the p_middling_ab for all sites in the germline samples, but you would have to restrict the analysis in tumor samples to sites that are HOM-REF and HOM-ALT in the germline to account for CNVs.

I can run this on my samples so we can do a comparison between methods as you suggested.

brentp commented 4 years ago

if your samples are the same as the the ones you just ran, then the p_middling_ab column in $prefix.samples.tsv should have that value. I think it's ok to ignore the effect of CNV for this metric.

danielecook commented 4 years ago

Great - I will take a look. We observe considerable copy number variation in some samples which I would guess increases the rate of p_middling_ab without contamination. I'll compare with my metrics and see how things scale and also see whether copy number variation influences p_midding_ab in tumors.

Thanks!

brentp commented 4 years ago

I guess p_middling_ab could be scaled by the proportion of sites that are outside of mean-depth +/- $z standard deviations from the mean