genomic-medicine-sweden / tomte

A Nextflow pipeline for analysing expression and splicing in RNA-seq data from rare disease patients
MIT License

RNA-seq contamination check #163

Open Jakob37 opened 1 month ago

Jakob37 commented 1 month ago

Description of feature

Hi from Lund!

We discussed Tomte today, and what would be needed for us to get it into production.

One thing that came up is that we would want a contamination check. In a separate RNA-seq pipeline here in Lund, this is done by selecting a set of ~200 sites in "housekeeping genes", calling these, and checking whether the heterozygosity patterns match what would be expected from a pure sample or from a contaminated one.
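As a rough illustration of the idea (not the actual Lund script, whose details are not described here), one simple heuristic is to look at the alt allele fraction at each target site. In a pure diploid sample the fractions should cluster near 0, 0.5 and 1, while contamination tends to produce intermediate values. The `SiteCounts` type, the function name and the `min_depth` and `tolerance` defaults below are all assumptions made for the example, and allele-specific expression in RNA-seq can also shift the fractions, so the thresholds would need tuning.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Read counts at one target site (hypothetical input format)."""
    ref_depth: int
    alt_depth: int


def off_band_fraction(sites, min_depth=20, tolerance=0.15):
    """Return the fraction of well-covered sites whose alt allele
    fraction is not close to 0, 0.5 or 1 (None if no site is usable)."""
    expected = (0.0, 0.5, 1.0)
    usable = 0
    off_band = 0
    for site in sites:
        depth = site.ref_depth + site.alt_depth
        if depth < min_depth:
            continue  # too shallow to judge, skip the site
        usable += 1
        vaf = site.alt_depth / depth
        # A site is "off band" if it is far from every expected fraction.
        if all(abs(vaf - e) > tolerance for e in expected):
            off_band += 1
    return off_band / usable if usable else None


example_sites = [
    SiteCounts(50, 50),   # looks heterozygous
    SiteCounts(100, 0),   # looks homozygous reference
    SiteCounts(0, 80),    # looks homozygous alternate
    SiteCounts(75, 25),   # intermediate fraction, counted as off band
    SiteCounts(5, 5),     # below min_depth, ignored
]
print(off_band_fraction(example_sites))  # 0.25
```

A high off-band fraction across the ~200 sites would then be reported as an elevated contamination risk rather than as a definitive call.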

We would include these results in a QC report to give an indication of risk for contamination.

What do you guys say about having something similar added to Tomte?

jemten commented 1 month ago

Sounds like a cool idea to me. There is a variant calling part of Tomte. Do you want to take the VCF generated there, extract the calls in housekeeping genes, and run your heterozygosity check, or would you like to do a separate variant calling step for this?

Jakob37 commented 1 month ago

Yes, I was wondering the same actually. As I understood it, they currently run DNAscope on the targeted sites in addition to the regular calls. But I don't know whether there is a reason not to just reuse a subset of the calls that are already made and feed these into the contamination check script. I'll ask around.

Jakob37 commented 1 month ago

It sounds like we run a dedicated calling step for this set of sites so that 0/0 (homozygous reference) calls appear in the output as well. The rationale is to be sure that each site was actually genotyped, and is not just absent from the output due to lack of coverage.

I think we are using DNAscope for this at the moment, but I would guess GATK's HaplotypeCaller might work as well.

After that we do a post-processing step using these calls to estimate the contamination.
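To make the 0/0 versus missing distinction concrete, here is a minimal sketch of how the target sites could be sorted before the contamination estimate. It assumes the genotypes have already been read into a dictionary keyed by `(chrom, pos)`. The function name and input layout are hypothetical and not taken from the Lund pipeline.

```python
def classify_target_sites(target_sites, genotypes):
    """Split target sites into hom-ref, variant and no-call groups.

    target_sites: iterable of (chrom, pos) tuples
    genotypes: dict mapping (chrom, pos) to a GT string such as
               "0/0", "0/1" or "./." (hypothetical input layout)
    """
    result = {"hom_ref": [], "variant": [], "no_call": []}
    for site in target_sites:
        gt = genotypes.get(site)
        if gt is None:
            # Site is absent from the calls, e.g. due to lack of coverage.
            result["no_call"].append(site)
            continue
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            result["no_call"].append(site)
        elif all(allele == "0" for allele in alleles):
            result["hom_ref"].append(site)
        else:
            result["variant"].append(site)
    return result


targets = [("chr1", 100), ("chr1", 200), ("chr2", 300), ("chr3", 400)]
calls = {("chr1", 100): "0/0", ("chr1", 200): "0/1", ("chr2", 300): "./."}
groups = classify_target_sites(targets, calls)
```

Only sites in the `hom_ref` and `variant` groups would go into the estimate, and the size of the `no_call` group could be reported alongside it as a measure of how well the site set was covered.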