Illumina / strelka

Strelka2 germline and somatic small variant caller
GNU General Public License v3.0
355 stars 102 forks source link

Using Strelka2 for Allele-Specific Expression (ASE) - RNAseq #213

Open gpalou4 opened 2 years ago

gpalou4 commented 2 years ago

Hello,

I was testing Strelka2 for the use of Allele-Specific Expression (ASE).

In more detail I first run Strelka germline workflow with DNAseq to call variants. Then, I re-run it with RNAseq data, --forcedGTstrelka_DNAseq.vcf to count only the known variants called previously, and --rna flags. I had no error and it run smoothly, but after checking the output I have a couple of questions:

1) Is strelka genotyping again the variants given by --forcedGT using RNAseq data? I would like to only count the variants using RNAseq, not to genotype again. This is because it will be loosing many variants with very low counts, but it doesn't mean they do not exist. I counted ~20-30% of variants with no ALT allele on the output, meaning not genotyped/counted. In particular all the variants where ALT == 0 are missing. I'm really interested in these variants, whose gene might be targeted by nonsense-mediated mRNA decay pathway or such (it is expected that ALT allele is ~ 0).

2) The other question is regarding the internal filterings of Strelka to call a variant. I tried to use different mapping quality (MAPQ) values to remove reads on the BAM, before using Strelka. I'm using STAR for alignment, so it has five MAPQ values (0,1,2,3 for multi-mapping reads, and 255 for unique mapping reads). The thing I noticed is, when lowering the MAPQ threshold, and thus having more initial BAM reads, I ended up with less variants in the strelka output vcf. Is this related with how strelka uses reads to determine where a variant has been called correctly? My hypothesis is that strelka uses the average MAPQ across reads, and by adding low MAPQ reads it might be worse in some cases, or it might be because one of the pairs of reads is has low MAPQ and it removes both reads. Is this situation expected or am I doing something wrong?

3) The counting of variants using RNAseq is pretty decent as far as I have checked, except for the issue described in 1). However, the counting of INDELs is not that good. Most of them (+90%) are actually not counted at all. Is this something expected too, as the software is not ideally focused on INDELs with RNAseq?

Thanks a lot in advance, I love the tool!

Best,

Guille