alexdobin / STAR

RNA-seq aligner
MIT License

WASP with indels #591

Open kwcurrin opened 5 years ago

kwcurrin commented 5 years ago

Hello, it looks like STAR-WASP only considers SNPs and not indels in the VCF files. Is this correct?

Thanks!

Kevin

alexdobin commented 5 years ago

Hi Kevin,

like the original WASP, STAR-WASP only works with SNVs.

Cheers Alex

kwcurrin commented 5 years ago

Hi Alex,

While WASP only uses SNPs in the remapping step, it performs a pre-filtering step where reads overlapping indels are removed. Does STAR do this pre-filtering step also?

Thanks!

Kevin

alexdobin commented 5 years ago

Hi Kevin,

no, the indel filtering is not included at the moment. Does it make a significant difference?

Cheers Alex

kwcurrin commented 5 years ago

Hi Alex,

I haven't tested it directly, but when I used WASP on one set of samples without indels in the VCF file, the average reference allele ratio was ~0.53 for each sample after WASP mapping. On a different set of samples run with indels in the VCF files, the average reference allele ratio was 0.50 for each sample. I can't say for sure that indels were the reason for this, but it is my best explanation. Theoretically keeping with that overlapping indels could lead to reference allele bias if the indels aren't included in the remapping, but does it account for 3% reference allele bias on its own? I'm not sure.

Kevin

kwcurrin commented 5 years ago

Sorry for the typo: "Theoretically keeping reads overlapping indels could lead to reference allele bias if the indels aren't included in the remapping..."
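
For readers who want to reproduce the comparison above, here is a minimal Python sketch of one way to compute an average reference allele ratio. The thread does not say how the ~0.53 and 0.50 figures were computed, so the per-site averaging, the function name `mean_reference_ratio`, and the example counts are all assumptions made for illustration.

```python
def mean_reference_ratio(counts):
    """Average ref / (ref + alt) over heterozygous sites.

    counts: iterable of (ref_reads, alt_reads) tuples, one per site.
    Sites with zero coverage are skipped. Averaging per site (rather than
    pooling all reads) is an assumption, not something stated in the thread.
    """
    ratios = [ref / (ref + alt) for ref, alt in counts if ref + alt > 0]
    if not ratios:
        raise ValueError("no sites with coverage")
    return sum(ratios) / len(ratios)


# Hypothetical allele counts for four heterozygous sites (made-up numbers).
example_counts = [(53, 47), (10, 10), (0, 0), (30, 20)]
example_ratio = mean_reference_ratio(example_counts)
```

A value near 0.5 is what one would expect without reference bias; values above 0.5 indicate that reads carrying the reference allele are over-represented.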

alexdobin commented 5 years ago

Hi Kevin,

I guess a better test would be to run WASP with and without indels in the VCF on the same data, and see how much difference it makes. Also, you could filter out those SNVs which have indels closer than the read length to them.

Cheers Alex
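
The filtering suggested above (dropping SNVs that have an indel closer than the read length) could be sketched roughly as follows. This is not part of STAR or WASP; the function names, the tuple layout of the records, and the simplification of measuring distance from the indel's POS only are assumptions made for illustration.

```python
import bisect


def is_indel(ref, alt):
    # Treat any ALT allele whose length differs from REF as an indel.
    return any(len(a) != len(ref) for a in alt.split(","))


def is_snv(ref, alt):
    # Single-base REF and single-base ALT allele(s).
    return len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))


def snvs_far_from_indels(records, read_length):
    """Keep SNVs with no indel within read_length bp on the same chromosome.

    records: list of (chrom, pos, ref, alt) tuples, e.g. parsed from a VCF.
    Distance is measured between POS values only (a simplification).
    """
    indel_positions = {}
    for chrom, pos, ref, alt in records:
        if is_indel(ref, alt):
            indel_positions.setdefault(chrom, []).append(pos)
    for positions in indel_positions.values():
        positions.sort()

    kept = []
    for chrom, pos, ref, alt in records:
        if not is_snv(ref, alt):
            continue
        positions = indel_positions.get(chrom, [])
        # First indel at or after the left edge of the window around this SNV.
        i = bisect.bisect_left(positions, pos - read_length + 1)
        near_indel = i < len(positions) and positions[i] <= pos + read_length - 1
        if not near_indel:
            kept.append((chrom, pos, ref, alt))
    return kept


# Hypothetical records: the SNV at chr1:100 is 50 bp from the indel at chr1:150.
example_records = [
    ("chr1", 100, "A", "G"),
    ("chr1", 150, "AT", "A"),
    ("chr1", 500, "C", "T"),
    ("chr2", 120, "G", "A"),
]
example_kept = snvs_far_from_indels(example_records, read_length=100)
```

Sorting the indel positions once and using a binary search keeps the check fast even for genome-wide VCFs.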

zitoa commented 4 years ago

Hi both, interesting discussion.

Alex, while the inclusion of INDELs may not necessarily lead to mapping biases, it will affect both the alignment score (AS) and mapping quality (MAPQ). WASP was originally designed to correct for mapping biases affecting SNPs, not INDELs, so I was wondering if it would be better to pre-filter the VCF in order to keep only biallelic variants. I am currently working on this and dealing with an error concerning the VCF file format. Does STAR require the VCF as a plain text file of variants (like in the original WASP implementation) or does it require the whole VCF? Neither seems to work for me at the moment:

Thanks, Antonino

alexdobin commented 4 years ago

Hi Antonino,

presently, STAR does not read the indels from the VCF file. However, to avoid the indel-induced biases as suggested by Kevin, you could filter the SNVs that are close to the indels. I am not sure how much effect it has on RNA-seq data. I am planning to add an option to filter out the reads that overlap indels.

The VCF files have to be standard VCF, with 9 fields, with the genotype (9th column) of 0/1, 1/1, 1/0.

Cheers Alex
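
As a rough illustration of the VCF requirements described above, the following Python sketch checks whether a single VCF data line looks like a usable SNV record. It is not STAR's own parser: the function name is hypothetical, the genotype is read from the last tab-separated column, and accepting only the three unphased genotypes listed above is an assumption.

```python
# Genotypes listed in the reply above; phased forms such as 0|1 are not
# covered here because the thread does not mention them.
ACCEPTED_GENOTYPES = {"0/1", "1/0", "1/1"}


def usable_snv_line(line):
    """Return True if a VCF data line is a biallelic SNV with an accepted genotype."""
    if line.startswith("#"):
        return False  # header or meta-information line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # too few columns for a standard VCF record
    ref, alt = fields[3], fields[4]
    if len(ref) != 1 or len(alt) != 1:
        return False  # indel, MNP, or multi-allelic site
    # The genotype is the first colon-separated subfield of the last column.
    genotype = fields[-1].split(":")[0]
    return genotype in ACCEPTED_GENOTYPES


# Hypothetical example lines (made-up coordinates).
example_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1",
    "chr1\t150\t.\tAT\tA\t.\tPASS\t.\tGT\t0/1",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0/0:12",
]
example_flags = [usable_snv_line(line) for line in example_lines]
```

Running a check like this over the file before passing it to STAR can help narrow down whether a format error comes from the header, the column count, or the genotype field.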

zitoa commented 4 years ago

Hi Alex, I have filtered out all INDELs from the VCF, and not only those close to SNVs, since multimapping reads may occur and still involve both SNVs and INDELs. Antonino

alexdobin commented 4 years ago

Hi Antonino,

sure, at the moment indels can be simply ignored. I have to think a bit about the best way of fully incorporating them.

Cheers Alex

VitorAguiar commented 1 year ago

Interesting. I see the same reference bias of ~0.53, which is consistent across samples. Any updates or recommendations on this?

alexdobin commented 1 year ago

Hi Vitor,

we are working on a scheme to incorporate indels into a diploid genome, which will be ready in 1-2 months.

VitorAguiar commented 8 months ago

Hi @alexdobin, I think the feature that you mentioned is already implemented at this point. I'm curious about how it affects WASP.

I think your idea is to map reads to a personalized diploid genome that accounts for the indels carried by an individual, thus better mapping the reads that overlap indels. Then, in theory, we wouldn't need to filter out those reads anymore. Is that right?

alexdobin commented 8 months ago

Hi Vitor,

Indeed, mapping to the diploid genome will eliminate reference bias and improve the accuracy of mapping reads containing indels. The WASP approach may still further increase accuracy; however, it presently does not deal with reads overlapping indels.