galaxyproject / SARS-CoV-2

Ongoing analysis of COVID-19 using Galaxy, BioConda and public research infrastructures
https://covid19.galaxyproject.org
MIT License
128 stars 78 forks source link

Missing Indels #49

Open nathanhaigh opened 4 years ago

nathanhaigh commented 4 years ago

My own analysis of Illumina (SRR11140750, bottom track) and nanopore (SRR11140751, top track) data from the same swab sample shows your variant analysis doesn't include indels:

image

You probably should include --call-indels in your call to lofreq call

nekrut commented 4 years ago

what is your protocol?

nathanhaigh commented 4 years ago

For these mappings, it's simply bwa mem for the short reads and minimap2 for the nanopore reads.

Given the small size of the genome, I just eye-balled possible variants and saw this. Then realised you don't call indels in your variant analysis.

This is the content I've developed for a Master/Major in Bioinformatics Genomics course: https://uofabioinformaticshub.github.io/genomics_applications/Practicals/resequencing/resequencing.html

nathanhaigh commented 4 years ago

Perhaps you are already calling variants. However, they are missing from these outputs: https://github.com/galaxyproject/SARS-CoV-2/tree/master/4-Variation#outputs

wm75 commented 4 years ago

The current workflow fails to call indels because it does not use lofreq indelqual to add indel qualities to the mapped reads first. Without those lofreq call skips indels even when the --call-indels option is in use (which is the case already).

@nekrut are you interested in having indels get called and do want to update the workflow accordingly yourself?

saramonzon commented 4 years ago

Hi! I was just stopping by to say this, and you already did! Indels should be called, as we have observed a bunch of them in covid19 data. Moreover they are indels that conserve the open reading frame, only deleting with aa, which makes it a most probably functional variant.

Also I think you don't filter host reads in the variant pipeline, but I'm not sure if you do it in the assembly. We have observed that bwa mem is two much sensitive and softclips a lot of reads against the human genome, bowtie2 is much better option for this type of data, specially if you are using amplicon data.

Thanks for the repo!

tseemann commented 4 years ago

@saramonzon do you have evidence for host reads (human) aligning to the SARS genome?
It could be possible with bwa mem default settings like minscore=30 but that should not be used here.

You can reduce soft clipping by changing the end-bonus settings, or as you say, using glocal alignment via bowtie2 --end-to-end

saramonzon commented 4 years ago

Hi @tseemann, yes using bwa with default parameters we obtain a considerably percentage of reads mapping to both human and SARS-Cov-2 depending on the sample, we have fixed it as you say using bowtie2, but using --local not --end-to-end. But It's more likely that there are more virus reads mapping to human genome, than human reads mapping against SARS-Cov-2, however it could be a possibility that is worth to be avoided just in case. I share here my notes when I was analyzing this just in case someone finds them useful, there are just personal notes, sorry about the writing! README.txt