Open nathanhaigh opened 4 years ago
what is your protocol?
For these mappings, it's simply bwa mem
for the short reads and minimap2
for the nanopore reads.
Given the small size of the genome, I just eye-balled possible variants and saw this. Then realised you don't call indels in your variant analysis.
This is the content I've developed for a Master/Major in Bioinformatics Genomics course: https://uofabioinformaticshub.github.io/genomics_applications/Practicals/resequencing/resequencing.html
Perhaps you are already calling variants. However, they are missing from these outputs: https://github.com/galaxyproject/SARS-CoV-2/tree/master/4-Variation#outputs
The current workflow fails to call indels because it does not use lofreq indelqual to add indel qualities to the mapped reads first. Without those lofreq call skips indels even when the --call-indels
option is in use (which is the case already).
@nekrut are you interested in having indels get called and do want to update the workflow accordingly yourself?
Hi! I was just stopping by to say this, and you already did! Indels should be called, as we have observed a bunch of them in covid19 data. Moreover they are indels that conserve the open reading frame, only deleting with aa, which makes it a most probably functional variant.
Also I think you don't filter host reads in the variant pipeline, but I'm not sure if you do it in the assembly. We have observed that bwa mem is two much sensitive and softclips a lot of reads against the human genome, bowtie2 is much better option for this type of data, specially if you are using amplicon data.
Thanks for the repo!
@saramonzon do you have evidence for host reads (human) aligning to the SARS genome?
It could be possible with bwa mem
default settings like minscore=30 but that should not be used here.
You can reduce soft clipping by changing the end-bonus settings, or as you say, using glocal alignment via bowtie2 --end-to-end
Hi @tseemann, yes using bwa with default parameters we obtain a considerably percentage of reads mapping to both human and SARS-Cov-2 depending on the sample, we have fixed it as you say using bowtie2, but using --local
not --end-to-end
.
But It's more likely that there are more virus reads mapping to human genome, than human reads mapping against SARS-Cov-2, however it could be a possibility that is worth to be avoided just in case.
I share here my notes when I was analyzing this just in case someone finds them useful, there are just personal notes, sorry about the writing!
README.txt
My own analysis of Illumina (
SRR11140750
, bottom track) and nanopore (SRR11140751
, top track) data from the same swab sample shows your variant analysis doesn't include indels:You probably should include
--call-indels
in your call tolofreq call