galaxyproject / SARS-CoV-2

Ongoing analysis of COVID-19 using Galaxy, BioConda and public research infrastructures
https://covid19.galaxyproject.org
MIT License
128 stars 78 forks source link

Consensus genome and variant calling from Illumina only #141

Open pvanheus opened 4 years ago

pvanheus commented 4 years ago

Hi everyone

I have been working with the National Institute for Communicable Diseases (NICD) on some of their SARS-CoV-2 sequencing. I adapted the "variation" workflow to call variants in Illumina data and also produce an inferred consensus genome. The current "Assembly" workflow assumes that you have access to both Illumina and Nanopore genomes for a sample, which is a pretty rare situation. My workflow (with some TODOs in the step annotation) is in a gist. Comments and additions are welcome!

If it is found to be useful perhaps it can be incorporated into the COVID-19 resources page.

P.S. for those with ARTIC Amplicon data I created a workflow for analysing that as did Thanh le Viet. Pasting them here in case they are useful.

mvdbeek commented 4 years ago

Can you make the consensus part of you workflow a separate workflow ? Might be interesting to compare this to the samples that also appear in GISAID. I think we can add this to our current workflow (which is a subworkflow with a parallelized variant of the download step, and then runs the SE and PE workflows -- https://usegalaxy.org/u/sars-cov2-bot/w/download-and-sepe-illumina-covid-variation-workflow-imported-from-uploaded-file).

Can you describe the artic workflow in a bit more detail ? Why minimap, what's the bed file used for, why ivar, etc ? A bit like we do in https://covid19.galaxyproject.org/genomics/4-Variation/#how-do-we-call-variants ?

pvanheus commented 4 years ago

Hi @mvdbeek the only steps that are not involved in generating the consensus are the final "SnpSift Extract Fields" and "Collapse Collection" steps. Everything else goes into computing the consensus. So in the workflow you linked it would.

1) use the FASTA from Step 3: SnpEff build 2) use bedtools genomecov on Step 9: Realign reads 3) Filter to identify low coverage regions (low coverage should be a configurable parameter) 4) SnpSift Filter to filter variants by allelle frequency (also should be a configurable parameter) 5) use (1) and (4) as inputs to bcftools consensus 6) use (5) and (3) as inputs to bedtools MaskFastaBed to mask out low coverage regions 7) use sed on (6) to rename the FASTA to final consensus name

So it plugs quite naturally into what you are using. I imagine that you want to keep evolving the workflow to incorporate new insights into filtering (e.g. some of what is being discussed here).

pvanheus commented 4 years ago

Commenting separately on the ivar / amplicon story. The majority of genomes out there are being generated using ARTIC amplicon protocol detailed here for Illumina and here for ONT. My focus is the Illumina version.

The BED file is the location of ARTIC primers (there are 2 sets in common use - v.1 and v.3). This feeds into ivar trim to trim these out of the reads. ivar is specifically designed for this kind of amplicon sequencing, thus its use in the workflow. The choice of minimap2 and bwa-mem is somewhat arbitrary - mine was based on an approach Torsten Seemann at Doherty Institute in Melbourne was using, using minimap2. Thanh used bwa-mem in his version of the workflow.

I can write up a section for the web page but thought I should start the discussion here to iron out issues like the ones you've raised first.

wm75 commented 4 years ago

@pvanheus just proposed an update (#185) of the ARTIC WF in its dedicated section. I've extended its core part somewhat, but discarded the consensus calling part to highlight the similarity to the regular variation WFs. A separate WF showing consensus calling (not necessarily specific to ARTIC data) may still be good to have. We just need to decide on a section to add it to.

Slugger70 commented 4 years ago

Hi @wm75, I took the output of zip collection to pass to the rest of the WF instead of the individual read datasets from fastp and also put a flatten collection between Qualimap BAMQC and MultiQC so multiqc only runs once. My version of the WF is at:https://usegalaxy.org.au/u/simongladman/w/covid-19-variation-analysis-on-artic-pe-data

wm75 commented 4 years ago

@Slugger70 I think by passing a list:paired collection to bwa-mem, you may run into the same blocking WF scheduling issue that you're trying to avoid at the fastp step. At least that's my recollection. Have you checked that this parallelizes correctly?

wm75 commented 4 years ago

Thanks for spotting that MultiQC issue! I'll look into it for EU.

wm75 commented 4 years ago

I've updated the EU WF with the Flatten Collection step of the Qualimap output just as you were doing it for AU. Thanks again @Slugger70!