Pipeline to process SARS-CoV-2 sequences and metadata, clean up irregularities, align and call variants, then publish matched subsets of FASTA sequences and metadata for groups with different levels of access to sensitive data.
Runs weekly on global sequences downloaded from GISAID.
Runs daily on COG-UK sequences, and combines with non-UK GISAID sequences.
git clone --recurse-submodules https://github.com/COG-UK/datapipe.git
cd datapipe
conda env create -f environment.yml
conda activate datapipe
NXF_VER=20.10.0 nextflow run workflows/process_cog_uk.nf <params>
GISAID sequences are processed as follows:

1. Parse GISAID dump (export.json) and extract FASTA of sequences and associated metadata.
   - Excludes sequences listed in gisaid_omissions.txt
   - Excludes sequences where covv_host.lower() != 'human'
   - Excludes sequences with a malformed (not YYYY-MM-DD) or impossible (earlier than 2019-11-30 or later than today) date in covv_collection_date
   - Adds epi-week and epi-day columns to metadata
2. Run pangolin (https://github.com/cov-lineages/pangolin) on all new sequences. If there is a new release of pangolin, run it on all sequences.
3. Calculate the unmapped_genome_completeness as the proportion of the sequence length which is unambiguous (not N).
4. Deduplicate by date, keeping the earliest example.
5. Align to the reference (Wuhan/WH04/2020) with minimap2.
6. Call variants using gofasta and type specific mutations of interest listed in AAs.csv and dels.csv.
7. Filter out low-quality sequences with mapped completeness < 93%, and trim and pad the alignment outside of reference coordinates 265:29674.
8. Calculate distance to the reference and exclude sequences with a distance of more than 4.0 epi-week standard deviations.
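
The exclusion rules and the completeness calculation in the steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline's own code: the function names are hypothetical, and the identifier field covv_accession_id used to match entries in the omissions list is an assumption (the text does not say which field is used). The host check, date window and "not N" definition are taken from the text.

```python
import re
from datetime import date, datetime
from typing import Optional, Set

EARLIEST_DATE = date(2019, 11, 30)  # earliest plausible collection date, per the text
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")  # strict YYYY-MM-DD


def parse_collection_date(value, today: Optional[date] = None) -> Optional[date]:
    """Return the date if it is well-formed and plausible, otherwise None."""
    today = today or date.today()
    # Reject anything that is not exactly YYYY-MM-DD (e.g. "2020-03" or "2020-3-5").
    if not isinstance(value, str) or not DATE_PATTERN.fullmatch(value):
        return None
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:  # e.g. 2020-02-30
        return None
    # Reject impossible dates: too early, or in the future.
    if parsed < EARLIEST_DATE or parsed > today:
        return None
    return parsed


def keep_record(record: dict, omitted_ids: Set[str],
                today: Optional[date] = None,
                id_field: str = "covv_accession_id") -> bool:
    """Apply the three exclusion rules; id_field is an assumed column name."""
    if record.get(id_field) in omitted_ids:
        return False
    if str(record.get("covv_host", "")).lower() != "human":
        return False
    return parse_collection_date(record.get("covv_collection_date"), today) is not None


def unmapped_genome_completeness(sequence: str) -> float:
    """Proportion of the sequence length that is not N."""
    if not sequence:
        return 0.0
    return 1.0 - sequence.upper().count("N") / len(sequence)
```

Only N is counted as ambiguous here, following the wording of the step; other IUPAC ambiguity codes are treated as unambiguous in this sketch.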
COG-UK sequences are processed as follows:

1. Parse matched FASTA and metadata TSV output by Elan/Majora.
   - Applies date corrections from date_corrections.csv
   - Excludes sequences listed in resequencing_omissions.txt
   - Excludes sequences with a malformed (not YYYY-MM-DD) or impossible (earlier than 2019-11-30 or later than today) date in covv_collection_date
   - Adds epi-week, epi-day, source_id and pillar_2 columns to metadata
2. Run pangolin (https://github.com/cov-lineages/pangolin) on all new sequences. If there is a new release of pangolin, run it on all sequences.
3. Calculate the unmapped_genome_completeness as the proportion of the sequence length which is unambiguous (not N).
4. Deduplicate COG-ID by completeness and label samples with duplicate source_id.
5. Align to the reference (Wuhan/WH04/2020) with minimap2.
6. Call variants using gofasta and type specific mutations of interest listed in AAs.csv and dels.csv.
7. Filter out low-quality sequences with mapped completeness < 93%, and trim and pad the alignment outside of reference coordinates 265:29674.
8. Clean up geographical metadata (https://github.com/COG-UK/geography_cleaning).
9. Combine COG-UK sequences and metadata with non-UK GISAID sequences and metadata.
10. Publish subsets of the data as described in publish_cog_global_recipes.json.
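
The COG-UK deduplication step above could look something like the following Python sketch. Several details are assumptions rather than facts from the pipeline: that the COG-ID corresponds to central_sample_id, that "by completeness" means keeping the most complete record, that ties keep the first record seen, and that the label is stored in a column named duplicate (a hypothetical name).

```python
from collections import Counter
from typing import Dict, List


def deduplicate_by_completeness(rows: List[Dict]) -> List[Dict]:
    """Keep the most complete record per COG-ID, then flag shared source_id values."""
    best: Dict[str, Dict] = {}
    for row in rows:
        key = row["central_sample_id"]  # assumed to be the COG-ID
        current = best.get(key)
        # Strict ">" means the first record wins when completeness is tied.
        if (current is None
                or row["unmapped_genome_completeness"]
                > current["unmapped_genome_completeness"]):
            best[key] = row

    # Copy the rows so the caller's input is not modified.
    kept = [dict(row) for row in best.values()]

    # Count how many surviving records share each non-empty source_id.
    counts = Counter(row["source_id"] for row in kept if row.get("source_id"))
    for row in kept:
        # "duplicate" is a hypothetical column name for the label.
        row["duplicate"] = counts.get(row.get("source_id"), 0) > 1
    return kept
```

Records with an empty source_id are never labelled as duplicates in this sketch, since an empty value carries no information about the patient source.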
Notes on metadata fields:

- sample_name is composed of the country (England/Scotland/Wales/Northern_Ireland), central_sample_id and year, and represents a short informative name.
- sample_date is the collection_date if provided, and otherwise the received_date. At least one of these is required to submit metadata.
- epi_week and epi_day convert the sample_date to the pandemic week and day in which it falls, with the week commencing 2019-12-22 as week 0. This represents the week of the earliest sequenced genomes, although modelling suggests the pandemic began earlier.
- source_id is taken from root_biosample_source_id or biosample_source_id (with preference for root_biosample_source_id). Both represent samples from the same patient source, but are completed by different sequencing teams.
- is_pillar_2 is set if collection_pillar is specified as 2, or if central_sample_id has been generated by a known pillar 2 organisation. This is indicative of surveillance sequencing as opposed to targeted hospital sequencing.
- uk_lineage is a column from the previous phylopipe output.

grapevine (https://github.com/COG-UK/grapevine) was the name of the original pipeline, which did all of the above, made phylogenetic trees and more. As the number of sequences has grown, the tree-building steps take increasingly long to complete. As the majority of users only interact with the alignments and cleaned metadata, it was decided that a robust implementation of the alignment and metadata processing steps, run daily, would be more useful, and that is what is provided here.
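
The derived fields sample_date, epi_week, epi_day and source_id described earlier can be sketched in Python as below. The function names are hypothetical. Two points are assumptions, because the text only fixes the week commencing 2019-12-22 as week 0: that epi_day counts days from 2019-12-22 as day 0, and that dates before 2019-12-22 receive negative values.

```python
from datetime import date
from typing import Optional, Tuple

WEEK_ZERO_START = date(2019, 12, 22)  # first day of epi-week 0, per the text


def sample_date(collection_date: Optional[date],
                received_date: Optional[date]) -> date:
    """Prefer collection_date, falling back to received_date."""
    if collection_date is not None:
        return collection_date
    if received_date is not None:
        return received_date
    raise ValueError("at least one of collection_date or received_date is required")


def epi_week_and_day(day: date) -> Tuple[int, int]:
    """Return (epi_week, epi_day) relative to 2019-12-22.

    Assumption: epi_day is the number of days since 2019-12-22 (day 0).
    Floor division gives negative weeks for earlier dates.
    """
    days = (day - WEEK_ZERO_START).days
    return days // 7, days


def source_id(root_biosample_source_id: Optional[str],
              biosample_source_id: Optional[str]) -> str:
    """Prefer root_biosample_source_id; return an empty string if both are missing."""
    return root_biosample_source_id or biosample_source_id or ""
```

For example, under these assumptions a sample dated 2020-03-01 falls 70 days after 2019-12-22, so epi_week_and_day(date(2020, 3, 1)) gives week 10 and day 70.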