git clone https://github.com/boxiangliu/covseq.git
CoV-Seq is written and tested in python 3.7
, and should in theory work with any python 3
versions. CoV-Seq is not compatible with python 2
.
CoV-Seq has been tested on Ubuntu 18.04.4, macOS Catalina, and Windows 10.
NOTE: if you are using Windows, please refer to README_win_os.md
for more instructions before proceeding.
For Linux and Mac users, you will need to install bcftools
and htslib
by following instructions here.
For Windows users, please refer to README_win_os.md
for more instructions about dependencies.
You will also need to install the following Python packages:
click
biopython
hashlib
pandas [>=1.0.3]
matplotlib
seaborn
dateutil
werkzeug
We have made CoV-Seq very easy to use. To annotate a SARS-CoV-2 genome in FASTA format, go to the covseq
directory and run:
python annotation/annotation.py --fasta data/GZ8H0001.fasta --out_dir results
NOTE: Replace data/GZ8H0001.fasta
with your fasta file and results
with your desired output directory.
If the command was successful, there should 4 files in the results/
directory.
Guangzhou_GZ8H0001_2020_orf.tsv
Guangzhou_GZ8H0001_2020.vcf
Guangzhou_GZ8H0001_2020.snpEff.vcf
Guangzhou_GZ8H0001_2020.snpEff.tsv
The filename Guangzhou_GZ8H0001_2020
comes from the FASTA header in data/GZ8H0001.fasta
. Notice that characters "/" have been automatically replaced with "_" to save us from using escape characters. Here is a description of each file:
That's it. You can now use the VCF and annotations for downstream analysis.
For all options, run
annotation/annotation.py --help
In order to extract variants from many sequences all at once, use the batch processing script:
python3 vcf/fasta2vcf.py \
-f data/batch.fasta \ # The input fasta file
-r data/NC_045512.2.fasta \ # The reference fasta file
-o data/batch/ # The output directory
Replace the options with your fasta files and output directory. The reference fasta is the standard reference for SARS-CoV-2 and does not need to be changed.
In the output directory (data/batch/
in this case), there will be N pairs of *.vcf.gz
and *.vcf.gz.tbi
, where N is the number of input sequences in the input fasta file.
The following section describes the pipeline we used to batch process tens of thousands of FASTA sequences available for download on covseq.baidu.com/browse
.
The first step is to download data from repositories. All steps assume covseq
repo as the working directory.
To download sequence and metadata from GISAID:
../data/gisaid/fasta/
../data/gisaid/metadata/
To download data from NCBI:
../data/ncbi/fasta/
../data/ncbi/metadata/
To download data from EMBL:
../data/embl/fasta/
../data/embl/metadata/
To download data from CNGB:
../data/cngb/fasta/
../data/cngb/metadata/
We will preprocess the data by first concatenating all FASTA files and standardize FASTA headers
python3 preprocess/concatenate_fasta.py -i ../data/ -o ../data/aggregated/fasta/raw.fasta --cgnb_metadata <../data/cngb/metadata/CNGBdb_VirusDIP.csv>
Replace ../data/cngb/metadata/CNGBdb_VirusDIP.csv
with your own CNGB metadata file.
Next we will remove incomplete genomes (number of nucleotides < 25000).
python3 preprocess/filter_fasta.py -i ../data/aggregated/fasta/raw.fasta --out_dir ../processed_data/preprocess/filter_fasta/ --final_fn ../data/aggregated/fasta/preprocessed.fasta
Note that this command will create a ../processed_data/preprocess/filter_fasta/
to store intermedite files.
Now we have preprocessed FASTA files, let's call variants from nucleotide sequences, using the RefSeq sequence NC_045512.2
as the reference.
python3 vcf/fasta2vcf.py -f ../data/aggregated/fasta/preprocessed.fasta -r data/NC_045512.2.fasta -o ../data/aggregated/vcf/individual/
This command will create a directory called ../data/aggregated/vcf/individual/
, where VCF files from individual FASTA files will reside.
Some FASTA files have large numbers of sequencing errors and will produce abnormally long lists of variants. Let's filter them out.
python3 vcf/filter_samples.py -i ../data/aggregated/vcf/individual/ -o ../processed_data/vcf/filter_samples/ -c 150
Here we have chosen 150 as a default because a clear gap exists between samples with <150 mutations and samples with >1000 mutations with nothing in between. However, feel free to adjust this parameter to fit your needs.
Next we need to merge VCF files and normalize each record.
bash vcf/merge_vcfs.sh ../processed_data/vcf/filter_samples/ ../data/aggregated/vcf/merged/ data/NC_045512.2.fasta
Finally we will remove multi-allelic variants and variants within the poly-A tail.
bash vcf/filter_sites.sh ../data/aggregated/vcf/merged/merged.vcf.gz ../data/aggregated/vcf/merged/filtered
Finally we will annotate VCF files using snpEff, which is included in this Git repository.
bash snpEff/snpEff.sh ../data/aggregated/vcf/merged/filtered.vcf.gz ../data/aggregated/vcf/merged/annotated
python3 snpEff/parse_snpEff.py --vcf_fn ../data/aggregated/vcf/merged/annotated.vcf.gz --out_fn ../processed_data/snpEff/parse_snpEff/annotated.tsv
Last but not least we aggregate all metadata:
Replace ../data/embl/metadata/ena_sequence.txt
and ../data/cngb/metadata/CNGBdb_VirusDIP.csv
below to your own files.
# GISAID
python3 metadata/parse_gisaid_metadata.py --metadata_dir ../data/gisaid/metadata/acknowledgement/ -o ../data/aggregated/metadata/gisaid_acknowledgement.tsv --type acknowledgement
# NCBI
python3 metadata/parse_ncbi_metadata.py --gb_fn ../data/ncbi/metadata/sequence.gb -o ../data/aggregated/metadata/ncbi.tsv
# EMBL
python3 metadata/parse_embl_metadata.py --embl_fn <../data/embl/metadata/ena_sequence.txt> -o ../data/aggregated/metadata/embl.tsv
# CNGB
python3 metadata/rename_cngb_metadata.py --in_fn <../data/cngb/metadata/CNGBdb_VirusDIP.csv> --out_fn ../data/aggregated/metadata/cngb.tsv
python3 metadata/merge_metadata.py --in_dir ../data/aggregated/metadata/ --out_prefix ../data/aggregated/metadata/merged --vcf_fn ../data/aggregated/vcf/merged/merged.vcf.gz
All done! You have now generated all the data on the CoV-Seq website.
Please see FAQ here.
We welcome bug report here. Please help us by providing as much information as possible.
If you are running the source code, please provide the following:
If you are using the CoV-Seq website, please provide the following: