Analysis of SARS-CoV-2 genome variants collected with freebayes variant caller, relevant to this publication: https://doi.org/10.3389/fmicb.2021.665041.
requires miniconda, python2.7 and python>=3. To install miniconda, see: https://docs.conda.io/en/latest/miniconda.html
git clone https://github.com/cfarkas/SARS-CoV-2-freebayes.git # clone repository
cd SARS-CoV-2-freebayes
conda config --add channels conda-forge # add conda-forge channel (if you haven't already done so)
conda config --add channels bioconda # add bioconda channel (if you haven't already done so)
conda env update --file environment.yml # install required programs
conda activate SARS-CoV-2-freebayes # activate SARS-CoV-2-freebayes enviroment
bash makefile.sh # make & install
sudo cp ./bin/* /usr/local/bin/ # Copy binaries into /usr/local/bin/ (require sudo privileges)
After these steps, a conda enviroment called SARS-CoV-2-freebayes can be managed as follows:
# To activate this environment, use
#
# $ conda activate SARS-CoV-2-freebayes
#
# To deactivate an active environment, use
#
# $ conda deactivate
By activating the enviroment, all scripts in the SARS-CoV-2-freebayes repository can be executed, without further installations.
Finally, we need to install a vcftools version that includes the --haploid flag as follows:
git clone https://github.com/cfarkas/vcftools.git # Forked Julien Y. Dutheil version: https://github.com/jydu/vcftools
cd vcftools
./autogen.sh
./configure
make # make
sudo make install # requires sudo privileges
and the latest version of sra toolkit. Please see instructions here:
https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit
For users working with small number of genomes (i.e.: < 60000) or have unlimited values of -s and -n parameters in their machines, use "nolimit" versions of binaries:
SARS-CoV-2-NGS-freebayes-nolimit
SARS-CoV-2-GISAID-freebayes-nolimit
SARS-CoV-2-FASTA-freebayes-nolimit ("Agnostic" FASTA variant caller)
SARS-CoV-2-processing-fasta-nolimit
SARS-CoV-2-merge-variants-nolimit
For a proper performance working with high number of genomes (i.e. > 60000), use these binaries:
SARS-CoV-2-NGS-freebayes
SARS-CoV-2-GISAID-freebayes
SARS-CoV-2-FASTA-freebayes ("Agnostic" FASTA variant caller)
SARS-CoV-2-processing-fasta
SARS-CoV-2-merge-variants
The latter binaries asume you can set these values:
ulimit -n 1000000 # Check your machine with: ulimit -n
ulimit -s 1000000 # Check your machine with: ulimit -s
Check README_ulimit for how to change these values in Ubuntu.
In order to obtain SARS-CoV-2 variants (viral frequency >= 0.5) users need to use SARS-CoV-2-NGS-freebayes and provide:
Assuming that binaries are in /usr/local/bin
, users can do the following:
samtools faidx covid19-refseq.fasta
SARS-CoV-2-NGS-freebayes -l <SRA_list> -g <covid19-refseq.fasta> -a <VF> -t <Threads>
This execution will:
We provided a curated list of 17560 SARS-CoV-2 worldwide datasets until July 28, 2020 (see SRA_Accessions_Jul_28_2020.tabular). We also provided curated lists in txt format by continent (see July_282020*.txt files). As an example, we will collect variants from July_28_2020_North_America.txt datasets using 30 threads. In Ubuntu it is recommended to increase open file limit and stack size accordingly at the number of genomes to process, otherwise these steps may crush. see README_ulimit for details in Ubuntu. In a given folder:
ulimit -n 1000000 && ulimit -s 1000000 # check if you can increase stack size and open file limit, see README_ulimit for details.
# Execute the pipeline, providing full path to July_28_2020_North_America.txt and covid19-refseq.fasta files:
SARS-CoV-2-NGS-freebayes -l July_28_2020_North_America.txt -g covid19-refseq.fasta -a 0.4999 -t 30
will collect variants (VF>=0.5) in each Sample.
In order to obtain SARS-CoV-2 variants from FASTA GISAID genomes, users need to use SARS-CoV-2-GISAID-freebayes and provide:
Assuming that binaries are in /usr/local/bin
, do the following:
samtools faidx /full/path/to/covid19-refseq.fasta
# Execute the pipeline, providing full path to merged.GISAID.fasta and covid19-refseq.fasta sequences:
ulimit -n 1000000 && ulimit -s 1000000 # check if you can increase stack size and open file limit, see README_ulimit for details.
SARS-CoV-2-GISAID-freebayes -f /full/path/to/merged.GISAID.fasta -g /full/path/to/covid19-refseq.fasta -t 10
samtools faidx /full/path/to/covid19-refseq.fasta
# Execute the pipeline, providing full path to merged.GISAID.fasta and covid19-refseq.fasta sequences:
ulimit -n 1000000 && ulimit -s 1000000 # check if you can increase stack size and open file limit, see README_ulimit for details.
SARS-CoV-2-FASTA-freebayes -f /full/path/to/merged.fasta -g /full/path/to/covid19-refseq.fasta -t 10
in which merged.fasta file contains all sequences in a single file. To produce this file, users can place in a folder all FASTA genomes to analyze and do:
cat *.fasta > merged.fasta
Assuming users previously runned samtools faidx /full/path/to/covid19-refseq.fasta
and binaries are in /usr/local/bin
, do:
ulimit -n 1000000 && ulimit -s 1000000 # check if you can set these values
SARS-CoV-2-GISAID-freebayes -f merged.GISAID.fasta -g covid19-refseq.fasta -t 10 # call variants, provide full path to files
vcftools --vcf merged.GISAID.AF.vcf --window-pi 50 --haploid --out merged.50 # 50 bp sliding window
vcftools --vcf merged.GISAID.AF.vcf --TajimaD 50 --haploid --out merged.50 # 50 bp sliding window
# joining pi with Tajima's D values and calculate Tajima's D confidence intervals
awk '{print $1"\t"$3"\t"$4"\t"$5}' merged.50.windowed.pi > merged.50.subset.windowed.pi
join -j 2 <(sort -k2 merged.50.subset.windowed.pi) <(sort -k2 merged.50.Tajima.D) -o 1.2,1.4,2.4> joined && sed -i 's/ /\t/'g joined
sort -k1 -n joined > merged.50.pi.D && rm joined merged.50.subset.windowed.pi
pi-tajima -f merged.50.pi.D -v merged.GISAID.AF.vcf # Run pi-tajima
The file "merged.50.pi.D" contains binned π and Tajima's D values and can be plotted in any sofware.
check "postprocessing_pi_D_output_files" subdirectory, containing:
merged.GISAID.AF.vcf files, containing aggregated variants, can be annotated by SnpEff here in Galaxy: https://toolshed.g2.bx.psu.edu/repository?repository_id=ec3193dd441340fc. These files can be further processed to obtain a summary of variants per viral gene. As an example, we will analyze worldwide GISAID variants until November 30, 2020 as follows:
Assuming binaries are in /usr/local/bin
, users previously runned samtools faidx covid19-refseq.fasta
and the merged VCF input file is called SnpEff-Nov-30-2020.GISAID.vcf
:
mkdir SnpEff-SARS-CoV-2
cd SnpEff-SARS-CoV-2
SnpEff_processing -v SnpEff-Nov-30-2020.GISAID.vcf
In each folder, variants_per_protein subfolder contain variants per protein. Files ending in ".SnpEff" contains parsed variants per consequence and ".counts" contains associated counts, respectively. Also, the script computed aminoacid changes (see SnpEff.AAchanges files). We suggest user-provided vcf files should be processed in a likewise manner, placing the vcf file in a specific folder and executing steps 3) and 4).
please visit our wiki page here: https://github.com/cfarkas/SARS-CoV-2-freebayes/wiki#i-two-step-full-pipeline