cfarkas / SARS-CoV-2-freebayes

Analysis of SARS-CoV-2 genome variants collected with freebayes variant caller
MIT License
8 stars 3 forks source link

SARS-CoV-2-freebayes

Analysis of SARS-CoV-2 genome variants collected with freebayes variant caller, relevant to this publication: https://doi.org/10.3389/fmicb.2021.665041.

Installation:

Option 1: Via conda/pip (recommended)

Option 2: Without using conda, program by program:

Note of performance and ulimit before using the binaries

For users working with small number of genomes (i.e.: < 60000) or have unlimited values of -s and -n parameters in their machines, use "nolimit" versions of binaries:

SARS-CoV-2-NGS-freebayes-nolimit
SARS-CoV-2-GISAID-freebayes-nolimit
SARS-CoV-2-FASTA-freebayes-nolimit ("Agnostic" FASTA variant caller)
SARS-CoV-2-processing-fasta-nolimit
SARS-CoV-2-merge-variants-nolimit

For a proper performance working with high number of genomes (i.e. > 60000), use these binaries:

SARS-CoV-2-NGS-freebayes
SARS-CoV-2-GISAID-freebayes
SARS-CoV-2-FASTA-freebayes ("Agnostic" FASTA variant caller)
SARS-CoV-2-processing-fasta
SARS-CoV-2-merge-variants

The latter binaries asume you can set these values:

ulimit -n 1000000    # Check your machine with: ulimit -n 
ulimit -s 1000000    # Check your machine with: ulimit -s

Check README_ulimit for how to change these values in Ubuntu.

Task Scenarios:

I) Colecting Variants (Sequence Read Archive datasets) with SARS-CoV-2-NGS-freebayes

In order to obtain SARS-CoV-2 variants (viral frequency >= 0.5) users need to use SARS-CoV-2-NGS-freebayes and provide:

Execution

Assuming that binaries are in /usr/local/bin, users can do the following:

samtools faidx covid19-refseq.fasta
SARS-CoV-2-NGS-freebayes -l <SRA_list> -g <covid19-refseq.fasta> -a <VF> -t <Threads>

This execution will:

Example

We provided a curated list of 17560 SARS-CoV-2 worldwide datasets until July 28, 2020 (see SRA_Accessions_Jul_28_2020.tabular). We also provided curated lists in txt format by continent (see July_282020*.txt files). As an example, we will collect variants from July_28_2020_North_America.txt datasets using 30 threads. In Ubuntu it is recommended to increase open file limit and stack size accordingly at the number of genomes to process, otherwise these steps may crush. see README_ulimit for details in Ubuntu. In a given folder:

ulimit -n 1000000 && ulimit -s 1000000    # check if you can increase stack size and open file limit, see README_ulimit for details.
# Execute the pipeline, providing full path to July_28_2020_North_America.txt and covid19-refseq.fasta files:
SARS-CoV-2-NGS-freebayes -l July_28_2020_North_America.txt -g covid19-refseq.fasta -a 0.4999 -t 30

will collect variants (VF>=0.5) in each Sample.

II) Collecting Variants (GISAID FASTA genomes) with SARS-CoV-2-GISAID-freebayes

Execution

In order to obtain SARS-CoV-2 variants from FASTA GISAID genomes, users need to use SARS-CoV-2-GISAID-freebayes and provide:

Assuming that binaries are in /usr/local/bin, do the following:

samtools faidx /full/path/to/covid19-refseq.fasta
# Execute the pipeline, providing full path to merged.GISAID.fasta and covid19-refseq.fasta sequences:
ulimit -n 1000000 && ulimit -s 1000000  # check if you can increase stack size and open file limit, see README_ulimit for details.
SARS-CoV-2-GISAID-freebayes -f /full/path/to/merged.GISAID.fasta -g /full/path/to/covid19-refseq.fasta -t 10 

To collect variants from non GISAID FASTA genomes (or FASTA genomes with header different as FASTA genomes from GISAID), use SARS-CoV-2-FASTA-freebayes binary, as follows:

samtools faidx /full/path/to/covid19-refseq.fasta
# Execute the pipeline, providing full path to merged.GISAID.fasta and covid19-refseq.fasta sequences:
ulimit -n 1000000 && ulimit -s 1000000  # check if you can increase stack size and open file limit, see README_ulimit for details.
SARS-CoV-2-FASTA-freebayes -f /full/path/to/merged.fasta -g /full/path/to/covid19-refseq.fasta -t 10 

in which merged.fasta file contains all sequences in a single file. To produce this file, users can place in a folder all FASTA genomes to analyze and do: cat *.fasta > merged.fasta

III) Nucleotide diversity and Tajima's D test calculation with pi-tajima

Assuming users previously runned samtools faidx /full/path/to/covid19-refseq.fasta and binaries are in /usr/local/bin, do:

ulimit -n 1000000 && ulimit -s 1000000                                                       # check if you can set these values
SARS-CoV-2-GISAID-freebayes -f merged.GISAID.fasta -g covid19-refseq.fasta -t 10             # call variants, provide full path to files
vcftools --vcf merged.GISAID.AF.vcf --window-pi 50 --haploid --out merged.50                 # 50 bp sliding window
vcftools --vcf merged.GISAID.AF.vcf --TajimaD 50 --haploid --out merged.50                   # 50 bp sliding window

# joining pi with Tajima's D values and calculate Tajima's D confidence intervals
awk '{print $1"\t"$3"\t"$4"\t"$5}' merged.50.windowed.pi > merged.50.subset.windowed.pi
join -j 2 <(sort -k2 merged.50.subset.windowed.pi) <(sort -k2 merged.50.Tajima.D) -o 1.2,1.4,2.4> joined && sed -i 's/ /\t/'g joined
sort -k1 -n joined > merged.50.pi.D && rm joined merged.50.subset.windowed.pi
pi-tajima -f merged.50.pi.D -v merged.GISAID.AF.vcf                                          # Run pi-tajima               

The file "merged.50.pi.D" contains binned π and Tajima's D values and can be plotted in any sofware.

check "postprocessing_pi_D_output_files" subdirectory, containing:

IV) Annotate and collect variants per protein on SnpEff-classified VCF variants

mkdir SnpEff-SARS-CoV-2
cd SnpEff-SARS-CoV-2
SnpEff_processing -v SnpEff-Nov-30-2020.GISAID.vcf    

In each folder, variants_per_protein subfolder contain variants per protein. Files ending in ".SnpEff" contains parsed variants per consequence and ".counts" contains associated counts, respectively. Also, the script computed aminoacid changes (see SnpEff.AAchanges files). We suggest user-provided vcf files should be processed in a likewise manner, placing the vcf file in a specific folder and executing steps 3) and 4).

V) More than 100000 GISAID sequences to analyze?

please visit our wiki page here: https://github.com/cfarkas/SARS-CoV-2-freebayes/wiki#i-two-step-full-pipeline