Raniakhalil / Rania

0 stars 0 forks source link

Benchmarking variant identification tools for colorectal tumor: Comparative study between two variant calling programs #1

Open Raniakhalil opened 4 years ago

Raniakhalil commented 4 years ago

Downloads input files:

Known variant file

[wget ftp://ftp.ensembl.org/pub/release-99/variation/gvf/homo_sapiens/homo_sapiens-chr18.gvf.gz' -O homo_sapiens.vcf.gz](url)

Select variants on chr18 and correct chr name

Raniakhalil commented 4 years ago

gunzip homo_sapiens.vcf.gz grep "^#" homo_sapiens.vcf > homo_sapiens_chr18.vcf grep "^18" homo_sapiens.vcf | sed 's/^18/chr18/' >> homo_sapiens_chr18.vcf gatk IndexFeatureFile -I homo_sapiens_chr18.vcf

fastq files:

weget https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10960190 Standard fastq format for paired end reads, with reads 1 and 2 of each pair at the same position in 2 separate files for each DNA source awk '{if(NR%8 >= 1 && NR%8 < 5) print}' file.fastq > file_R1.fastq awk '{if(NR%8 == 0 || NR%8 >= 5) print}' file.fastq > file_R2.fastq B. Downloads programs:

  1. BWA alignment: https://github.com/drtamermansour/nu-ngs01/blob/master/Day-3/seq_alignment.md 1.1 install bwa (Windows) conda activate ngs1 conda install -c bioconda -y bwa and Sudo apt install bwa (Ubuntu) 1.2 index your genome mkdir -p ~/workdir/bwa_align/bwaIndex && cd ~/workdir/bwa_align/bwaIndex ln -s ~/workdir/refseqchrom18.fa . bwa index -a bwtsw refseqchrom18.fa 1.3 sequence alignment cd ~/workdir/bwa_align R1="$HOME/workdir/file_R1.fq" R2="$HOME/workdir/fqData/file_R2.fq" /usr/bin/time -v bwa mem bwaIndex/refseqchrom18.fa $R1 $R2 > ngs_project.sam

  2. GATK: https://github.com/drtamermansour/nu-ngs02/blob/master/Day2/gatk.md conda install -c bioconda gatk4

Create and go to install directory & Decompress   Time
cd $HOME/tools/sra-tools/    
Time curl https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.6/sratoolkit.2.9.6-ubuntu64.tar.gz -o sratoolkit.2.9.6-ubuntu64.tar.gz   Real 38m47.657s, user 0m1.964s, sys 0m5.673s
time tar -xvzf sratoolkit.2.9.6-ubuntu64.tar.gz   real 0m7.433s, user 0m4.381s, sys 0m1.362s
Check directory name (sra-tools version)    
ls sratoolkit.2.9.6-ubuntu64/ sratoolkit.current-ubuntu64.tar.gz    

Add location to system PATH (using current version directory name) & check installation and fastq-dump export PATH=$HOME/tools/sra-tools/sratoolkit.2.9.6-ubuntu64/bin:$PATH |   |   fastq-dump –help |   |   fastq-dump -V Result: fastq-dump : 2.9.6 |   |     | Downloading samples_Data Loading, counting & Unzip samples_data |   |   Mkdir -p ~/project/fqData && cd ~/project/fqData |   |   time wget –c https://sra-download.ncbi.nlm.nih.gov/traces/sra46/SRZ/010960/SRR10960190/DB_DS05-2-19_GTGAAA_L002_R1_009.fastq.gz |   | real 0m30.238s, user 0m0.007s, sys 0m0.003s

time echo $(zcat DB_DS05-2-19_GTGAAA_L002_R1_009.fastq.gz\ wc -l)/4\ bc Result: 4000000   real 0m7.249s, user 0m6.758s, sys 0m0.313s
Time Gunzip DB_DS05-2-19_GTGAAA_L002_R1_009.fastq.gz  
wget -c https://sra-download.ncbi.nlm.nih.gov/traces/sra46/SRZ/010960/SRR10960190/DB_DS05-2-19_GTGAAA_L002_R2_001.fastq.gz   real 0m14.358s, user 0m0.000s, sys 0m0.013s
time echo $(zcat DB_DS05-2-19_GTGAAA_L002_R1_009.fastq.gz|wc -l)/4|bc Result: 4000000    
time Gunzip DB_DS05-2-19_GTGAAA_L002_R2_001.fastq.gz  

Download Reference Genome download hg19 chromosome fasta files |   |   wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz |   |  

unzip and concatenate chromosome and contig fasta files  
tar zvfx chromFa.tar.gz    
cat chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa > 5_chr.fa    
rm chr*.fa    
grep -v ">" 5_chr.fa | wc | awk '{print $3-$1}' Result: 0373285   real 0m0.890s, user 0m0.832s, sys 0m0.031s

Create Reference Index

Install BWA conda activate ngs1 conda install -c bioconda bwa |   bwa index -p 4_chrbwaidx -a bwtsw 4_chr.fa |   bwtsw for long genomes -p index name -a index algorithm

Align to Reference Genome

aligning paired end reads bwa aln -t 4 5_chrbwaidx DB_DS05-2-19_GTGAAA_L002_R1_009.fastq > DB_DS05-2-19_GTGAAA_L002_R1_009.sai | real 21m22.515s user 19m7.939s sys 0m13.679s bwa aln -t 4 5_chrbwaidx DB_DS05-2-19_GTGAAA_L002_R2_001.fastq > DB_DS05-2-19_GTGAAA_L002_R2_001.sai | real 18m23.712s user 17m19.019s sys 0m5.856s

bwa sampe 5_chrbwaidx DB_DS05-2-19_GTGAAA_L002_R1_009.sai DB_DS05-2-19_GTGAAA_L002_R2_001.sai DB_DS05-2-19_GTGAAA_L002_R1_009.fastq DB_DS05-2-19_GTGAAA_L002_R2_001.fastq > DB_DS05-2-19_GTGAAA_L002_pe.sam Results: [bwa_sai2sam_pe_core] convert to sequence coordinate... [infer_isize] fail to infer insert size: too few good pairs [bwa_sai2sam_pe_core] time elapses: 1.70 sec [bwa_sai2sam_pe_core] changing coordinates of 0 alignments. [bwa_sai2sam_pe_core] align unmapped mate... [bwa_sai2sam_pe_core] time elapses: 0.05 sec [bwa_sai2sam_pe_core] refine gapped alignments... 0.18 sec [bwa_sai2sam_pe_core] print alignments... [bwa_sai2sam_pe_core] paired reads have different names: "HWI-ST1148:187:C48MFACXX:2:2214:1850:4532", "HWI-ST1148:187:C48MFACXX:2:1101:1465:1992"  

Generate and sorted BAM files for samfile in *.sam;do sample=${samfile%.sam} samtools view -hbo $sample.bam $samfile samtools sort $sample.bam -o $sample.sorted.bam done samtools view -H DS05-2-19_GTGAAA_L002.sorted.bam Result: @HD VN:1.6 SO:coordinate @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa sampe 4_chrbwaidx DB_DS05-2-19_GTGAAA_L002_R1_009.sai DB_DS05-2-19_GTGAAA_L002_R2_001.sai DB_DS05-2-19_GTGAAA_L002_R1_009.fastq DB_DS05-2-19_GTGAAA_L002_R2_001.fastq |  

Mapping QC

or bamFile in .sorted.bam;do output=${bamFile%.sorted.bam} samtools depth $bamFile | awk '{{sum+=$3}} END {{print "Average = ",sum/NR, "No of covered Nuc = ", NR}}' > $output.cov samtools flagstat $bamFile > $output.stat done Resuls: 2 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 mapped (0.00% : N/A) 2 + 0 paired in sequencing 1 + 0 read1 1 + 0 read2 0 + 0 properly paired (0.00% : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) Mark duplicate for sample in .sorted.bam;do name=${sample%.sorted.bam} java -Xmx2g -jar $picard_path/picard.jar MarkDuplicates INPUT=$sample OUTPUT=$name.dedup.bam METRICS_FILE=$name.metrics.txt; done

Install GATK    
conda install -c bioconda gatk4    
Indexing    
# samples for sample in *.dedup.bam;do #name=${sample%.dedup.bam} java -Xmx2g -jar $picard_path/picard.jar BuildBamIndex VALIDATION_STRINGENCY=LENIENT INPUT=$sample done    
# Reference java -Xmx2g -jar $picard_path/picard.jar CreateSequenceDictionary R=5_chr.fa O=5_chr.dict    
Download known variants    
Download known polymorphic sites    
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/GATK/common_all.vcf.gz -O common_all.vcf.gz    
Time gunzip common_all.vcf.gz   real 3m20.737s user 1m16.600s sys 0m15.169s
gatk IndexFeatureFile -I common_all.vcf    
Recalibrate Bases BQSR    
for sample in *.dedup.bam;do name=${sample%.dedup.bam} gatk --java-options "-Xmx2G" BaseRecalibrator \ -R 5_chr.fa -I $sample --known-sites common_all.vcf \ -O $ name.report   A USER ERROR has occurred: Number of read groups must be >= 1, but is 0
gatk --java-options "-Xmx2G" ApplyBQSR \ -R 4_chr.fa -I $sample -bqsr $name.report \ -O $name.bqsr.bam --add-output-sam-program-record --emit-original-quals done    

Joint variant calling using HaplotypeCaller

assess genotype likelihood per-sample for sample in *.bqsr.bam;do name=${sample%.bqsr.bam} gatk --java-options "-Xmx2G" HaplotypeCaller \ -R 4_chr.fa -I $sample \ --emit-ref-confidence GVCF \ --pcr-indel-model NONE \ -O $ DS05-2-19_GTGAAA_L002.gvcf Done ## Joint Genotyping gatk --java-options "-Xmx60G" GenotypeGVCFs \ -R 4_chr.fa \ -V DS05-2-19_GTGAAA_L002.gvcf \ --max-alternate-alleles 6 \ -O DS05-2-19_GTGAAA_L002.vcf |   | A USER ERROR has occurred: Argument emit-ref-confidence has a bad value: Can only be used in single sample mode currently. Use the --sample-name argument to run on a single sample out of a multi-sample BAM file

annotated output gatk --java-options "-Xmx60G" GenotypeGVCFs \ -R 5_chr.fa \ -V DS05-2-19_GTGAAA_L002.gvcf \ --max-alternate-alleles 6 \ --dbsnp common_all.vcf \ -O DS05-2-19_GTGAAA_L002_ann.vcf |   |  

check how many variant got annotated grep -v "^#" raw_vari DS05-2-19_GTGAAA_L002ants_ann.vcf | awk '{print $3}' | grep "^rs" | wc -l |   |  

VCF statistics |   |  

index the VCF file conda install -c bioconda tabix bgzip -c DS05-2-19_GTGAAA_L002_ann.vcf > DS05-2-19_GTGAAA_L002_ann.vcf.gz tabix -p vcf DS05-2-19_GTGAAA_L002_ann.vcf.gz |   |  

conda install -c bioconda rtg-tools rtg vcfstats DS05-2-19_GTGAAA_L002_ann.vcf.gz > stats.txt |   |   Split SNPs and indels |   |   gatk --java-options "-Xmx2G" SelectVariants \ -R 5_chr.fa \ -V DS05-2-19_GTGAAA_L002_ann.vcf \ --select-type-to-include SNP \ -O DS05-2-19_GTGAAA_L002_ann_SNP.vcf gatk --java-options "-Xmx2G" SelectVariants \ -R 4_chr.fa \ -V DS05-2-19_GTGAAA_L002_ann.vcf \ --select-type-to-include INDEL \ -O DS05-2-19_GTGAAA_L002_ann_INDEL.vcf |   |   Results |   |  

Figures wget https://raw.githubusercontent.com/dib-lab/dogSeq/master/scripts/densityCurves.R sudo Rscript -e "install.packages('ggplot2', contriburl=contrib.url('http://cran.r-project.org/'))" for f in SNP. INDEL.;do Rscript densityCurves.R "$f" done |   |  

for f in SNP. INDEL.;do Rscript densityCurves.R "$f" done

  1. Findmap version 2.2: https://aipl.arsusda.gov/software/findmap/ • Download findmapV2.2.zip onto a computer with the Unix operating system. • unzip findmapV2.2.zip. • Type runsim.script (run the program series script and generate reference map, variant file, fastq files). • Type runmap.script (run the alignment and variant calling programs). • maxhash is set to 87654321 and maxdup to 10 million in storemap.options ( reduce memory ) but values of 287654321 and 300 million are recommended for a 3-Gbase genome. • All programs in this series treat lower and uppercase as the same because storemap identifies, counts, stores, and links repeated k-mers to each other while hashing the reference file.
Raniakhalil commented 4 years ago

Trouble shootings

1. We get the permission denied error, by googling it we get a response of that may we need to set up the environment correctly so that can call the proper versions of fortran and c++.
2. By e-Mailing Paul Vanraden the developer of the tool we get these responses:

Vanraden, Paul paul.vanraden@usda.gov To:rania742002@yahoo.com Wed, Feb 19 at 4:21 PM Rania GATK starts with data that has already been aligned to a map such as BAM files from BWA. Findmap starts directly with fastq files and does the alignment and can call any known variants at the same time. The runsim.script generates example files and runmap.script runs the programs. Or perhaps the problem may be that executable files are not running on your system and then you may need local help to recompile the programs there. Paul

Vanraden, Paul paul.vanraden@usda.gov To:Rania khalil Thu, Feb 20 at 3:42 PM Rania The online instructions are the only ones available: https://aipl.arsusda.gov/software/findmap/ To understand the flow you should simulate practice files using the runsim.script and then test the alignment and variant calling using the runmap.script. The command lines are in there and it requires standard names for the input files. We developed the software using ifort compiler on Linux and have never tried other compilers or windows OS. Paul

  1. I am trying to sync FORTRAN for my analysis. N.B. The Intel Fortran compiler is available as part of the Intel Parallel Studio XE 2016 suite, which focuses on development of parallelism models in application software. It also includes Intel C++, Intel Math Kernel Library, Intel Integrated Performance Primitives, Intel Data Analytics Acceleration Library and performance analysis tools such as Intel VTune Amplifier and Intel Inspector. There are three forms of Parallel Studio XE: Composer, Professional, and Cluster. The Composer Edition includes the C++ and/or Fortran compilers, the performance libraries, and parallel models support. The Professional Edition adds the analysis tools that assist in debugging and tuning parallel applications. The Cluster Edition adds support for development of software for computer clusters. It includes all of the above plus a standards-based MPI Library, MPI communications profiling and analysis tool, MPI error checking and tuning tools, and cluster checker.

N.B. No difference response After alignment by BWA as in section B.1

4. Freebayes tool 
• installation of freebayes

wget http://clavius.bc.edu/~erik/freebayes/freebayes5d5b8ac0.tar.gz tar xzvf freebayes-5d5b8ac0.tar.gz

• Command of variant calling

freebayes -f refseqchrom18.fasta --ploidy 1 sra_data.bam > variants.vcf

Time calculating for each step:

Index reference file | sys 0m0.004s real 0m0.162s user 0m0.011s

-- | --

Subsetting to 5000 reads 1.time seqtk sample -s100 sra_data_R1.fastq 5000 > R1.fastq | real 0m2.886s user 0m2.480s sys 0m0.379s

  1. time seqtk sample -s100 sra_data_R2.fastq 5000 > R2.fastq | real 0m2.902s user 0m2.462s sys 0m0.374s

    Install bwa # index time bwa index -a bwtsw refseqchrom18.fasta | real 1m56.415s user 1m48.815s sys 0m0.803s

    Alignment | real 0m0.047s user 0m0.003s sys 0m0.005s

    Variant calling by Freebayes | real 0m0.162s user 0m0.015s syn 0m0.005s