Open Raniakhalil opened 4 years ago
gunzip homo_sapiens.vcf.gz grep "^#" homo_sapiens.vcf > homo_sapiens_chr18.vcf grep "^18" homo_sapiens.vcf | sed 's/^18/chr18/' >> homo_sapiens_chr18.vcf gatk IndexFeatureFile -I homo_sapiens_chr18.vcf
weget https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10960190 Standard fastq format for paired end reads, with reads 1 and 2 of each pair at the same position in 2 separate files for each DNA source awk '{if(NR%8 >= 1 && NR%8 < 5) print}' file.fastq > file_R1.fastq awk '{if(NR%8 == 0 || NR%8 >= 5) print}' file.fastq > file_R2.fastq B. Downloads programs:
BWA alignment: https://github.com/drtamermansour/nu-ngs01/blob/master/Day-3/seq_alignment.md 1.1 install bwa (Windows) conda activate ngs1 conda install -c bioconda -y bwa and Sudo apt install bwa (Ubuntu) 1.2 index your genome mkdir -p ~/workdir/bwa_align/bwaIndex && cd ~/workdir/bwa_align/bwaIndex ln -s ~/workdir/refseqchrom18.fa . bwa index -a bwtsw refseqchrom18.fa 1.3 sequence alignment cd ~/workdir/bwa_align R1="$HOME/workdir/file_R1.fq" R2="$HOME/workdir/fqData/file_R2.fq" /usr/bin/time -v bwa mem bwaIndex/refseqchrom18.fa $R1 $R2 > ngs_project.sam
GATK: https://github.com/drtamermansour/nu-ngs02/blob/master/Day2/gatk.md conda install -c bioconda gatk4
Create and go to install directory & Decompress | Time | |
---|---|---|
cd $HOME/tools/sra-tools/ | ||
Time curl https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.6/sratoolkit.2.9.6-ubuntu64.tar.gz -o sratoolkit.2.9.6-ubuntu64.tar.gz | Real 38m47.657s, user 0m1.964s, sys 0m5.673s | |
time tar -xvzf sratoolkit.2.9.6-ubuntu64.tar.gz | real 0m7.433s, user 0m4.381s, sys 0m1.362s | |
Check directory name (sra-tools version) | ||
ls sratoolkit.2.9.6-ubuntu64/ sratoolkit.current-ubuntu64.tar.gz |
Add location to system PATH (using current version directory name) & check installation and fastq-dump export PATH=$HOME/tools/sra-tools/sratoolkit.2.9.6-ubuntu64/bin:$PATH | | fastq-dump –help | | fastq-dump -V Result: fastq-dump : 2.9.6 | | | Downloading samples_Data Loading, counting & Unzip samples_data | | Mkdir -p ~/project/fqData && cd ~/project/fqData | | time wget –c https://sra-download.ncbi.nlm.nih.gov/traces/sra46/SRZ/010960/SRR10960190/DB_DS05-2-19_GTGAAA_L002_R1_009.fastq.gz | | real 0m30.238s, user 0m0.007s, sys 0m0.003s
time echo $(zcat DB_DS05-2-19_GTGAAA_L002_R1_009.fastq.gz\ | wc -l)/4\ | bc Result: 4000000 | real 0m7.249s, user 0m6.758s, sys 0m0.313s | |
---|---|---|---|---|
Time Gunzip DB_DS05-2-19_GTGAAA_L002_R1_009.fastq.gz |
wget -c https://sra-download.ncbi.nlm.nih.gov/traces/sra46/SRZ/010960/SRR10960190/DB_DS05-2-19_GTGAAA_L002_R2_001.fastq.gz | real 0m14.358s, user 0m0.000s, sys 0m0.013s | |
---|---|---|
time echo $(zcat DB_DS05-2-19_GTGAAA_L002_R1_009.fastq.gz|wc -l)/4|bc Result: 4000000 | ||
time Gunzip DB_DS05-2-19_GTGAAA_L002_R2_001.fastq.gz |
Download Reference Genome download hg19 chromosome fasta files | | wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz | |
unzip and concatenate chromosome and contig fasta files | ||
---|---|---|
tar zvfx chromFa.tar.gz | ||
cat chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa > 5_chr.fa | ||
rm chr*.fa | ||
grep -v ">" 5_chr.fa | wc | awk '{print $3-$1}' Result: 0373285 | real 0m0.890s, user 0m0.832s, sys 0m0.031s |
Install BWA conda activate ngs1 conda install -c bioconda bwa | bwa index -p 4_chrbwaidx -a bwtsw 4_chr.fa | bwtsw for long genomes -p index name -a index algorithm
aligning paired end reads bwa aln -t 4 5_chrbwaidx DB_DS05-2-19_GTGAAA_L002_R1_009.fastq > DB_DS05-2-19_GTGAAA_L002_R1_009.sai | real 21m22.515s user 19m7.939s sys 0m13.679s bwa aln -t 4 5_chrbwaidx DB_DS05-2-19_GTGAAA_L002_R2_001.fastq > DB_DS05-2-19_GTGAAA_L002_R2_001.sai | real 18m23.712s user 17m19.019s sys 0m5.856s
bwa sampe 5_chrbwaidx DB_DS05-2-19_GTGAAA_L002_R1_009.sai DB_DS05-2-19_GTGAAA_L002_R2_001.sai DB_DS05-2-19_GTGAAA_L002_R1_009.fastq DB_DS05-2-19_GTGAAA_L002_R2_001.fastq > DB_DS05-2-19_GTGAAA_L002_pe.sam Results: [bwa_sai2sam_pe_core] convert to sequence coordinate... [infer_isize] fail to infer insert size: too few good pairs [bwa_sai2sam_pe_core] time elapses: 1.70 sec [bwa_sai2sam_pe_core] changing coordinates of 0 alignments. [bwa_sai2sam_pe_core] align unmapped mate... [bwa_sai2sam_pe_core] time elapses: 0.05 sec [bwa_sai2sam_pe_core] refine gapped alignments... 0.18 sec [bwa_sai2sam_pe_core] print alignments... [bwa_sai2sam_pe_core] paired reads have different names: "HWI-ST1148:187:C48MFACXX:2:2214:1850:4532", "HWI-ST1148:187:C48MFACXX:2:1101:1465:1992" |
---|
Generate and sorted BAM files for samfile in *.sam;do sample=${samfile%.sam} samtools view -hbo $sample.bam $samfile samtools sort $sample.bam -o $sample.sorted.bam done samtools view -H DS05-2-19_GTGAAA_L002.sorted.bam Result: @HD VN:1.6 SO:coordinate @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa sampe 4_chrbwaidx DB_DS05-2-19_GTGAAA_L002_R1_009.sai DB_DS05-2-19_GTGAAA_L002_R2_001.sai DB_DS05-2-19_GTGAAA_L002_R1_009.fastq DB_DS05-2-19_GTGAAA_L002_R2_001.fastq |
or bamFile in .sorted.bam;do output=${bamFile%.sorted.bam} samtools depth $bamFile | awk '{{sum+=$3}} END {{print "Average = ",sum/NR, "No of covered Nuc = ", NR}}' > $output.cov samtools flagstat $bamFile > $output.stat done Resuls: 2 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 mapped (0.00% : N/A) 2 + 0 paired in sequencing 1 + 0 read1 1 + 0 read2 0 + 0 properly paired (0.00% : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) Mark duplicate for sample in .sorted.bam;do name=${sample%.sorted.bam} java -Xmx2g -jar $picard_path/picard.jar MarkDuplicates INPUT=$sample OUTPUT=$name.dedup.bam METRICS_FILE=$name.metrics.txt; done
Install GATK | ||
---|---|---|
conda install -c bioconda gatk4 | ||
Indexing | ||
# samples for sample in *.dedup.bam;do #name=${sample%.dedup.bam} java -Xmx2g -jar $picard_path/picard.jar BuildBamIndex VALIDATION_STRINGENCY=LENIENT INPUT=$sample done | ||
# Reference java -Xmx2g -jar $picard_path/picard.jar CreateSequenceDictionary R=5_chr.fa O=5_chr.dict | ||
Download known variants | ||
Download known polymorphic sites | ||
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/GATK/common_all.vcf.gz -O common_all.vcf.gz | ||
Time gunzip common_all.vcf.gz | real 3m20.737s user 1m16.600s sys 0m15.169s | |
gatk IndexFeatureFile -I common_all.vcf | ||
Recalibrate Bases BQSR | ||
for sample in *.dedup.bam;do name=${sample%.dedup.bam} gatk --java-options "-Xmx2G" BaseRecalibrator \ -R 5_chr.fa -I $sample --known-sites common_all.vcf \ -O $ name.report | A USER ERROR has occurred: Number of read groups must be >= 1, but is 0 | |
gatk --java-options "-Xmx2G" ApplyBQSR \ -R 4_chr.fa -I $sample -bqsr $name.report \ -O $name.bqsr.bam --add-output-sam-program-record --emit-original-quals done |
Joint variant calling using HaplotypeCaller
VCF statistics | |
conda install -c bioconda rtg-tools rtg vcfstats DS05-2-19_GTGAAA_L002_ann.vcf.gz > stats.txt | | Split SNPs and indels | | gatk --java-options "-Xmx2G" SelectVariants \ -R 5_chr.fa \ -V DS05-2-19_GTGAAA_L002_ann.vcf \ --select-type-to-include SNP \ -O DS05-2-19_GTGAAA_L002_ann_SNP.vcf gatk --java-options "-Xmx2G" SelectVariants \ -R 4_chr.fa \ -V DS05-2-19_GTGAAA_L002_ann.vcf \ --select-type-to-include INDEL \ -O DS05-2-19_GTGAAA_L002_ann_INDEL.vcf | | Results | |
for f in SNP. INDEL.;do Rscript densityCurves.R "$f" done
Trouble shootings
1. We get the permission denied error, by googling it we get a response of that may we need to set up the environment correctly so that can call the proper versions of fortran and c++.
2. By e-Mailing Paul Vanraden the developer of the tool we get these responses:
Vanraden, Paul paul.vanraden@usda.gov To:rania742002@yahoo.com Wed, Feb 19 at 4:21 PM Rania GATK starts with data that has already been aligned to a map such as BAM files from BWA. Findmap starts directly with fastq files and does the alignment and can call any known variants at the same time. The runsim.script generates example files and runmap.script runs the programs. Or perhaps the problem may be that executable files are not running on your system and then you may need local help to recompile the programs there. Paul
Vanraden, Paul paul.vanraden@usda.gov To:Rania khalil Thu, Feb 20 at 3:42 PM Rania The online instructions are the only ones available: https://aipl.arsusda.gov/software/findmap/ To understand the flow you should simulate practice files using the runsim.script and then test the alignment and variant calling using the runmap.script. The command lines are in there and it requires standard names for the input files. We developed the software using ifort compiler on Linux and have never tried other compilers or windows OS. Paul
N.B. No difference response After alignment by BWA as in section B.1
4. Freebayes tool
• installation of freebayes
wget http://clavius.bc.edu/~erik/freebayes/freebayes5d5b8ac0.tar.gz tar xzvf freebayes-5d5b8ac0.tar.gz
• Command of variant calling
freebayes -f refseqchrom18.fasta --ploidy 1 sra_data.bam > variants.vcf
Time calculating for each step:
-- | --
Variant calling by Freebayes | real 0m0.162s user 0m0.015s syn 0m0.005s
Downloads input files:
Known variant file
[wget ftp://ftp.ensembl.org/pub/release-99/variation/gvf/homo_sapiens/homo_sapiens-chr18.gvf.gz' -O homo_sapiens.vcf.gz](url)
Select variants on chr18 and correct chr name