bensassonlab / labprotocols

Repository of lab protocols used by the Bensasson Lab
3 stars 3 forks source link

Protocol for Illumina read mapping and generating consensus #2

Open dbensasson opened 4 years ago

dbensasson commented 4 years ago

This is a general lab protocol for Illumina read mapping and generating a consensus.

Overall approach

Depending on your starting files you might skip steps 1 and 2.

  1. Get a reference strain genome sequence with annotations (sacCer3 for S. cerevisiae), store in home and scratch and check
  2. copy read data to scratch, unpack and run checks
  3. read map with BWA and check
  4. use samtools to generate a consensus and check
  5. generate fasta format sequence and check

In the examples below I switch between using S. cerevisiae and C. albicans as my example datasets. You need to adapt these examples to your own data.

dbensasson commented 4 years ago

1. Get a reference strain with annotations (sacCer3)

UCSC maintains a good static version of the Saccharomyces cerevisiae S288c reference genome that does not change (sacCer3). This is great because it means analyses are easy to replicate using this reference genome. It also has unchanging associated annotations that therefore can be mapped to all genomes that are mapped against this genome. And because S288c was the first eukaryote ever sequenced in 1996 its sequence and annotations are exceptionally complete and accurate. You can download it from here and they give very clear instructions on how to download.

[doudab@xfer1 douda]$ echo rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/ . > do_get_sacCer3
[doudab@xfer1 douda]$ chmod +x do_get_sacCer3 
[doudab@xfer1 douda]$ ./do_get_sacCer3 &

Check that for all files where an md5sum is available the transfer has gone ok (especially chromFa.tar.gz):

md5sum -c md5sum.txt 

In the above command, with * you are generating a md5 "hash" or id for every file in the directory and then with -c you compare that to the md5sums listed in md5sum.txt. A simpler alternative is to generate an md5sum for each file e.g. md5sum DB_D_10_CGAGGCTG-TATCCTCT_L007_R1_001.fastq.gz.

In a Mac, the command is md5 not md5sum.

The DNA sequence data to use for read mapping is in chromFa.tar.gz. You will use this regularly and it is not big, so move it to your home directory on sapelo. Each chromosome is in separate file, including the mtDNA (chrM.fa). Unpack all DNA sequence:

tar zxvf chromFa.tar.gz

Concatenate the chromosomes into a single file and rename each chromosome so it is clear this is sacCer3:

cat chr*.fa > sacCer3.mfa
cat sacCer3.mfa | sed 's/chr/saCer3chr/g' > temp
grep ">" temp 
mv temp sacCer3.mfa

In the future, sacCer3.mfa will be the file you will most commonly use as a reference genome.

2. copy read data to scratch, unpack and run checks

Copy read data to scratch (this is not the most elegant way to do it!):

screen
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_1* . > do_getilluminareads
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_2_* . >> do_getilluminareads
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_20_* . >> do_getilluminareads
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_21_* . >> do_getilluminareads
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_3_* . >> do_getilluminareads
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_4_* . >> do_getilluminareads
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_5_* . >> do_getilluminareads
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_6_* . >> do_getilluminareads
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_7_* . >> do_getilluminareads
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_8_* . >> do_getilluminareads
[doudab@xfer1 rotation1]$ echo rsync -avp /project/dblab/HISEQ_150317/DB_D_9_* . >> do_getilluminareads
chmod +x do_getilluminareads
./do_getilluminareads &

Run checks!

It is a good idea to use the md5sum command on both sides to check that files come through correctly. Please see the above comment for an example of how to use a md5sum file to check that files came through correctly.

You can also check that files are the correct size at the receiving end (using ls -l). Or use diff to check whether the content of files differ.

dbensasson commented 4 years ago

Steps 1 and 2 above applied to C. albicans, then steps 3 and 4:

Mapping Candida albicans reads to the SC5314_A22 reference on sapelo2

NOTE: This example is for verifying the identity of strains, checking ploidy, CNVs, and Loss of Heterozygosity. I do not go on to generate a fasta format consensus sequence for alignment and analysis of SNPS for multiple strains that are aligned to the same reference. Therefore, I do not use the -I (--skip-indels) option with bcftools mpileup in step 4 below and have not verified whether seqtk will work (see above) to generate a fasta format quality-filtered consensus genome sequence.

1. Download the reference strain and check

SC5314 is the reference genome for C. albicans A22 = haplotype A version 22

From my notes in September 2017: https://github.com/bensassonlab/douda/issues/80#issuecomment-327791086

echo  wget -r "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Candida_albicans/latest_assembly_versions/GCF_000182965.3_ASM18296v3/" > do_getSC5314_A22

Rename each chromosome so I can recognize it and distinguish it from other reference sequences:

[doudab@xfer1 references]$ zcat ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Candida_albicans/latest_assembly_versions/GCF_000182965.3_ASM18296v3/GCF_000182965.3_ASM18296v3_genomic.fna.gz | sed 's/>\(.*\)chromosome \(.\)/>SC5314A22chr\2 \1/g' > SC5314A22.mfa

Then check all is well:

[doudab@xfer1 references]$ fastaLength.pl -i SC5314A22.mfa 

Found 8 sequences in SC5314A22.mfa.

SC5314A22chr1 NC_032089.1 Candida albicans SC5314  sequence 3188341 bp
SC5314A22chr2 NC_032090.1 Candida albicans SC5314  sequence 2231883 bp
SC5314A22chr3 NC_032091.1 Candida albicans SC5314  sequence 1799298 bp
SC5314A22chr4 NC_032092.1 Candida albicans SC5314  sequence 1603259 bp
SC5314A22chr5 NC_032093.1 Candida albicans SC5314  sequence 1190845 bp
SC5314A22chr6 NC_032094.1 Candida albicans SC5314  sequence 1033292 bp
SC5314A22chr7 NC_032095.1 Candida albicans SC5314  sequence 949511 bp
SC5314A22chrR NC_032096.1 Candida albicans SC5314  sequence 2286237 bp

Search NCBI for NC_032089.1 and NC_032096.1 and check that they are indeed haplotype A. For example, for NC_032096.1:

Screen Shot 2019-06-02 at 9 02 15 AM

2. Download Illumina reads and check

Search for the accession and link for paired reads on EBI's SRA. See here for further instructions: https://github.com/bensassonlab/labprivate/issues/8

download using aspera SC5314_WT which is more reliable than wget (see https://github.com/bensassonlab/labprivate/issues/8) for aspera instructions:

[doudab@xfer1 SC5314_WT]$ echo ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR850/SRR850113/SRR850113_1.fastq.gz . > do_getSC5314WTreads
[doudab@xfer1 SC5314_WT]$ echo ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR850/SRR850113/SRR850113_2.fastq.gz . >> do_getSC5314WTreads
[doudab@xfer1 SC5314_WT]$ chmod +x do_getSC5314WTreads 
[doudab@xfer1 SC5314_WT]$ ./do_getSC5314WTreads 

Ran the scripts to get paired read files from EBI's SRA in screen on xfer.

The SC5314_WT Illumina read files are identical to the ones I downloaded for Bensasson et al 2019. These are short Illumina read data for the wildtype diploid version of SC5314 from the phasing paper by Muzzey et al (my usual positive control):

[doudab@xfer1 ~]$ diff /scratch/doudab/shared/candidaSequel/map2agp/illuminareads/SC5314_WT/SRR850113_1.fastq.gz /project/dblab/candidaDB/SC5314/sra/SRR850113_1.fastq.gz 
[doudab@xfer1 ~]$ diff /scratch/doudab/shared/candidaSequel/map2agp/illuminareads/SC5314_WT/SRR850113_1.fastq.gz /project/dblab/candidaDB/SC5314/sra/SRR850113_2.fastq.gz 
Binary files /scratch/doudab/shared/candidaSequel/map2agp/illuminareads/SC5314_WT/SRR850113_1.fastq.gz and /project/dblab/candidaDB/SC5314/sra/SRR850113_2.fastq.gz differ
[doudab@xfer1 ~]$ diff /scratch/doudab/shared/candidaSequel/map2agp/illuminareads/SC5314_WT/SRR850113_2.fastq.gz /project/dblab/candidaDB/SC5314/sra/SRR850113_2.fastq.gz 

3. Write a shell script to index the reference genome and run

doudab@n563 $ cat runindexrefs

#PBS -S /bin/bash
#PBS -N runindexrefs
#PBS -q bensasson_q
#PBS -l nodes=1:ppn=1
#PBS -l walltime=168:00:00
#PBS -l mem=8gb

#PBS -M doudab@uga.edu 
#PBS -m ae

cd /scratch/doudab/shared/candidaSequel/map2agp//references

module load BWA/0.7.17-foss-2016b
bwa index SC5314A22.mfa

then qsub runindexrefs.

4. Mapping Illumina reads to the reference assembly

a. A very simple example with only 2 files of (paired) Illumina reads

Note: I actually used the more complicated code below for my analysis, so I haven't double-checked this. Whoever reads and uses this: please let me know if it works!

doudab@n563 map2agp$ cat runMap2SC5314A22

#PBS -S /bin/bash
#PBS -N runMap2SC5314A22
#PBS -q bensasson_q
#PBS -l nodes=1:ppn=1
#PBS -l walltime=168:00:00
#PBS -l mem=8gb

#PBS -M doudab@uga.edu 
#PBS -m ae

cd /scratch/doudab/shared/candidaSequel/map2agp/illuminareads/SC5314_WT/

module load BWA/0.7.17-foss-2016b
module load SAMtools/1.9-foss-2016b

bwa mem /scratch/doudab/shared/candidaSequel/map2agp/references/SC5314A22.mfa SRR850113_1.fastq.gz SRR850113_2.fastq.gz | samtools view -b - | samtools sort -o SC5314_WT.bam -

b. A more complicated example for 6 pairs of Illumina read files

Often, Illumina reads are generated in multiple different runs. Each run or lane will lead to a pair of files e.g. in Bensasson et al 2019, the Earlham institute generated 6 pairs of Illumina read files for each Candida albicans oak strain. We usually map each pair of read files to the reference genome (generating a .bam file each time), then merge all the bam files into one large bam file that contains all the reads for that strain mapped to the reference.

This example shows how to use a for loop with the unix language bash, and also shows use of the unix command basename. If you are more comfortable with scripting languages such as python, then you could use for loops in python to generate your shell script. It is also ok to type out your shell script long hand.

If you type out long-hand, then your error rate will probably be high: careful checks! If you use the example unix for loop, it also might not be doing what you think: careful checks!

doudab@n563 map2agp$ cat runMap2SC5314A22

#PBS -S /bin/bash
#PBS -N runMap2SC5314A22
#PBS -q bensasson_q
#PBS -l nodes=1:ppn=1
#PBS -l walltime=168:00:00
#PBS -l mem=8gb

#PBS -M doudab@uga.edu 
#PBS -m ae

cd /scratch/doudab/shared/candidaSequel/map2agp/illuminareads/SC5314_WT/

module load BWA/0.7.17-foss-2016b
module load SAMtools/1.9-foss-2016b

for sample in *_1.fastq.gz
do
  i=`basename \$sample _1.fastq.gz`
  input1="${i}_1.fastq.gz"
  input2="${i}_2.fastq.gz"
  output="${i}.readpair"

  bwa mem /scratch/doudab/shared/candidaSequel/map2agp/references/SC5314A22.mfa $input1 $input2 | samtools view -b - | samtools sort -o $output.bam -
done

samtools merge /scratch/doudab/shared/candidaSequel/map2agp/mapped/SC5314_WT.bam *.readpair.bam 

4. Generating a consensus for the Illumina reads mapped to reference:

Use bcftools (from Heng Li who also made samtools) to use the alignment and quality scores to generate consensus base calls for the Illumina reads at every position in the reference. The main calling segment of bcftools requires a choice between the -m option and the old -c option. Use the -c option. Do not use the the new -m option for multi-allelic base calls for Bensasson lab work (see comment below). The -m option will treat multi-allelic base calls as an inherent conflict and will result in a low quality score for the consensus: this is a major problem for fungi with low ploidy (haploid, homozygous or diploid with high read depth data). We relax the default samtools read depth limit (read depth 250, increased to 100,000). Having a high -d option supposedly requires memory, but this script required very little memory for yeast genomes. It does however take a long time to run on high-depth genomes: approx 50 mins for 300x 15Mbp genome, but only 12 mins for 75x 14Mbp genome.

An example shell script:

doudab@n562 map2agp$ cat runbam2vcfSC5314_WT

#PBS -S /bin/bash
#PBS -N runbam2vcfSC5314_WT
#PBS -q bensasson_q
#PBS -l nodes=1:ppn=1
#PBS -l walltime=168:00:00
#PBS -l mem=12gb

#PBS -M doudab@uga.edu 
#PBS -m ae

module load BCFtools/1.9-foss-2016b

cd /scratch/doudab/shared/candidaSequel/map2agp/mapped

bcftools mpileup -d 100000 -f /scratch/doudab/shared/candidaSequel/map2agp/references/SC5314A22.mfa SC5314_WT.bam | bcftools call -c > SC5314_WTc.vcf

IMPORTANT If you plan to create an alignment for phylogenetic or chromosome painting analysis then you need to change the above command so it includes the -I option to ignore small insertions or deletions. For example:

bcftools mpileup -I -d 100000 -f /scratch/doudab/shared/candidaSequel/map2agp/references/SC5314A22.mfa SC5314_WT.bam | bcftools call -c > SC5314_WTc.vcf

Use B-allele frequency plots to check your mapped reads and consensus

This generates shell scripts (e.g. runBafSC5314) that uses R (version 3.5.0-foss-2018a-X11-20180131-GACRC) to generate B-allele frequency plots for each strain using my perl script, vcf2allelePlot.pl.

An example

module load R/3.5.0-foss-2018a-X11-20180131-GACRC  
vcf2allelePlot.pl -i SC5314_WTm.vcf

Browse the output files:

dbensasson commented 4 years ago

5. Convert the vcf file to a fasta format file and check

On our interactive node (qsub -I -l walltime=24:00:00 -l nodes=1:ppn=1:dbnode instead of qlogin), use

module load BCFtools/1.9-foss-2016b
time vcfutils.pl vcf2fq SRR850113.vcf > SRR850113.fq

This took 1.5 minutes with C. albicans data (16 Mbp).

This filters all sequence so it is lower-case in the fasta format output file if it's phred-scaled quality is < Q40 (ie error rate is over 1 in 10,000 nucleotides):

module load seqtk/1.2-foss-2016b
seqtk seq -q 40 -A SRR850113.fq > SRR850113.fa

Look at your data and be familiar with the file formats

Browse around all the files you generate using less, head and tail (try man head if you're unsure about how these work). Use samtools view with less in the case of bam files. Simply type samtools view for the simplest usage instructions.

Summarizing the contents of fasta files

As well as seqtk, I use my own perl script to summarize base composition in a fasta file, because it summarizes every kind of base call in a file (not only ACGT): basecomp.pl. Please use this!

Generating a filtered consensus genome sequence for a whole genome alignment

Are you are generating a filtered consensus genome sequence for whole a whole genome alignment in fasta format because you would like to use phylogenetic or population genomic software that requires a fasta format alignment as input? If yes, then indels make alignment difficult. In the absence of indels relative to the reference, all sequences will be the same length as the reference, and sequences can be 'aligned' using the unix cat command.

If you are only interested in point substitutions for population genetic or phyolgenetic analysis, then use the -I option in your samtools mpileup command when generating the gvcf file. By default samtools mpileup will output invariant sites together with variant sites and the resulting file format is gvcf (genomewide vcf).