Assembly, Evaluation, and Annotation Pipeline

annamtown commented 6 years ago

The goal of this task is to combine what I have learned from the previous genome assembly assignments to create a single script that assembles, evaluates, and annotates three different genome assemblies (pacbio, reference assembly, and spades). This script will use QUAST, mummer plots, and Prokka to analyze each assembly.

annamtown commented 6 years ago

URL of reference assembly bash script on GitHub: https://github.com/annamtown/gene8940/blob/master/pipeline3.sh

qsub command used to submit job: qsub -q rcc-30d pipeline3.sh

version of Git repository used to run job: 0d3f74a

location of output directory for pipeline: /escratch4/s_11/s_11_Aug_17/assemblypipeline /escratch4/s_11/s_11_Aug_17/pipeline3

annamtown commented 6 years ago

I had difficulties with canu, which made it difficult to run quast and prokka. I believe I have fixed everything, but I used single command lines to do so. My most recent script should be reflect these changes, but it is not the most recently executed version. I reported my last executed script above, but here is the most recent script version for your reference: daf11c0

What is the N50 for the three assemblies? What does it tell you about the quality of the assemblies? Quast did not execute properly, so I ran quast from the command line in a quast directory I created within the pipeline3 directory. I found the N50 values in the .tvs files.

Reference assembly: N50 is 4641652

PacBio assembly: N50 is 4630757

Spades assembly: N50 is 133353

Based on solely on these N50 values, the reference assembly has the greatest N50 values, which indicates that it has the highest quality assembly. The PacBio/Canu assembly also has a slightly lower N50 length compared to the reference assembly - they differ by 10,895 bps.The SPAdes assembly has the lowest N50, indicating that it has the lowest quality compared to the PacBio and reference assemblies. Compared to the other assemblies' N50 values, which are in the millions, I would say that SPAdes is a very poor assembly with only a little over 100,000 bps.

What do the mummerplots tell you about the assemblies? Provide the plots below. Reference: https://github.com/annamtown/gene8940/blob/master/outputref_prefix.png PacBio/Canu: https://github.com/annamtown/gene8940/blob/master/outputpacbio_prefixls.png SPAdes: https://github.com/annamtown/gene8940/blob/master/outputspades_prefix.png

The reference assembly mummerplot looks very good - the dotplot illustrates a very straight line with very few "shifts" in the diagonals within each grid square. The line is red, indicating high similarity between sequences. The PacBio/Canu dotplot is poor. There are two diagonal lines in the plot - one in the top left corner and one toward the bottom left. This may indicate a collapsed repeat in the assembly. While the lines are red, indicating high sequence similarity, the plot should only exhibit one straight line. The spades dotplot is a bit more "shifty," with diagonals adjusting slightly within a greater number of grid "rectangles." This line is also red, indicating high similarity between sequences.

Mummerplots are dotpots that illustrate similarity between the reference and raw sequences.

How many coding sequences (CDS) did you find in each assembly? How do these numbers compare to the number of CDSs in the Ensembl reference genome annotation? What are the potential sources of variation and error that lead to differences among datasets?

Reference: 4,305, using this command (in prokka_directory_ref):

cut -f 3 PROKKA_10232017.gff | sort | grep -c 'CDS'

PacBio: 5,792, using this command (in prokka_directory_pacbio):

cut -f 3 PROKKA_10242017.gff | sort | grep -c 'CDS'

Spades: 4,242, using this command (in prokka_directory_spades directory):

cut -f 3 PROKKA_10232017.gff | sort | grep -c 'CDS'

Ensembl CDS: 4,140 (found on http://bacteria.ensembl.org/Escherichia_coli_str_k_12_substr_mg1655/Info/Annotation/#about)

The reference assembly and SPAdes assembly have very similar numbers of CDS compared to the Ensembl CDS total. However, the PacBio/Canu assembly reports an additional ~+1,600 CDSs compared to the Ensembl CDS number.

The potential sources of variation and error are due to different parameters within each assembly software and the quality of raw reads. Variation in read quality affects the efficiency and accuracy of the assembly. This is seen in a variety of statistics, such as the N50 values. The reference-based assembly has the highest of the reported N50 values, while the spades assembly has the lowest N50 value. A greater N50 is more desirable because it indicates longer sequences contain 50% of the genome. Ideally, you want higher N50 values and lower L50 values, where 50% of the genome is contained in very few, long sequences. Assemblers are also designed for particular sequence data. For example, Canu is specifically designed for PacBio and Oxford Nanopore data, Samtools/bwa is designed for short read data, and SPAdes is designed for Illumina or IonTorrent reads, but can also process PacBio, Oxford Nanopore and Sanger reads. Each assembler has its own pros and cons based on the type of raw sequencing data and the types of application desired from each assembly, such as evolution/phylogenetics, metagenomics, epigenomics, etc.

cbergman commented 6 years ago

your final script still won't run the canu assembler, since you haven't set the path to the GCC libraries properly. Line 54 should be the following:
```
export LD_LIBRARY_PATH=/usr/local/gcc/5.3.0/lib64:${LD_LIBRARY_PATH}
```
Also, lines 51 and 63 are redundant and not needed
N50 analysis and interpretation correct. However, the assembly with the largest N50 isn't always the "highest quality". N50 is one measure of quality.
Mummerplot interpretation for reference assembly is correct, but two diagonal lines in canu assembly represent a duplication, not a collapsed repeat. Mummerplot for Spades also shows that most of reference genome is covered but in many pieces
CDS counts are correct and reasonable interpretation of results. Could have been improved by noting that incompleteness of spades assembly leads to fewer genes relative to reference assembly. Higher CDS count in Canu assembly is because of high error rate in PacBio reads leading to small indels that disrupt gene predictions.
grade: 17/20

cbergman commented 6 years ago

Sorry, just realized you didn't load PNGs into the issue like I asked for. I won't bump you down for this, but just making a note before closing this assignment.

annamtown / gene8940

Assembly, Evaluation, and Annotation Pipeline #8