Open fae75933 opened 4 years ago
a) Report the N50 and L50 for both assemblies and state what these values mean. N50: Given a set of contigs sorted from longest to shortest, N50 is the length of the shortest contig needed to reach 50% of the total assembly length. L50: The smallest number of contigs whose summed length makes up half of the total assembly length.
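The N50 and L50 definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by the assembly statistics tool that produced the numbers in this thread. It assumes the denominator is the total assembly length (the sum of the contig lengths) rather than a reference genome size, and the function name `n50_l50` and the toy lengths are made up for the example.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    Assumption: "half" means half of the summed contig lengths,
    not half of a reference genome size.
    """
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # length is the shortest contig needed to reach 50%,
            # count is how many contigs it took to get there
            return length, count
    raise ValueError("no contig lengths given")


# Hypothetical toy assembly with a total length of 300
toy = [100, 80, 50, 40, 30]
print(n50_l50(toy))  # (80, 2)
```

With a single contig, as in the Canu result, N50 equals the contig length and L50 is 1.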
Canu: ecoli.contigs, N50 = 4631892, L50 = 1, Total length = 4631892, GC % = 50.77, # N's per 100 kbp = 0.00
Spades: contigs, N50 = 125511, L50 = 14, Total length = 4564842, GC % = 50.74, # N's per 100 kbp = 0.00
I am not sure if this is correct, but I tried to figure it out and this is what I got. This homework was very stressful.
b) Upload .pngs of mummerplots for both assemblies and describe what these plots show. Each plot is an alignment dotplot: one sequence is laid out on each axis, and a point is plotted at every position where the two sequences show similarity. The two plots look the same for Canu and Spades, which I think means I did something wrong, but I would rather submit something.
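The dotplot idea described in (b) can be sketched with exact k-mer matches. This is a simplification under stated assumptions: it only looks at the forward strand and at exact matches of a fixed length `k`, whereas mummerplot draws alignments from a .delta file. The function name `dotplot_points` and the toy sequence are hypothetical.

```python
def dotplot_points(seq_x, seq_y, k=3):
    """Return (x, y) start positions where the two sequences share an exact k-mer.

    Assumption: forward strand only, exact matches only.
    """
    # Index every k-mer of seq_y by its start position
    index = {}
    for j in range(len(seq_y) - k + 1):
        index.setdefault(seq_y[j:j + k], []).append(j)

    # Look up every k-mer of seq_x and record the matching coordinates
    points = []
    for i in range(len(seq_x) - k + 1):
        for j in index.get(seq_x[i:i + k], []):
            points.append((i, j))
    return points


# Comparing a toy sequence with itself gives an unbroken main diagonal
ref = "ACGTTGCA"
print(dotplot_points(ref, ref))
# [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
```

Two plots built from the same input will be identical point for point, which is why identical-looking plots for two different assemblies are a sign that the same file was plotted twice.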
c) The URL to the location of the script on GitHub: https://github.com/fae75933/BNIF8940/blob/master/Homework3 d) The Git revision used for your final analysis: 8e43e674480f5b240e1b23d6459810d751889678
a) The N50 and L50 explanations are correct. The N50 and L50 values for the Canu assembly are correct. The N50 and L50 values for the Spades assembly are close, but what you report are the stats for the Spades contigs instead of the scaffolds. You can use the following line in your script to generate N50 and L50 for both the Canu and Spades assemblies; line 50 is not necessary. https://github.com/fae75933/BNIF8940/blob/8e43e674480f5b240e1b23d6459810d751889678/Homework3#L48
b) You uploaded two plots, but both are for the Canu assembly. Please check the following line in your script: you used the canu.delta file instead of the spades.delta file for the Spades mummerplot.
https://github.com/fae75933/BNIF8940/blob/8e43e674480f5b240e1b23d6459810d751889678/Homework3#L63
The basic description is correct. To be more specific, the Canu .png shows an assembly of a single circular contig that is perfectly colinear with, and very similar in sequence to, the reference genome, but appears offset because it starts at a different first base than the reference genome. The Spades .png shows a highly fragmented assembly of scaffolds of a range of sizes that are very similar in sequence to the reference genome, with no major misassemblies relative to it. The mummerplot for Spades is attached as follows.
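As a side note on the offset described for the Canu plot: a circular contig that starts at a different base is a rotation of the reference, so the diagonal in the dotplot is split into two segments instead of running corner to corner. The sketch below illustrates this with toy strings. It assumes an exact, error-free rotation, which a real assembly would not be, and the function name `rotation_offset` is hypothetical.

```python
def rotation_offset(reference, contig):
    """Return s such that contig == reference[s:] + reference[:s], or None.

    Assumption: the contig is an exact rotation of the reference,
    with no sequencing or assembly differences.
    """
    if len(reference) != len(contig):
        return None
    # Every rotation of a string occurs inside the string doubled
    pos = (reference + reference).find(contig)
    return pos if pos != -1 else None


ref = "ACGTTGCAGG"
contig = ref[4:] + ref[:4]  # same circle, different first base
print(rotation_offset(ref, contig))  # 4
```

With an offset of `s`, the first part of the contig lines up with reference positions `s` onward and the rest lines up with reference positions `0` to `s - 1`, which gives the two parallel diagonal segments.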
c) Correct d) Correct
Goal: I hope to learn how to use large data sets to do different types of computational analysis.