Homework 3 - Githubissues

fae75933 commented 4 years ago

Goal: I hope to learn how to use large data sets to do different types of computational analysis.

fae75933 commented 4 years ago

a) Report the N50 and L50 for both assemblies and state what these values mean. N50: for a set of contigs sorted from longest to shortest, N50 is the length of the contig at which the cumulative length first reaches 50% of the total assembly length. L50: the smallest number of contigs whose combined length makes up at least half of the total assembly length.
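To make the two definitions concrete, here is a minimal Python sketch. It is not taken from the homework script; the function name `n50_l50` and the example lengths are made up for illustration, and it assumes the usual convention of measuring against the total assembly length.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    # Walk the contigs from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running total to half of the
        # assembly gives N50 (its length) and L50 (how many contigs so far).
        if running >= half:
            return length, count


# Hypothetical contig lengths, total = 300, so half = 150.
example = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(example))  # (70, 2)
```

With a single contig, as in the Canu result, N50 equals the total length and L50 is 1.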

Canu (ecoli.contigs): N50 = 4631892, L50 = 1, Total length = 4631892, GC % = 50.77, # N's per 100 kbp = 0.00

Spades (contigs): N50 = 125511, L50 = 14, Total length = 4564842, GC % = 50.74, # N's per 100 kbp = 0.00

I am not sure if this is correct but I tried to figure it out and this is what I got. This homework was very stressful.

fae75933 commented 4 years ago

b) Upload .pngs of mummerplots for both assemblies and describe what these plots show.

(images: canu, spades)

These plots are alignment dotplots: one sequence is laid out on each axis, and a point is plotted at every position where the two sequences show similarity. The two plots look the same for Canu and Spades, which I think means I did something wrong, but I would rather submit something.

fae75933 commented 4 years ago

c) The URL to the location of the script on GitHub: https://github.com/fae75933/BNIF8940/blob/master/Homework3

d) The Git revision used for your final analysis: 8e43e674480f5b240e1b23d6459810d751889678

shunhuahan commented 4 years ago

a) The N50 and L50 explanations are correct. The N50 and L50 values for the CANU assembly are correct. The N50 and L50 values for the spades assembly are close: what you reported are the stats for the spades contigs instead of the scaffolds. You can use the following line in your script to generate N50 and L50 for both the CANU and spades assemblies. Line 50 is not necessary. https://github.com/fae75933/BNIF8940/blob/8e43e674480f5b240e1b23d6459810d751889678/Homework3#L48

b) You uploaded two plots but both are for the CANU assembly. Please check the following line in your script: you used the canu.delta file instead of the spades.delta file for the spades mummerplot. https://github.com/fae75933/BNIF8940/blob/8e43e674480f5b240e1b23d6459810d751889678/Homework3#L63

The basic description is correct. To be more specific, the Canu .png shows an assembly of a single circular contig that is perfectly colinear and very similar in sequence to the reference genome, but appears offset because it starts at a different first base than the reference genome. The Spades .png shows a highly fragmented assembly of scaffolds of a range of sizes that are very similar in sequence to the reference genome, with no major misassemblies relative to the reference genome. The mummerplot for spades is attached below.

(image: mummerplot for the spades assembly)
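To illustrate why a circular contig that starts at a different base shows up as an offset diagonal, here is a toy Python sketch. It uses exact k-mer matching on a made-up 20-base sequence, which is a simplification and not how MUMmer finds matches; the function name `kmer_matches` and all sequences are hypothetical.

```python
def kmer_matches(ref, query, k):
    """Return (ref_pos, query_pos) pairs where an exact k-mer is shared."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            hits.append((i, j))
    return hits


ref = "ACGTTGCAAGCTTAGGCTCA"        # made-up "reference genome"
shift = 7                           # the contig starts 7 bases into the reference
contig = ref[shift:] + ref[:shift]  # same circular sequence, different first base

hits = kmer_matches(ref, contig, k=5)

# Each distinct value of (ref_pos - query_pos) is one diagonal in the dotplot.
diagonals = sorted({i - j for i, j in hits})
print(diagonals)  # [-13, 7]
```

The points fall on two diagonal segments rather than one, and both correspond to the same rotation (7 and -13 differ by the sequence length, 20). That is the offset pattern described for the Canu plot.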

c) Correct  d) Correct

Good job on the homework! @fae75933 Please don't stress out too much, and let us know if you need more time outside of class with me or @cbergman to help you with your script.
