QBIOLAB - Week 1 Assignment

Specific comments (point deductions are given in parentheses at the end of each comment, where applicable):

Although I can see the effort that went into the stacked histograms, they are very difficult to read. The simulated data are mostly occluded by the normal/Poisson distributions. You could have shown the same result more clearly by displaying only the simulated data as a histogram and drawing the normal/Poisson distributions with geom_line(), connecting the points at each break of the simulated data. (-0.25 per plot)
The legend appears misplaced at 30x, mainly because annotate() was used with hard-coded XY coordinates for the labels. You can avoid this by putting the ‘fill=’ parameter inside aes() so that each series appears in the legend, using scale_fill_manual() to set the exact color (e.g. a hex value) for each series, and using ‘legend.position=’ in theme() to place the legend in the right spot. (-0.1 per plot)
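As a rough illustration of that layout, here is a minimal sketch, not a reference solution. The read length (100), the 3x coverage, the seed, and every variable name are assumptions made for the example; only the genome size comes from your code.

```r
library(ggplot2)

set.seed(42)                 # assumed seed, only for reproducibility
genome_size     <- 1000000   # value used in the submitted code
read_length     <- 100       # assumption for this sketch
target_coverage <- 3         # assumption for this sketch

num_reads <- genome_size * target_coverage / read_length
starts    <- sample.int(genome_size - read_length + 1, num_reads, replace = TRUE)

# Per-base coverage via a difference array:
# +1 where a read starts, -1 one position past where it ends.
delta    <- tabulate(starts, genome_size + 1) -
            tabulate(starts + read_length, genome_size + 1)
coverage <- cumsum(delta)[seq_len(genome_size)]

# One row per coverage depth: observed count plus the two expected curves.
depths <- 0:max(coverage)
coverage_df <- data.frame(
  depth    = depths,
  observed = tabulate(coverage + 1, nbins = length(depths)),
  poisson  = genome_size * dpois(depths, lambda = target_coverage),
  normal   = genome_size * dnorm(depths, mean = target_coverage,
                                 sd = sqrt(target_coverage))
)

# Simulated data as bars, expected distributions as lines on top.
coverage_plot <- ggplot(coverage_df, aes(x = depth)) +
  geom_col(aes(y = observed), fill = "grey70") +
  geom_line(aes(y = poisson, colour = "Poisson")) +
  geom_line(aes(y = normal, colour = "Normal")) +
  scale_colour_manual(values = c("Poisson" = "blue", "Normal" = "red")) +
  labs(x = "Coverage depth", y = "Number of positions", colour = "Expected")
```

Because the expected curves are drawn as lines rather than filled bars, nothing sits in front of the simulated data, and the legend is generated from the colour mapping instead of being placed by hand.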
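```
ggplot(...) +
  geom_col(aes(..., fill = 'Poisson')) +
  geom_col(aes(..., fill = 'Normal')) +
  scale_fill_manual(values = c('Poisson' = 'blue', 'Normal' = 'red')) +
  # ggplot2 >= 3.5.0 prefers legend.position = 'inside' plus legend.position.inside = c(0.85, 0.85)
  theme(legend.position = c(0.85, 0.85), ...)
```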
Also, a rather minor point: the genome size (1000000) appears in your R code as a hard-coded number instead of being stored in a variable with an informative name. Please avoid hard-coded numbers whenever possible.
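One possible way to do this is sketched below. The read length, the 3x and 10x coverages, and the helper name reads_needed() are assumptions for illustration; only the genome size and the 30x coverage appear in this feedback.

```r
# Named constants at the top of the script instead of numbers scattered in the code.
genome_size      <- 1000000       # value used in the submitted code
read_length      <- 100           # assumption for this sketch
target_coverages <- c(3, 10, 30)  # 30x is mentioned above; 3x and 10x are assumed

# Hypothetical helper: number of reads needed to reach a given coverage.
reads_needed <- function(genome_size, read_length, coverage) {
  ceiling(genome_size * coverage / read_length)
}

num_reads <- reads_needed(genome_size, read_length, target_coverages)
```

Changing the genome size then only requires editing one line, and every derived value follows from it.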
In your answer to 2.5, you were supposed to write a possible genome sequence, but you gave a possible path through your graph and reported the sequence of edges. We wanted you to collapse the edges (TTC TCT -> TTCT) and propose a sequence (in your case: TTCTTATTGATTCATTT). (-0.25)
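The collapsing step can be sketched in a few lines of R. This assumes the path is available as an ordered vector of overlapping k-mers; the function name collapse_path() is made up for this example, and the path below is simply rebuilt from the proposed sequence rather than taken from your graph.

```r
# Collapse an ordered path of overlapping k-mers into a single sequence:
# keep the first k-mer whole, then append the last base of each later k-mer.
collapse_path <- function(kmers) {
  k <- nchar(kmers[1])
  # Consecutive k-mers must overlap by k - 1 bases, otherwise the path is invalid.
  for (i in seq_along(kmers)[-1]) {
    stopifnot(substr(kmers[i - 1], 2, k) == substr(kmers[i], 1, k - 1))
  }
  paste0(kmers[1], paste(substring(kmers[-1], k, k), collapse = ""))
}

two_edge_example <- collapse_path(c("TTC", "TCT"))   # "TTCT"

# Rebuild a 3-mer path from the proposed sequence, then collapse it again.
proposed  <- "TTCTTATTGATTCATTT"
n         <- nchar(proposed)
path      <- substring(proposed, 1:(n - 2), 3:n)
collapsed <- collapse_path(path)                     # same as `proposed`
```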
In your answer to 2.6: the major weakness of the de Bruijn graph is that it cannot resolve repeats unless the read length is longer than the repeat (in which case you would not be using de Bruijn graphs for assembly). You also cannot know the expected number of repeats before constructing the graph, unless the sequence has previously been assembled with longer reads that span the entire repeat. The number of times you count an identical k-mer only gives you a sense of how many times that particular sequence was sequenced, not how many instances of that sequence are present sequentially in the actual assembly. You would still benefit from longer and more accurate k-mers when resolving non-repetitive sequences, though (or promiscuous repetitive sequences that are only partially similar to one another, e.g. ATTCTA|ATTCGA|ATTCTA).
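A toy example of the repeat problem, using two made-up sequences that are not from the assignment: both contain the repeat AC three times, with the segments between the copies swapped. With k = 3 they produce exactly the same k-mers with the same counts, so no graph built from those k-mers can tell them apart. With k = 4 the k-mers span the repeat and the two sequences become distinguishable.

```r
# All k-mers of a sequence, in order of position.
kmers_of <- function(sequence, k) {
  n <- nchar(sequence)
  substring(sequence, 1:(n - k + 1), k:n)
}

# Hypothetical genomes: C-AC-G-AC-T-AC-G versus C-AC-T-AC-G-AC-G.
genome_a <- "CACGACTACG"
genome_b <- "CACTACGACG"

# k = 3: each k-mer reaches at most one base past the 2-base repeat on one side,
# so both genomes yield the same multiset of k-mers.
same_at_k3 <- identical(sort(kmers_of(genome_a, 3)), sort(kmers_of(genome_b, 3)))

# k = 4: k-mers now cover the repeat plus both flanking bases.
same_at_k4 <- identical(sort(kmers_of(genome_a, 4)), sort(kmers_of(genome_b, 4)))

# ACG occurs twice in both genomes, yet the counts say nothing about the order
# in which the repeat copies are arranged.
kmer_counts_a <- table(kmers_of(genome_a, 3))
```

Here same_at_k3 is TRUE and same_at_k4 is FALSE, which matches the point above: the k-mer counts alone do not reveal the arrangement of repeats unless the k-mers (and therefore the reads) are longer than the repeat.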

Code: 5/5 Answers: 2.75/3 Plots: 0.95/2

Total: 8.7/10
