[x] p76, point 2: it is said that one must use read 1 or read 2 in the region where they overlap. Ideally, one would actually check that they agree, and if not, either use the call with the highest quality, or ignore these reads at that position.
[x] p76, last equation, missing a "|" at the end.
[x] p83, last paragraph: "2000kb" should be "2000bp"
[x] Section 4.4.1: The beta-values have a bimodal distribution in most samples, with a peak around 1 and a peak around 0. It is unclear to me whether the peak at 0 represents a true biological phenomenon, or if this is simply the effect of not taking incomplete bisulfite conversion into account.
[x] p108: it is stated that "it does not make sense to refer to individual cytosines as methylated". I think that it can actually be useful to filter out positions that are not methylated and where the signal is driven by incomplete conversion, the question then becomes to identify methylated [PH: cytosines] that are never methylated, which is really the same thing as defining methylated [PH: cytosines] that are methylated (at least in some cell subpopulation of the sample).
I have clarified this point and noted how methtuple handles such overlapping mates.
Fixed, thank you.
Fixed, thank you.
I believe this is a true biological phenomenon. All figures in this section use a minimum sequencing coverage of 10x. We can compute the probability of observing 10 Cs from 10 reads under the assumption that all Cs are due to incomplete conversion. Let X = the number of Cs and p = the incomplete bisulfite-conversion rate = 0.01 (conversion rates are typically > 99%). Then Pr(X = 10) = 0.01^10 = 10^-20. This makes it unlikely that these peaks at 0 are simply due to incomplete bisulfite-conversion.
While I understand this line of reasoning, as I argue in my thesis, I believe this terminology can be an unnecessary source of confusion.
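The coverage argument given in the response to the Section 4.4.1 comment can be checked numerically. The following is a minimal sketch, assuming that reads are independent and that each read of an unmethylated cytosine fails to convert with the same probability p = 0.01, so that X follows a Binomial(10, 0.01) distribution. The helper `binom_pmf` and the variable names are illustrative only and are not taken from the thesis or from methtuple.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Pr(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


coverage = 10          # minimum sequencing coverage used for the figures
p_unconverted = 0.01   # assumed per-read incomplete-conversion rate

# Probability that all 10 reads show a C purely through incomplete conversion.
p_all_c = binom_pmf(coverage, coverage, p_unconverted)
print(f"Pr(X = {coverage}) = {p_all_c:.1e}")

# For comparison: probability that at least one of the 10 reads shows a C.
p_at_least_one = 1 - binom_pmf(0, coverage, p_unconverted)
print(f"Pr(X >= 1) = {p_at_least_one:.3f}")
```

Under these assumptions the first quantity reproduces the 0.01^10 = 10^-20 figure quoted in the response. The second quantity is included only to show how the same helper extends to other counts.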
Examiner #1