reviewer 3 comments

shajoezhu commented 7 years ago

Reviewer: 3

Comments to the Author

Review of Zhi et al "Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data"

This paper describes how to infer the mixture decomposition of multiple strains of haploid organisms when multiple, related strains may be present in the same sample. This is an important problem in bacterial genetics, as argued by the authors, and they present a workable solution to this. The solution, which uses a copying model and Markov chain Monte Carlo analysis to extract the appropriate details for the copying model, is an interesting novel application of these methods. To the best of my understanding it is correctly implemented and performs a useful job.

[x] So I'm generally positive about this paper. I don't have major concerns, but as it stands it is not very easy to read. It is laid out in the classic mathematical style, which is to say to get to the results the reader has to slog through a lot of complex descriptions of mcmc updates, which have not been given any context or intuition. The writing is not bad but the ms would benefit hugely from a) a reorganisation to hide the gore from an interested biological-minded reader, and b) some effort to explain the details in intuitive terms. Some specific suggestions are listed below.
- I have moved the math part to the supplementary material.
[x] More technically, I found the technical details to be slightly unsatisfactorally explored. Specific concerns were the arbitrary value of G=20 (page 4) which scales the recombination rate. This is pretty unconvincing. I agree that the model usually allows for some misspecification of the recombination rate but something much better could be done. Either do the right thing (inference of G by EM or analogously) or show that it is insensitive.
- In practice, we deconvolve field samples at over 1 million markers. We use G = 20 to keep the recombination probabilities between two markers small, with a mean of 0.015. (Tagged by REV3.2)
[x] I also disliked the anecdotalaity of Figure 2 - I was not clear what the general takehome message was meant to be, and the plot with its many black bars is quite confusing.
- Removed the black bars. This is similar to reviewer 1's comment, tagged by REV3.3.1 and REV3.3.2. The take-home message is that including more relevant strains in the reference panel improves the deconvolution result, giving fewer switch errors and fewer genotype errors.

Minor comments:

[x] Figure 3: c is a noisy plot. It would be much clearer if shown with a smoothing. It would inform the reader to say what the take home message of all plots should be in the legend.
- Fixed this in the updated figure, tagged by REV3.4.1 and REV3.4.2.
[x] Page 2 right: what is c? it isn't defined? In general the model section needs some effort in clarification.
- "c" reflects how much data is available. The average coverage for the data (at the markers we deconvolute) is above 100, hence we set c = 100. This is addressed in the main text, tagged by REV3.5.
[x] Page 2: sp: inversley
- Thanks for spotting this, addressed by REV3.6.
[x] Page 3: titre: this is not a common term. What is wrong with concentration? I think this is what you mean anyway? I find no evidence that titre has this meaning in statistics, only in chemistry, though I appreciate that there are many fields I'm not familiar with.
- Yes and no. The log titre behaves in a similar way to the concentration parameter: it has the same expression for the expectation. Strictly, however, this is not the same as the Dirichlet distribution, which would result in a complicated form when computing the Hastings ratio for the Metropolis–Hastings algorithm, and in which the moves between x and x' are not symmetrical. To avoid confusion with the Dirichlet process, we do not call it the concentration parameter.
[x] Page 4: "Such erroneous markers are not currently inferred by DEploid, though this could be included in future versions." If it is easy, do it. If it is not easy, don't offer. In my experience very few pieces of academic software are maintained and developed in this way.
- We apply the filtering step to exclude these markers, addressed by REV3.8, in main text and supplementary material. This software will be maintained and developed as part of the Pf3k project. As the project finishes, it will likely be maintained through the MalariaGEN network.
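
For readers following the titre reply above, the standard Metropolis–Hastings acceptance probability is recalled here. The symbols $\pi$ (target density) and $q$ (proposal density) are generic textbook notation and are assumptions for illustration, not notation taken from the manuscript.

```latex
\alpha(x \to x') = \min\left(1,\; \frac{\pi(x')}{\pi(x)} \cdot \frac{q(x \mid x')}{q(x' \mid x)}\right)
```

When the proposal is symmetric, $q(x' \mid x) = q(x \mid x')$, the second factor (the Hastings ratio) equals 1 and only the ratio of target densities needs to be evaluated. An asymmetric proposal requires the full expression.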

jalmagro commented 7 years ago

"So I'm generally positive about this paper. I don't have major concerns, but as it stands it is not very easy to read. It is laid out in the classic mathematical style, which is to say to get to the results the reader has to slog through a lot of complex descriptions of mcmc updates, which have not been given any context or intuition. The writing is not bad but the ms would benefit hugely from a) a reorganisation to hide the gore from an interested biological-minded reader, and b) some effort to explain the details in intuitive terms. Some specific suggestions are listed below."

I think he/she has a point here. A broad, step-by-step discussion of the algorithm, with the math moved into the supp. material, would make the paper more appealing (and easier to read).

jalmagro commented 7 years ago

"More technically, I found the technical details to be slightly unsatisfactorally explored. Specific concerns were the arbitrary value of G=20 (page 4) which scales the recombination rate. This is pretty unconvincing. I agree that the model usually allows for some misspecification of the recombination rate but something much better could be done. Either do the right thing (inference of G by EM or analogously) or show that it is insensitive."

A fair point, although my experience with Pf is that the painting model tends to be very robust unless extreme values of recombination are used (I tried a range of values for the inbreeding analysis, for instance). We can rerun the model with different scaling factors and show this, or go for the EM run, but I would avoid implementing anything new at this point.
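
A rough sketch of what such a check on the scaling factor could look like before any full rerun. Everything here is an assumption for illustration: the form `1 - exp(-G * rate * distance)` is a common choice in copying models but is not confirmed to be the one used in the paper, and the per-base-pair rate, the marker spacings and the function names are hypothetical.

```python
import math

# Hypothetical genetic distance per base pair (illustrative value only).
MORGANS_PER_BP = 1e-6


def recomb_prob(distance_bp, G, morgans_per_bp=MORGANS_PER_BP):
    """Assumed form: probability of at least one switch between two markers."""
    return 1.0 - math.exp(-G * morgans_per_bp * distance_bp)


def mean_recomb_prob(distances_bp, G):
    """Average the assumed switch probability over a list of marker spacings."""
    probs = [recomb_prob(d, G) for d in distances_bp]
    return sum(probs) / len(probs)


# Hypothetical marker spacings in base pairs (not taken from field data).
spacings = [50, 200, 1_000, 5_000, 20_000]

# Tabulate how the mean switch probability moves as G is scaled.
for G in (5, 10, 20, 40, 80):
    print(G, round(mean_recomb_prob(spacings, G), 4))
```

Tabulating the probabilities only shows how strongly they depend on G. Whether the deconvolution output is insensitive would still need reruns at each value, comparing switch and genotype errors.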

jalmagro commented 7 years ago

"I also disliked the anecdotalaity of Figure 2 - I was not clear what the general takehome message was meant to be, and the plot with its many black bars is quite confusing."

We need a different representation for haplotypes, maybe just rendering differences.

DEploid-dev / dEploidPaper

reviewer 3 comments #13