DEploid-dev / dEploidPaper


reviewer 1 comment #11

Closed shajoezhu closed 7 years ago

shajoezhu commented 7 years ago

Comments to the Author The authors have developed a tool, DEploid, to infer the number of strains in a mixed infection of Plasmodium falciparum, the relative proportion of each strain and their respective haplotypes. A tool like this would be extremely useful for analysis of malaria datasets and could greatly advance our knowledge of this disease.

The deconvolution method is another adaptation of the Li and Stephens model, this time applied to the deconvolution/phasing of SNPs in isolates with an unknown number of component clones, with unknown contributions. This is the type of data that is typically encountered by malaria researchers, and the authors make use of the MalariaGEN Pf3k dataset to demonstrate its application. The paper comes with software: DEploid.

This is a nice application of a well-established model for phasing. The paper is written in a very concise manner and requires some gap filling to become a bit more comprehensible. The results leave several questions open. While the authors do assess the performance of DEploid, general descriptions of the analysis process, data processing steps and model parameters are vague and/or lacking. The webpage describing DEploid is nicely done.

2.2 Model:

[Attached figures: pd0577-c inbreeding example; new interpretDEploid figure 2]

3.1 Accuracy:

3.2 Comparison to existing methods:

Concerns:

Supplementary material:

Minor comments:

jalmagro commented 7 years ago

"Although the focus on the paper is on the deconvolution/phasing a lot of results are given to comparing proportion estimation, for which there are already several well performing methods available (although dePLOID beats their accuracy marginally, albeit at much greater CPU time cost). The paper would have benefitted from further evaluation of phasing performance, which is examined in detail with 1) a generated mixture experiment, and 2) a simulated data set, looking at coverage. A more semi-realistic study would use data from the largest Pf3K sub population, taking the approximately clonal samples and simulating some data from these with the proposed uniform recombination model to generate ‘truth’ as well as creating some mixtures from this data. This would provide an assessment in more realistic setting with variable coverage and data quality. Switch and genotyping errors could then be evaluated in this setting."

This may be the most time-consuming thing to do, but it is totally doable; it will require some more thinking about how to mimic the coverage profiles characteristic of our field samples. The problem is that once we have this kind of simulated data, reviewers will ask us to compare with other methods, I'm sure. What do you reckon? Overkill?

jalmagro commented 7 years ago

"Furthermore, it is not at all clear that this software will be of much use to most researchers working with WGS (or at the very least high throughput SNP genotyping data) since reference panels are crucial to this algorithm and there may not be sufficient reference samples at hand"

Very easy to address: we just point them to our public datasets, including Pf3k! We can also make a note of Pf6; our next data release will contain 7K samples, all open access (by the end of the summer).

jalmagro commented 7 years ago

"Could the authors also comment in the paper on the potential use of in silico haplotypes derived from the read pairs themselves to help with the phasing here?"

Doable again, but I do not think it is worth going down this route. estMOI uses something similar to estimate COI.

jalmagro commented 7 years ago

"You assume a prior in which the haplotypes of the n strains are independent of each other. What happens if the input sample contains related strains?"

Is this really true? I thought the prior assumption is just that the diversity within the mixed infection is well represented by the strains of the reference panel. Is this right?

jalmagro commented 7 years ago

"A typical reference panel would contain haplotypes from field samples constructed from the user. Therefore, one might expect results similar to Panel I in Figure 2. A reference panel like this does not seem to affect estimates of the number of strains or their relative proportions in an infection, however haplotype inference does not look flash. Perhaps address this in the discussion? If haplotype inference is not reliable then this tool is not terribly useful as other popular tools are available to estimate strain numbers and their relative proportions."

An overkill approach would be to simulate a mixture of strains from two different populations and look at the results when composing different reference panels (pop A, pop B, and a mix of A/B).
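For concreteness, the panel-composition part of that experiment could be sketched as below. This is only an illustration: `compose_panel`, the toy haplotype matrices, and the panel sizes are all made up for the example; the real panels would be drawn from Pf3k sub-populations.

```python
import numpy as np

rng = np.random.default_rng(1)

def compose_panel(pop_a, pop_b, n_hap, mix=0.5):
    """Build a reference panel of n_hap haplotypes from two population
    matrices (haplotypes x sites); `mix` is the fraction taken from pop A."""
    n_a = int(round(n_hap * mix))
    n_b = n_hap - n_a
    idx_a = rng.choice(len(pop_a), size=n_a, replace=False)
    idx_b = rng.choice(len(pop_b), size=n_b, replace=False)
    return np.vstack([pop_a[idx_a], pop_b[idx_b]])

# Toy 0/1 haplotype matrices standing in for two sub-populations.
pop_a = rng.integers(0, 2, size=(50, 200))
pop_b = rng.integers(0, 2, size=(50, 200))

# The three panels for the experiment: all-A, all-B, and a 50/50 A/B mix.
panels = {
    "A": compose_panel(pop_a, pop_b, 10, mix=1.0),
    "B": compose_panel(pop_a, pop_b, 10, mix=0.0),
    "A/B": compose_panel(pop_a, pop_b, 10, mix=0.5),
}
```

Each panel would then be fed to DEploid in turn against the same simulated two-population mixture.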

shajoezhu commented 7 years ago

I am using clonal samples from asiaGroup1, subset to chromosome 14 (183,888 sites), and simulating coverage for two mixing ratios, 25/75% and 45/55%. I repeat this process 100 times. For each replicate:

I randomly pick 12 haplotypes from the 212 clonal samples; haplotypes 1:10 form the reference panel used for deconvolution, and haplotypes 11 and 12 are saved as the truth. I then compute the WSAF as WSAF = 0.25 * hap_11 + 0.75 * hap_12, and adjust the WSAF with error: includeErrorWSAF = WSAF * (1 - err) + (1 - WSAF) * err, where err = 0.01. I extract the coverage of sample 11 and simulate the alternative read count with a binomial distribution, where the number of trials is the total depth and the success probability is includeErrorWSAF. Subtracting the alternative read count from the total coverage gives the reference allele counts.

I then run DEploid (excluding sites where PLAF = 0, deconvolving 8,071 sites), and extract the number of switch and genotype errors.
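The read-count simulation step above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: `simulate_mixed_sample` and the toy inputs are invented for the example, and haplotypes are assumed to be 0/1 vectors over the same sites.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def simulate_mixed_sample(hap_11, hap_12, depth, prop=0.25, err=0.01):
    """Simulate ref/alt read counts for an in-silico mixture of two clonal
    haplotypes; `prop` is the fraction contributed by hap_11."""
    hap_11 = np.asarray(hap_11, dtype=float)
    hap_12 = np.asarray(hap_12, dtype=float)
    depth = np.asarray(depth)

    # Expected within-sample alternative allele frequency (WSAF).
    wsaf = prop * hap_11 + (1.0 - prop) * hap_12
    # Fold in a symmetric per-read error rate.
    wsaf_err = wsaf * (1.0 - err) + (1.0 - wsaf) * err
    # Draw alt counts binomially: trials = total depth, p = includeErrorWSAF.
    alt = rng.binomial(depth, wsaf_err)
    ref = depth - alt
    return ref, alt

# Toy usage: 5 sites, 25/75% mixture, per-site depths taken from sample 11.
hap_11 = [0, 1, 1, 0, 1]
hap_12 = [1, 1, 0, 0, 1]
depth = [80, 100, 60, 90, 70]
ref, alt = simulate_mixed_sample(hap_11, hap_12, depth)
```

The resulting ref/alt counts per site are then what DEploid takes as input for deconvolution.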

[Attached plots: genotype errors and switch errors]


shajoezhu commented 7 years ago

We see reductions in both genotype and miscopying error when doubling the reference panel size for simulated field samples.

