DEploid-dev / dEploidPaper


reviewer 1 comment #11

Closed shajoezhu closed 7 years ago

shajoezhu commented 7 years ago

Comments to the Author The authors have developed a tool, DEploid, to infer the number of strains in a mixed infection of Plasmodium falciparum, the relative proportion of each strain and their respective haplotypes. A tool like this would be extremely useful for analysis of malaria datasets and could greatly advance our knowledge of this disease.

The deconvolution method is another adaptation of the Li and Stephens model, this time applied to the deconvolution/phasing of SNPs in isolates with an unknown number of component clones, with unknown contributions. This is the type of data that is typically encountered by malaria researchers, and the authors make use of the MalariaGEN Pf3k dataset to demonstrate its application. The paper comes with software: DEploid.

This is a nice application of a well-established model for phasing. The paper is written in a very concise manner and requires some gap filling to become a bit more comprehensible. The results leave several questions open. While the authors do assess the performance of DEploid, general descriptions of the analysis process, data processing steps and model parameters are vague and/or lacking. The webpage describing DEploid is nicely done.

2.2 Model:

[Attached figures: pd0577-c inbreeding example; new interpretDEploid figure 2]

3.1 Accuracy:

3.2 Comparison to existing methods:

Concerns:

Supplementary material:

Minor comments:

jalmagro commented 7 years ago

"Although the focus on the paper is on the deconvolution/phasing a lot of results are given to comparing proportion estimation, for which there are already several well performing methods available (although dePLOID beats their accuracy marginally, albeit at much greater CPU time cost). The paper would have benefitted from further evaluation of phasing performance, which is examined in detail with 1) a generated mixture experiment, and 2) a simulated data set, looking at coverage. A more semi-realistic study would use data from the largest Pf3K sub population, taking the approximately clonal samples and simulating some data from these with the proposed uniform recombination model to generate ‘truth’ as well as creating some mixtures from this data. This would provide an assessment in more realistic setting with variable coverage and data quality. Switch and genotyping errors could then be evaluated in this setting."

This may be the most time-consuming thing to do, but it is totally doable; it will require some more thinking about how to mimic the coverage profiles characteristic of our field samples. The problem is that once we have this kind of simulated data, reviewers will ask us to compare with other methods, I'm sure. What do you reckon? Overkill?

jalmagro commented 7 years ago

"Furthermore, it is not at all clear that this software will be of much use to most researchers working with WGS (or at the very least high throughput SNP genotyping data) since reference panels are crucial to this algorithm and there may not be sufficient reference samples at hand"

Very easy to address: we just point them to our public datasets, including Pf3k! We can also make a note of Pf6; our next data release will contain 7K samples, all open access (by the end of the summer).

jalmagro commented 7 years ago

"Could the authors also comment in the paper on the potential use of in silico haplotypes derived from the read pairs themselves to help with the phasing here?"

Doable again, but I do not think it is worth going down this route. estMOI uses something similar to estimate COI.

jalmagro commented 7 years ago

"You assume a prior in which the haplotypes of the n strains are independent of each other. What happens if the input sample contains related strains?"

Is this really true? I thought the prior assumption is just that the diversity within the mixed infection is well represented by the strains of the reference panel. Is this right?

jalmagro commented 7 years ago

"A typical reference panel would contain haplotypes from field samples constructed from the user. Therefore, one might expect results similar to Panel I in Figure 2. A reference panel like this does not seem to affect estimates of the number of strains or their relative proportions in an infection, however haplotype inference does not look flash. Perhaps address this in the discussion? If haplotype inference is not reliable then this tool is not terribly useful as other popular tools are available to estimate strain numbers and their relative proportions."

An overkill approach would be to simulate a mixture of strains from two different populations and look at the results when composing different reference panels (pop A, pop B, and a mix of A/B).
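For concreteness, the panel-composition part of that experiment could be sketched as below. This is only an illustration: `compose_panel`, the toy haplotype matrices, and the panel sizes are all made up for the example; the real panels would be drawn from Pf3k sub-populations.

```python
import numpy as np

rng = np.random.default_rng(1)

def compose_panel(pop_a, pop_b, n_hap, mix=0.5):
    """Build a reference panel of n_hap haplotypes from two population
    matrices (haplotypes x sites); `mix` is the fraction taken from pop A."""
    n_a = int(round(n_hap * mix))
    n_b = n_hap - n_a
    idx_a = rng.choice(len(pop_a), size=n_a, replace=False)
    idx_b = rng.choice(len(pop_b), size=n_b, replace=False)
    return np.vstack([pop_a[idx_a], pop_b[idx_b]])

# Toy 0/1 haplotype matrices standing in for two sub-populations.
pop_a = rng.integers(0, 2, size=(50, 200))
pop_b = rng.integers(0, 2, size=(50, 200))

# The three panels for the experiment: all-A, all-B, and a 50/50 A/B mix.
panels = {
    "A": compose_panel(pop_a, pop_b, 10, mix=1.0),
    "B": compose_panel(pop_a, pop_b, 10, mix=0.0),
    "A/B": compose_panel(pop_a, pop_b, 10, mix=0.5),
}
```

Each panel would then be fed to DEploid in turn against the same simulated two-population mixture.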

shajoezhu commented 7 years ago

I am using clonal samples from asiaGroup1, subset to chromosome 14 (183,888 sites), and simulating coverage for two mixing ratios, 25/75% and 45/55%. I repeat this process 100 times. For each replicate:

I randomly pick 12 haplotypes from the 212 clonal samples; haplotypes 1:10 form the reference panel used for deconvolution, and haplotypes 11 and 12 are saved as the truth. I then compute the WSAF as WSAF = 0.25 * hap_11 + 0.75 * hap_12, and adjust the WSAF with error: includeErrorWSAF = WSAF * (1 - err) + (1 - WSAF) * err, where err = 0.01. I extract the coverage of sample 11 and simulate the alternative read count with a binomial distribution, where the number of trials is the total depth and the success probability is includeErrorWSAF. Subtracting the alternative read count from the total coverage gives the reference allele counts.

I then run DEploid (excluding sites where PLAF = 0, deconvolving 8,071 sites), and extract the number of switch and genotype errors.
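The read-count simulation step above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: `simulate_mixed_sample` and the toy inputs are invented for the example, and haplotypes are assumed to be 0/1 vectors over the same sites.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def simulate_mixed_sample(hap_11, hap_12, depth, prop=0.25, err=0.01):
    """Simulate ref/alt read counts for an in-silico mixture of two clonal
    haplotypes; `prop` is the fraction contributed by hap_11."""
    hap_11 = np.asarray(hap_11, dtype=float)
    hap_12 = np.asarray(hap_12, dtype=float)
    depth = np.asarray(depth)

    # Expected within-sample alternative allele frequency (WSAF).
    wsaf = prop * hap_11 + (1.0 - prop) * hap_12
    # Fold in a symmetric per-read error rate.
    wsaf_err = wsaf * (1.0 - err) + (1.0 - wsaf) * err
    # Draw alt counts binomially: trials = total depth, p = includeErrorWSAF.
    alt = rng.binomial(depth, wsaf_err)
    ref = depth - alt
    return ref, alt

# Toy usage: 5 sites, 25/75% mixture, per-site depths taken from sample 11.
hap_11 = [0, 1, 1, 0, 1]
hap_12 = [1, 1, 0, 0, 1]
depth = [80, 100, 60, 90, 70]
ref, alt = simulate_mixed_sample(hap_11, hap_12, depth)
```

The resulting ref/alt counts per site are then what DEploid takes as input for deconvolution.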

[Attached plots: genotype errors and switch errors]


shajoezhu commented 7 years ago

We see reductions in both genotype and miscopying error when doubling the reference panel size for simulated field samples.

