DEploid-dev / dEploidPaper


Reviewer 2 comment #12

Closed shajoezhu closed 7 years ago

shajoezhu commented 7 years ago

Comments to the Author

The reconstruction of genomic sequences from mixed populations of pathogens from NGS data is highly relevant, as the most abundant haplotype not necessarily is most relevant or explains the infection phenotype. The ability to determine the multiplicity of infections, strain ratios and retrieving the haplotypes is highly wanted for surveillance and treatment of infectious diseases. Existing methods for this purpose are rare and limited, which results in high demand for new and better methods in this area. I therefore support the publication of this manuscript in principle. It presents an interesting method and is applicable to data of a highly relevant pathogen. However, the limitation to Plasmodium falciparum is also a fundamental problem. In bioinformatics, methods need to be as generic as possible. It would be impracticable to develop specific methods for the genotyping of each individual pathogen. The manuscript and the bioinformatics community would very much benefit from additional data (based on simulation and real NGS reads), which indicates the performance of dEploid for other species (see below).

Major points

Minor points

jalmagro commented 7 years ago

" Experimental validation (Table 2). The mixed samples used are well known, therefore the choice of reference genomes for the reference panel and the samples for “PLAF” is obvious. How would that principle extend to unknown mixtures? The performance of the tool with “unknown” simulated datasets and a larger number of different strains used for the PLAF would be crucial to know. Also, because of the MCMC sampling, the percentages shown could vary when re-run with the same parameters. Instead of single values, distributions (e.g. means and variances) need to be shown."

This is related to some of reviewer 1's comments; they are asking for a better assessment of accuracy when little is known about the origin population or when data are scarce. We can support our approach, again, by pointing out that we are releasing large amounts of data (Pf3k and the next data release, 7K samples), so building reference panels shouldn't be a big concern in the majority of cases (same for PLAF estimation).

The second comment goes back to our discussion about showing the posterior distribution for our estimates.
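
For what the reviewer is asking, a rough sketch of how re-run variability could be summarised is below. It assumes the strain proportions from each independent run have already been collected into plain Python lists; the function name `summarize_runs` and the numbers in the usage example are made up for illustration, and nothing here reads dEploid's actual output files. Proportions are sorted within each run so that strains are compared by rank, which is one simple (assumed) way around strains being labelled in a different order between runs.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """Return a (mean, standard deviation) pair per strain rank.

    `runs` is a list of per-run proportion lists, e.g. one list per
    independent MCMC run on the same sample (hypothetical input format).
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    # Sort each run from largest to smallest proportion so that strains
    # are aligned by rank rather than by an arbitrary per-run label.
    aligned = [sorted(run, reverse=True) for run in runs]
    k = len(aligned[0])
    if any(len(run) != k for run in aligned):
        raise ValueError("all runs must report the same number of strains")
    # zip(*aligned) groups the values for each rank across all runs.
    return [(mean(col), stdev(col)) for col in zip(*aligned)]


if __name__ == "__main__":
    # Made-up proportions from three re-runs of a two-strain mixture.
    example_runs = [[0.25, 0.75], [0.70, 0.30], [0.80, 0.20]]
    for rank, (m, s) in enumerate(summarize_runs(example_runs), start=1):
        print(f"strain rank {rank}: mean={m:.3f}, sd={s:.3f}")
```

Sorting by rank is only reasonable when the proportions are well separated; for near-equal mixtures the ranks themselves can swap between runs and a different alignment would be needed.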

jalmagro commented 7 years ago

"Application to other species (Discussion). It is not clear how the concept can be applied to data of species from different biological domains like stated in the discussion: “bacterial or viral pathogens”. To use dEploid for other organisms the composition of the populations would be required to construct a reasonable PLAF matrix. In an attempt to apply dEploid to bacterial data with a PLAF and panel constructed from 26 reference genomes, we were able to retrieve the relative abundance of the most abundant sample in mixtures of up to 3 strains (min. 10%, max 80%) in most cases. The results (Fig. 1) were varying strongly when re-running the tool on the same dataset. The determination of multiplicity of infections and the haplotype reconstruction were not successful."

This is dangerous territory. The best solution would be to ask for the data and see what the problem is. A lot of extra work for us...