Reviewer 2 comment

shajoezhu commented 7 years ago

Comments to the Author

The reconstruction of genomic sequences from mixed populations of pathogens from NGS data is highly relevant, as the most abundant haplotype not necessarily is most relevant or explains the infection phenotype. The ability to determine the multiplicity of infections, strain ratios and retrieving the haplotypes is highly wanted for surveillance and treatment of infectious diseases. Existing methods for this purpose are rare and limited, which results in high demand for new and better methods in this area. I therefore support the publication of this manuscript in principle. It presents an interesting method and is applicable to data of a highly relevant pathogen. However, the limitation to Plasmodium falciparum is also a fundamental problem. In bioinformatics, methods need to be as generic as possible. It would be impracticable to develop specific methods for the genotyping of each individual pathogen. The manuscript and the bioinformatics community would very much benefit from additional data (based on simulation and real NGS reads), which indicates the performance of dEploid for other species (see below).

Major points

[x] 1. Experimental validation (Table 2). The mixed samples used are well known, therefore the choice of reference genomes for the reference panel and the samples for “PLAF” is obvious. How would that principle extend to unknown mixtures? The performance of the tool with “unknown” simulated datasets and a larger number of different strains used for the PLAF would be crucial to know. Also, because of the MCMC sampling, the percentages shown could vary when re-run with the same parameters. Instead of single values, distributions (e.g. means and variances) need to be shown.
1. We are releasing tons of data (Pf3k and the next data release, 7K samples), so building reference panels shouldn't be a big concern for the majority of cases (same for PLAF estimation).
2. The percentages shown could vary when re-run with the same parameters. This is a very good point. Yes, the proportion values do vary when the program is rerun. We repeat the deconvolution 30 times and show how the estimate of the effective number of strains varies. Addressed by REV2.1.
[x] 2. Application to other species (Discussion). It is not clear how the concept can be applied to data of species from different biological domains like stated in the discussion: “bacterial or viral pathogens”. To use dEploid for other organisms the composition of the populations would be required to construct a reasonable PLAF matrix. In an attempt to apply dEploid to bacterial data with a PLAF and panel constructed from 26 reference genomes, we were able to retrieve the relative abundance of the most abundant sample in mixtures of up to 3 strains (min. 10%, max 80%) in most cases. The results (Fig. 1) were varying strongly when re-running the tool on the same dataset. The determination of multiplicity of infections and the haplotype reconstruction were not successful.
- In our experience, we use all available allele frequencies to compute the PLAF. Because P. falciparum is highly diverse across geographical regions, we compute the PLAF and build reference panels separately for seven geographical regions when analyzing Pf3k field samples. For other species and datasets, we suspect that a more suitable filtering step is needed before deconvolution. However, this is difficult to anticipate in practice without some data exploration. In the supplement, we provide examples of how the filtering step works in our experiments, and we hope they will inspire filtering steps for other organisms. The supplement also shows examples of adjusting the parameter sigma to improve the deconvolution of very imbalanced samples.
- In an attempt to deconvolute Plasmodium vivax data (Pearson et al., 2016), we found that DEploid works well for most samples. However, it struggles with samples that have both low coverage and high inbreeding. We have developed a new method to address this, implemented with the "-ibd" flag. We are preparing another manuscript on the new method and its application.
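
For readers adapting this to another organism, here is a minimal sketch of the per-region PLAF computation described above. It assumes the PLAF at a site is the pooled alternative-allele read count divided by the pooled total read count across all samples in a region; that formula, the function name `compute_plaf_by_region`, and the input layout are illustrative assumptions, not the code used for the manuscript.

```python
def compute_plaf_by_region(samples):
    """Pool read counts per region and return {region: [PLAF per site]}.

    `samples` is a hypothetical input layout: a list of dicts with keys
    'region' (str), 'ref' and 'alt' (lists of read counts, one per site).
    Assumption: PLAF = sum(alt) / (sum(ref) + sum(alt)) across samples.
    """
    ref_total = {}
    alt_total = {}
    for sample in samples:
        region = sample["region"]
        n_sites = len(sample["ref"])
        ref = ref_total.setdefault(region, [0] * n_sites)
        alt = alt_total.setdefault(region, [0] * n_sites)
        for i in range(n_sites):
            ref[i] += sample["ref"][i]
            alt[i] += sample["alt"][i]

    plaf = {}
    for region in ref_total:
        plaf[region] = [
            # Sites with no reads in the region get a PLAF of 0.0.
            a / (r + a) if (r + a) > 0 else 0.0
            for r, a in zip(ref_total[region], alt_total[region])
        ]
    return plaf
```

Keeping the regions separate in the returned dictionary mirrors the choice above to build one panel per geographical region rather than a single pooled one.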

Minor points

[x] 1. Other sequencing technologies. As the error rate can be adjusted in dEploid, how well would the tool perform on data originating from different sequencing technologies (e.g. PacBio or Oxford Nanopore Technologies)?
- Thanks for this. We are in fact already working with ONT data. We address this in the Discussion, tagged REV2.3.
[x] 2. InDels and structural variants. When reconstructing haplotypes, indels and structural variation also need to be considered, while dEploid only reconstructs SNPs. This should be addressed in the discussion.
- We address this in the Discussion, tagged REV2.4.

jalmagro commented 7 years ago

" Experimental validation (Table 2). The mixed samples used are well known, therefore the choice of reference genomes for the reference panel and the samples for “PLAF” is obvious. How would that principle extend to unknown mixtures? The performance of the tool with “unknown” simulated datasets and a larger number of different strains used for the PLAF would be crucial to know. Also, because of the MCMC sampling, the percentages shown could vary when re-run with the same parameters. Instead of single values, distributions (e.g. means and variances) need to be shown."

This is related to some of reviewer 1's comments: they are asking for a better assessment of accuracy when little is known about the origin population or when data are scarce. We can support our approach, again, by pointing out that we are releasing tons of data (Pf3k and the next data release, 7K samples), so building reference panels shouldn't be a big concern for the majority of cases (same for PLAF estimation).

The second comment goes back to our discussion about showing the posterior distribution for our estimates.
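
As a rough illustration of reporting distributions rather than single values, the sketch below summarises strain proportions from repeated deconvolution runs. It assumes each run yields one list of proportions with the same number of strains, and it assumes the effective number of strains is the inverse sum of squared proportions; that definition and the helper names `effective_k` and `summarise_runs` are assumptions made here, not taken from the manuscript code.

```python
import statistics


def effective_k(proportions):
    """Effective number of strains, assumed here to be 1 / sum(p_i^2)."""
    return 1.0 / sum(p * p for p in proportions)


def summarise_runs(runs):
    """Mean and sample variance of ranked proportions across repeated runs.

    `runs` is a list of proportion lists, one per run (at least two runs).
    Each run is sorted in descending order first, because strain labels
    are arbitrary and can switch between independent MCMC runs.
    """
    ranked = [sorted(run, reverse=True) for run in runs]
    n_strains = len(ranked[0])
    by_rank = [[run[i] for run in ranked] for i in range(n_strains)]
    k_eff = [effective_k(run) for run in ranked]
    return {
        "mean": [statistics.mean(values) for values in by_rank],
        "variance": [statistics.variance(values) for values in by_rank],
        "effective_k_mean": statistics.mean(k_eff),
        "effective_k_variance": statistics.variance(k_eff),
    }
```

With 30 repeats per sample, `summarise_runs` would receive 30 proportion lists and return one mean and one variance per ranked strain, plus the spread of the effective number of strains.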

jalmagro commented 7 years ago

"Application to other species (Discussion). It is not clear how the concept can be applied to data of species from different biological domains like stated in the discussion: “bacterial or viral pathogens”. To use dEploid for other organisms the composition of the populations would be required to construct a reasonable PLAF matrix. In an attempt to apply dEploid to bacterial data with a PLAF and panel constructed from 26 reference genomes, we were able to retrieve the relative abundance of the most abundant sample in mixtures of up to 3 strains (min. 10%, max 80%) in most cases. The results (Fig. 1) were varying strongly when re-running the tool on the same dataset. The determination of multiplicity of infections and the haplotype reconstruction were not successful."

This is dangerous territory. The best solution would be to ask for the data and see what the problem is. A lot of extra work for us...

DEploid-dev / dEploidPaper

Reviewer 2 comment #12