Question about the vRhyme paper

Nianzhen-GU commented 2 years ago

Hi,

I have a question about the vRhyme paper. The Score processing section says, 'Each binning iteration is given a score I according to protein redundancy, total bins, and the number of scaffolds binned'. What is a binning iteration? Does it correspond to the grid search method mentioned earlier?

Thank you!

KrisKieft commented 2 years ago

1. Sequences are compared by coverage ratios using an effect size metric. Between any pair of sequences, this yields one effect size per sample, and these are aggregated into a single value.
2. Sequences are compared by nucleotide composition. Between any pair of sequences, the same-genome vs. different-genome probabilities from the machine learning models are aggregated into a single value.
3. The effect size and nucleotide composition values are used to build weighted networks.

Each of steps 1-3 above has its own tunable parameters, for example whether the cutoff for a significant effect size is set to 0.7 or 0.5. The same goes for the machine learning probabilities and the network edge weights. So, to answer your question: rather than trying to find one set of parameters that fits every dataset, which is generally not possible, vRhyme bins over many iterations to find the best fit. Each iteration uses a different combination of cutoff values for the 3 steps above, and the parameters are effectively optimized for your dataset by selecting the iteration that performs best. Since binning is not a supervised approach (we don't know the true answer), vRhyme uses a scoring method to rank the iterations.
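
To make the idea concrete, here is a minimal Python sketch of "bin once per cutoff combination, score each result, keep the best". It is not vRhyme's code. The toy data, the cutoff values, the way the edge weight is computed, and the score formula are all assumptions made up for illustration; the only thing taken from the paper is that the score depends on protein redundancy, total bins, and the number of scaffolds binned.

```python
from itertools import product

# Toy input (hypothetical): (scaffold_a, scaffold_b, coverage_effect_size, same_genome_probability)
PAIRS = [
    ("s1", "s2", 0.2, 0.95),
    ("s2", "s3", 0.6, 0.85),
    ("s4", "s5", 0.4, 0.90),
]
# Toy protein labels per scaffold; a label repeated within one bin counts as redundancy.
PROTEINS = {"s1": {"p1", "p2"}, "s2": {"p3"}, "s3": {"p1"}, "s4": {"p4"}, "s5": {"p5"}}


def bin_scaffolds(pairs, max_effect, min_prob, min_weight):
    """Link pairs that pass all three cutoffs; bins are the connected components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, effect, prob in pairs:
        # Assumed edge weight: high probability and small effect size give a strong edge.
        weight = prob * (1.0 - min(effect, 1.0))
        if effect <= max_effect and prob >= min_prob and weight >= min_weight:
            parent[find(a)] = find(b)

    bins = {}
    for scaffold in list(parent):
        bins.setdefault(find(scaffold), set()).add(scaffold)
    return list(bins.values())


def score_bins(bins, proteins):
    """Made-up score: reward binned scaffolds, penalize redundant proteins and bin count."""
    n_binned = sum(len(b) for b in bins)
    redundancy = 0
    for b in bins:
        labels = [p for scaffold in b for p in proteins[scaffold]]
        redundancy += len(labels) - len(set(labels))
    return n_binned - 2 * redundancy - len(bins)


# One "iteration" per combination of cutoffs (illustrative values, not vRhyme defaults).
grid = product([0.5, 0.7], [0.8, 0.9], [0.3, 0.5])
results = {params: score_bins(bin_scaffolds(PAIRS, *params), PROTEINS) for params in grid}

best_params = max(results, key=results.get)
print("best cutoffs:", best_params, "score:", results[best_params])
```

In this toy data, the loosest combination also links s3 to s1 and s2, which puts protein p1 into the same bin twice. That iteration scores lower than the stricter ones, which is the kind of trade-off the ranking is meant to capture without knowing the true answer.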

Nianzhen-GU commented 2 years ago

Thank you very much for your explanation!

AnantharamanLab / vRhyme

Question about the vRhyme paper #8