BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

Interpreting the Validate.pl output #301

Open gavinmdouglas opened 3 years ago

gavinmdouglas commented 3 years ago

Hi there,

I have a few questions about how to interpret the output of Validate.pl.

In the "Complete Example v0.4" section of the documentation there is this description:

This gives the no. of contigs N clustered, the number with labels M, the number of unique labels S, the number of clusters K, the recall, the precision, the normalised mutual information (NMI), the Rand index, and the adjusted Rand index. It also generates a file called a confusion matrix with the frequencies of each species in each cluster.

What does the TL column refer to?

Also, how exactly are precision and recall calculated for the output, i.e. what are the positives and negatives being compared? The same question applies to the NMI and Rand values: is there a more detailed description somewhere of how these are calculated? I have found general descriptions of these indices, but I'm unsure exactly which observations are used to compute them.

Lastly, what are the units of the counts in the confusion matrix output table? I'm not sure what the frequency of each species refers to here. At first I thought it might be the number of contigs per bin assigned to each species, but the counts sum to far more than the total number of contigs.
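
To make the question concrete, here is a minimal sketch of one common way these quantities are derived from a cluster-by-species confusion matrix. It is an assumption about the general approach, not a description of what Validate.pl actually does: the function name `clustering_metrics`, the majority-vote definitions of precision and recall, and the arithmetic-mean normalisation of NMI are all illustrative choices, and the unit of the counts (contigs or something else) is left open.

```python
from math import comb, log

def clustering_metrics(table):
    """table[i][j] = integer count of items in cluster i carrying label j.

    Hypothetical helper; the definitions below are common textbook
    choices and may differ from those used by Validate.pl.
    Degenerate inputs (e.g. one cluster and one label) are not handled.
    """
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]         # cluster sizes
    col_sums = [sum(col) for col in zip(*table)]   # label sizes

    # Majority-vote precision: each cluster is credited with its dominant label.
    precision = sum(max(row) for row in table) / n
    # Majority-vote recall: each label is credited with its dominant cluster.
    recall = sum(max(col) for col in zip(*table)) / n

    # Pair counting for the Rand and adjusted Rand indices.
    same_both = sum(comb(x, 2) for row in table for x in row)
    same_cluster = sum(comb(r, 2) for r in row_sums)
    same_label = sum(comb(c, 2) for c in col_sums)
    total_pairs = comb(n, 2)

    rand = (total_pairs + 2 * same_both - same_cluster - same_label) / total_pairs
    expected = same_cluster * same_label / total_pairs
    max_index = (same_cluster + same_label) / 2
    adjusted_rand = (same_both - expected) / (max_index - expected)

    # Mutual information between cluster and label assignments.
    mi = sum(
        (x / n) * log(x * n / (row_sums[i] * col_sums[j]))
        for i, row in enumerate(table)
        for j, x in enumerate(row)
        if x > 0
    )
    h_cluster = -sum((r / n) * log(r / n) for r in row_sums if r > 0)
    h_label = -sum((c / n) * log(c / n) for c in col_sums if c > 0)
    # Normalised by the arithmetic mean of the two entropies (an assumption).
    nmi = 2 * mi / (h_cluster + h_label)

    return {
        "precision": precision,
        "recall": recall,
        "rand": rand,
        "adjusted_rand": adjusted_rand,
        "nmi": nmi,
    }
```

Under these definitions the "observations" are the individual counted items for precision, recall and NMI, and pairs of items for the Rand indices. Whether Validate.pl counts contigs or weights them in some way is exactly what I am unsure about.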

Thanks in advance,

Gavin