what are the differences among "recall_avg_bp", "recall_avg_bp_cami1", "recall_avg_bp_sem", "recall_avg_bp_sem_cami1"

CAMI-challenge / AMBER

AMBER: Assessment of Metagenome BinnERs

https://cami-challenge.github.io/AMBER/

GNU General Public License v3.0


what are the differences among "recall_avg_bp", "recall_avg_bp_cami1", "recall_avg_bp_sem", "recall_avg_bp_sem_cami1" #60

Closed jiaojiaoguan closed 3 weeks ago

jiaojiaoguan commented 3 weeks ago

Dear authors,

I am confused about the differences among the metrics mentioned above. Could you explain them?

Best, jiaojiao

jiaojiaoguan commented 3 weeks ago

Hello, thanks for your response in #40. I just wanted to check whether my understanding is correct. For example, suppose there is a bin named bin1 that contains two contigs, contig1 and contig2, with the following information:

Contig    length   genome   genome_length
contig1   10bp     g1       50bp
contig2   100bp    g2       1000bp

contig1 is 10bp long and contig2 is 100bp long. "g" denotes a genome: the total length of g1 is 50bp and the total length of g2 is 1000bp.

In cami1, g2 would be assigned to bin1, since the most abundant contig in bin1 is contig2, which belongs to g2. But the completeness of g1 is 10/50 and the completeness of g2 is 100/1000. If the genome is assigned by completeness instead, g1 has the highest completeness, so the genome label of the bin is then g1.
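The two labelling rules in this toy example can be written out as a minimal Python sketch. It only reproduces the arithmetic of the example (base pairs per genome in bin1 versus completeness per genome); the variable names are made up for illustration, and it is not AMBER's actual implementation.

```python
# Toy data from the example: the two contigs placed in bin1.
contigs = [
    {"name": "contig1", "length": 10, "genome": "g1"},
    {"name": "contig2", "length": 100, "genome": "g2"},
]
# Total length of each genome in the ground truth.
genome_length = {"g1": 50, "g2": 1000}

# Base pairs of each genome that ended up in bin1.
bp_in_bin = {}
for contig in contigs:
    genome = contig["genome"]
    bp_in_bin[genome] = bp_in_bin.get(genome, 0) + contig["length"]

# Completeness of each genome within bin1: recovered bp / genome length.
completeness = {g: bp / genome_length[g] for g, bp in bp_in_bin.items()}

# Rule 1 (assumed): label the bin by the genome contributing the most bp.
most_abundant = max(bp_in_bin, key=bp_in_bin.get)
# Rule 2 (assumed): label the bin by the genome with the highest completeness.
most_complete = max(completeness, key=completeness.get)

print(most_abundant, most_complete, completeness)
```

With these numbers the first rule picks g2 (100bp versus 10bp) and the second picks g1 (10/50 versus 100/1000), which is the discrepancy described above.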

jiaojiaoguan commented 3 weeks ago

Assuming the understanding above is correct, I ran an experiment. I have a total of 355 contigs and 4 genomes, and I assigned each contig a cluster id. The assignments are in "cami2_marine_each_contig_each_cluster.csv" and the ground truth is in "cami_2_virus_gsa.csv". I used AMBER 2.0.7 to calculate the results and put them in "results.csv".

I think recall_avg_seq_cami1 and recall_avg_seq should both be "4/355", but the results are different. What is the reason?

fernandomeyer commented 3 weeks ago

recall_avg_seq_cami1 is computed as the average over all predicted bins (with 0s for unmapped genomes), whereas recall_avg_seq is computed as the average over genomes, using for each genome only the single predicted bin that contains the largest number of its base pairs. Therefore, they cannot be the same in your example.

This is how recall_avg_bp and recall_avg_seq are computed in your example (last row), which contains 4 genomes:

genome                               genome_length  # sequences  predicted_bin  bin_length  recall_bp   recall_seq
RNODE_531_length_9868_cov_8.44904    979599         100          cluster_152    9823        0.010028    0.01
RNODE_407_length_46658_cov_49.66972  464903         21           cluster_7      46618       0.100275    0.047619
RNODE_207_length_2195_cov_3.75886    112802         134          cluster_64     2144        0.019007    0.007463
RNODE_451_length_4052_cov_6.62773    390319         100          cluster_32     3919        0.010041    0.01
average                                                                                     0.03483775  0.0187705

recall_avg_bp_cami1 and recall_avg_seq_cami1 average over more bins.
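The averages in the last row of the table can be checked with a short Python sketch. It assumes, based on the values shown, that recall_bp is bin_length divided by genome_length and that each predicted bin holds exactly one sequence of its genome (so recall_seq is 1 divided by the number of sequences). It is a check of the arithmetic only, not the code AMBER runs.

```python
# Rows copied from the table: (genome, genome_length, n_sequences, predicted_bin, bin_length)
rows = [
    ("RNODE_531_length_9868_cov_8.44904", 979599, 100, "cluster_152", 9823),
    ("RNODE_407_length_46658_cov_49.66972", 464903, 21, "cluster_7", 46618),
    ("RNODE_207_length_2195_cov_3.75886", 112802, 134, "cluster_64", 2144),
    ("RNODE_451_length_4052_cov_6.62773", 390319, 100, "cluster_32", 3919),
]

# Assumption: all bp in the predicted bin belong to the genome it is matched to.
recall_bp = [bin_len / genome_len for _, genome_len, _, _, bin_len in rows]
# Assumption: the predicted bin contains exactly one of the genome's sequences.
recall_seq = [1 / n_seqs for _, _, n_seqs, _, _ in rows]

# Average over the 4 genomes, one best bin per genome.
recall_avg_bp = sum(recall_bp) / len(rows)
recall_avg_seq = sum(recall_seq) / len(rows)

print(f"recall_avg_bp  = {recall_avg_bp:.4f}")
print(f"recall_avg_seq = {recall_avg_seq:.4f}")
```

The results agree with the table's averages (about 0.0348 and 0.0188) up to rounding of the per-genome values.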

jiaojiaoguan commented 3 weeks ago

Thanks very much for your detailed reply!