Closed jiaojiaoguan closed 3 weeks ago
Hello, thanks for your response in #40. I just wanted to check whether my understanding is correct. For example, take a bin named bin1 that includes two contigs, contig1 and contig2. The information is below:
Contig   length  genome  genome_length
contig1  10bp    g1      50bp
contig2  100bp   g2      1000bp
contig1 is 10bp long and contig2 is 100bp long. "g" stands for genome: the total length of g1 is 50bp and the total length of g2 is 1000bp.
In cami1, we would assign g2 to bin1, since the most abundant contig in bin1 is contig2, which belongs to g2. But the completeness of g1 is 10/50 and the completeness of g2 is 100/1000. If we assign the genome by completeness, g1 has the highest completeness, so the genome label of the bin is now g1.
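The two labelling rules described above can be checked on the toy bin with a few lines of Python. This is only a sketch of the rules as the question states them, not AMBER's implementation; the variable names are made up for illustration.

```python
# Toy data from the question: contig -> (contig length in bp, source genome).
contigs_in_bin1 = {
    "contig1": (10, "g1"),
    "contig2": (100, "g2"),
}
genome_length = {"g1": 50, "g2": 1000}

# Base pairs of each genome that ended up in bin1.
bp_per_genome = {}
for length, genome in contigs_in_bin1.values():
    bp_per_genome[genome] = bp_per_genome.get(genome, 0) + length

# Rule A (as described for cami1): the genome with the most base pairs in the bin.
label_by_abundance = max(bp_per_genome, key=bp_per_genome.get)

# Rule B: the genome with the highest completeness (bp in bin / genome length).
completeness = {g: bp / genome_length[g] for g, bp in bp_per_genome.items()}
label_by_completeness = max(completeness, key=completeness.get)

print(label_by_abundance, label_by_completeness, completeness)
```

With these numbers, rule A picks g2 (100bp against 10bp), while rule B picks g1 (10/50 = 0.2 against 100/1000 = 0.1).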
Assuming the understanding above is correct, I ran an experiment. I have a total of 355 contigs and 4 genomes, and I assigned each contig a cluster id. The file is "cami2_marine_each_contig_each_cluster.csv". The ground truth is "cami_2_virus_gsa.csv". I used amber 2.0.7 to calculate the results and put them in "results.csv".
Attached files: cami_2_virus_gsa.csv, cami2_marine_each_contig_each_cluster.csv, results.csv
I think recall_avg_seq_cami1 and recall_avg_seq should both be "4/355", but the results are different. What is the reason?
recall_avg_seq_cami1 is computed from the average over all predicted bins (and 0s for unmapped genomes), whereas recall_avg_seq is computed from the averages over the single predicted bins containing the largest number of base pairs of the genomes. Therefore, they cannot be the same in your example.
This is how recall_avg_bp and recall_avg_seq are computed in your example (last row), which contains 4 genomes:
genome genome_length # sequences predicted_bin bin_length recall_bp recall_seq
RNODE_531_length_9868_cov_8.44904 979599 100 cluster_152 9823 0.010028 0.01
RNODE_407_length_46658_cov_49.66972 464903 21 cluster_7 46618 0.100275 0.047619
RNODE_207_length_2195_cov_3.75886 112802 134 cluster_64 2144 0.019007 0.007463
RNODE_451_length_4052_cov_6.62773 390319 100 cluster_32 3919 0.010041 0.01
average--------------------> 0.03483775 0.0187705
recall_avg_bp_cami1 and recall_avg_seq_cami1 average over more bins.
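The averages in the table can be reproduced with a short Python sketch. It rests on two assumptions inferred from the numbers rather than stated in the thread: each listed bin consists only of base pairs from its mapped genome, so recall_bp = bin_length / genome_length, and each bin holds exactly one sequence of that genome, so recall_seq = 1 / # sequences. It is not AMBER's code.

```python
# Rows copied from the table: (genome, genome_length, n_sequences, predicted_bin, bin_length).
rows = [
    ("RNODE_531_length_9868_cov_8.44904", 979599, 100, "cluster_152", 9823),
    ("RNODE_407_length_46658_cov_49.66972", 464903, 21, "cluster_7", 46618),
    ("RNODE_207_length_2195_cov_3.75886", 112802, 134, "cluster_64", 2144),
    ("RNODE_451_length_4052_cov_6.62773", 390319, 100, "cluster_32", 3919),
]

# Assumption: every predicted bin contains a single sequence of its genome.
SEQS_IN_BIN = 1

# Per-genome recall in base pairs and in sequences.
recall_bp = [bin_len / g_len for _, g_len, _, _, bin_len in rows]
recall_seq = [SEQS_IN_BIN / n_seqs for _, _, n_seqs, _, _ in rows]

# Average over the genomes, one best bin per genome.
recall_avg_bp = sum(recall_bp) / len(rows)
recall_avg_seq = sum(recall_seq) / len(rows)

print(round(recall_avg_bp, 4), round(recall_avg_seq, 4))
```

The results agree with the "average" row to within rounding, because the table averages the already rounded per-genome values.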
Thanks very much for your detailed reply!
Dear authors,
I am confused about the differences among the metrics mentioned above. Could you give me some guidance?
Best, jiaojiao