CAMI-challenge / AMBER

AMBER: Assessment of Metagenome BinnERs
https://cami-challenge.github.io/AMBER/
GNU General Public License v3.0

"nan" values in Purity (bp) and Purity (seq) #39

apcamargo closed this issue 4 years ago

apcamargo commented 4 years ago

Hi,

I'm using AMBER to evaluate a set of bins I obtained from a metagenome assembled from the CAMI Toy Mouse Gut Dataset reads. I've noticed that some bins have nan values in the Purity (bp) and Purity (seq) columns. What might be causing that?

To build the gold standard I aligned the reassembled contigs to the original genomes using BLAST, as described in Vamb's paper:

> We removed any hits shorter than 500 bp or with lower nucleotide identity than 95%. If a query (reassembled) contig was aligned to multiple reference (original) contigs, we accepted the reference with the longest alignment, if the alignment was more than twice as long of the next longest. If that was not the case for any reference, we accepted the reference with highest nucleotide identity, if the reference was longer than 10 kbp, had an alignment length of at least 90% of the longest-aligning reference, and had at least 0.05% higher nucleotide identity than the second-highest identity reference. If no reference fit those criteria, they were ignored in the benchmarking.
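The selection rules quoted above can be sketched roughly as follows. This is only an illustration, not Vamb's or AMBER's actual code: the `Hit` record and `pick_reference` function are hypothetical names, each reference is represented by its longest surviving hit, and "0.05% higher" is read as 0.05 percentage points. All of these are assumptions.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record, fields chosen for illustration)."""
    ref: str         # reference (original) contig
    aln_len: int     # alignment length in bp
    identity: float  # nucleotide identity in percent (0-100)
    ref_len: int     # length of the reference contig in bp


def pick_reference(hits: List[Hit]) -> Optional[str]:
    """Return the accepted reference for one query contig, or None."""
    # Remove hits shorter than 500 bp or below 95% identity.
    hits = [h for h in hits if h.aln_len >= 500 and h.identity >= 95.0]
    if not hits:
        return None

    # Assumption: keep only the longest hit per reference.
    best = {}
    for h in hits:
        if h.ref not in best or h.aln_len > best[h.ref].aln_len:
            best[h.ref] = h
    by_len = sorted(best.values(), key=lambda h: h.aln_len, reverse=True)
    if len(by_len) == 1:
        return by_len[0].ref

    # Rule 1: longest alignment wins if more than twice the next longest.
    if by_len[0].aln_len > 2 * by_len[1].aln_len:
        return by_len[0].ref

    # Rule 2: otherwise fall back to the highest-identity reference,
    # subject to the three conditions from the quoted text.
    by_id = sorted(by_len, key=lambda h: h.identity, reverse=True)
    top, runner_up = by_id[0], by_id[1]
    if (top.ref_len > 10_000
            and top.aln_len >= 0.9 * by_len[0].aln_len
            and top.identity - runner_up.identity >= 0.05):
        return top.ref

    # No reference fits: the contig is ignored in the benchmarking.
    return None
```

Contigs for which the function returns `None` would simply be left out of the gold standard mapping.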

| Bin ID | Most abundant genome | Purity (bp) | Completeness (bp) | Bin size (bp) | True positives (bp) | True size of most abundant genome (bp) | Purity (seq) | Completeness (seq) | Bin size (seq) | True positives (seq) | True size of most abundant genome (seq) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mouse_gut_5.vamb.83 | denovo8255.1 | 0.659 | 0.983 | 741631 | 488937 | 497562 | 0.655 | 0.980 | 226 | 148 | 151 |
| mouse_gut_5.vamb.9 | 269125.1 | 0.998 | 0.977 | 1809513 | 1806528 | 1848507 | 0.994 | 0.977 | 172 | 171 | 175 |
| mouse_gut_5.vamb.81 | 179513.0 | 1.000 | 0.963 | 807379 | 807379 | 838686 | 1.000 | 0.962 | 280 | 280 | 291 |
| mouse_gut_5.vamb.126 | 228785.0 | 1.000 | 0.960 | 308660 | 308660 | 321520 | 1.000 | 0.954 | 103 | 103 | 108 |
| mouse_gut_5.vamb.72 | 661259.1 | 1.000 | 0.940 | 241230 | 241230 | 256496 | 1.000 | 0.944 | 85 | 85 | 90 |
| mouse_gut_5.vamb.182 | 259993.0 | 1.000 | 0.922 | 647949 | 647949 | 702765 | 1.000 | 0.917 | 211 | 211 | 230 |
| mouse_gut_5.vamb.454 | denovo11208.0 | 0.525 | 0.919 | 1065913 | 559173 | 608144 | 0.531 | 0.895 | 305 | 162 | 181 |
| mouse_gut_5.vamb.11 | 133719.0 | 0.760 | 0.915 | 534069 | 405959 | 443893 | 0.748 | 0.888 | 127 | 95 | 107 |
| mouse_gut_5.vamb.111 | denovo11993.0 | 1.000 | 0.875 | 636280 | 636280 | 727174 | 1.000 | 0.882 | 194 | 194 | 220 |
| mouse_gut_5.vamb.793 | 4471135.0 | 0.992 | 0.863 | 1913155 | 1898009 | 2200037 | 0.990 | 0.845 | 583 | 577 | 683 |
| mouse_gut_5.vamb.333 | denovo12532.0 | nan | 0.857 | 202574 | 199887 | 233197 | nan | 0.852 | 76 | 75 | 88 |
| mouse_gut_5.vamb.51 | denovo2465.0 | nan | 0.848 | 218613 | 218613 | 257907 | nan | 0.863 | 82 | 82 | 95 |
| mouse_gut_5.vamb.115 | denovo1032.0 | nan | 0.816 | 44891 | 24716 | 30297 | nan | 0.750 | 11 | 6 | 8 |
| mouse_gut_5.vamb.71 | denovo10679.0 | 0.333 | 0.787 | 654629 | 218217 | 277372 | 0.341 | 0.622 | 82 | 28 | 45 |
| mouse_gut_5.vamb.451 | denovo11206.0 | 0.893 | 0.761 | 1096925 | 979172 | 1287345 | 0.872 | 0.718 | 298 | 260 | 362 |
| mouse_gut_5.vamb.428 | denovo2609.0 | 0.998 | 0.733 | 1077877 | 1075782 | 1468385 | 0.997 | 0.717 | 325 | 324 | 452 |
| mouse_gut_5.vamb.1036 | 263992.0 | nan | 0.231 | 66518 | 41954 | 181693 | nan | 0.242 | 23 | 15 | 62 |

fernandomeyer commented 4 years ago

Hello, the purity of the smallest bins is set to nan if you're using the --filter option, e.g. --filter 1. In that case, the smallest bins that together make up 1% of the data binned by a binner are "removed", which in practice means that their purity is set to nan and they are excluded from the average purity.
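As a rough illustration of the behaviour described above (not AMBER's actual implementation), the sketch below masks the purity of the smallest bins until they account for the given percentage of the binned base pairs, then averages the purity over the remaining bins. The function names, the use of bin size in bp, the `<=` cut-off at the boundary, and the toy numbers are all assumptions.

```python
import math


def mask_smallest_bins(bins, filter_pct):
    """Set purity to nan for the smallest bins.

    bins: dict mapping bin ID -> (bin size in bp, purity).
    filter_pct: percentage of the binned data to filter, e.g. 1 for 1%.
    Returns a dict mapping bin ID -> purity (nan for filtered bins).
    """
    total = sum(size for size, _ in bins.values())
    budget = total * filter_pct / 100.0
    masked = {}
    running = 0
    # Walk the bins from smallest to largest, accumulating their sizes.
    for bin_id, (size, purity) in sorted(bins.items(), key=lambda kv: kv[1][0]):
        running += size
        # Assumption: a bin is filtered while the running total stays
        # within the budget.
        masked[bin_id] = math.nan if running <= budget else purity
    return masked


def average_purity(purities):
    """Average purity over bins whose purity is not nan."""
    kept = [p for p in purities if not math.isnan(p)]
    return sum(kept) / len(kept) if kept else math.nan


# Toy data (made up for illustration): bin ID -> (size in bp, purity).
toy_bins = {"a": (9000, 1.0), "b": (900, 0.5), "c": (50, 0.2), "d": (40, 0.1)}
masked = mask_smallest_bins(toy_bins, filter_pct=1)
```

In this toy example, bins `c` and `d` together fall within 1% of the binned base pairs, so their purity becomes nan and only `a` and `b` contribute to the average.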

apcamargo commented 4 years ago

Thanks!