Evaluating Gold standard encountered RuntimeWarning: overflow

CAMI-challenge / AMBER

AMBER: Assessment of Metagenome BinnERs

https://cami-challenge.github.io/AMBER/

GNU General Public License v3.0


Evaluating Gold standard encountered RuntimeWarning: overflow #50

Closed yazhinia closed 1 year ago

yazhinia commented 1 year ago

Hello,

An AMBER run on my dataset showed a numerical overflow:

evaluating Gold standard (sample marine, genome binning)
~/.local/lib/python3.8/site-packages/src/binning_classes.py:306: RuntimeWarning: overflow encountered in long_scalars
  return (n * (n - 1)) / 2.0

This comes from the function compute_rand_index in binning_classes.py. What could be the reason for this warning?

Thanks.

Best, Yazhini

fernandomeyer commented 1 year ago

Thanks for reporting this. This happens when the number of sequences is very large, typically when binning reads, because the code cannot handle very large numbers. It is unlikely to happen when binning contigs. A quick fix is to replace return (n * (n - 1)) / 2.0 with return math.comb(n, 2) in binning_classes.py, but then the overflow occurs in other lines of the code. I don't have a full solution to this yet, but I will keep this issue updated. The good news is that this only affects the Rand index metric.
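For illustration, the overflow can be reproduced outside AMBER. This is a minimal sketch, not AMBER's code: the function names choose2_float and choose2_exact and the value of n are hypothetical, and it assumes n arrives as a numpy.int64 scalar, which is what the "long_scalars" wording of the warning suggests.

```python
import math
import warnings

import numpy as np


def choose2_float(n):
    # Expression quoted in the warning; the product is computed in n's own type.
    return (n * (n - 1)) / 2.0


def choose2_exact(n):
    # int(n) yields a Python int, which has arbitrary precision.
    return math.comb(int(n), 2)


# Hypothetical count, large enough that n * (n - 1) exceeds the int64 range.
n = np.int64(10_000_000_000)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    wrapped = choose2_float(n)  # int64 product wraps around

exact = choose2_exact(n)
print("int64 arithmetic:", wrapped)
print("exact:           ", exact)
print("warnings:", [str(w.message) for w in caught])
```

The exact result is a Python int. Multiplying it with other Python ints stays exact, but combining it with numpy integer sums can overflow again, which may be why the warning moves to other lines after the quick fix.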

yazhinia commented 1 year ago

Thanks for the quick reply. In fact, I am working on binning contigs. Would my input cause this issue?

amber.py allsample_bins -g binning_gs.tsv -o .

allsample_bins looks like this

@Version:0.9.1

@SampleID:marine

@@SEQUENCEID    BINID
S0C251206   1
S0C377455   1
S0C883820   1
S0C902373   1
S0C1092500  1

and

binning_gs.tsv (gold standard) looks like this

@Version:0.9.1

@SampleID:marine

@@SEQUENCEID    BINID   TAXID   _LENGTH
S0C878739   Otu255  62322   653
S0C700632   Otu229.0    300231  879
S0C763456   Otu1227 129337  263
S0C367135   Otu405.0    1238    358
S0C429264   Otu937  40269   264
S0C261835   Otu1920 1094    547
S0C1156244  Otu889.0    52959   150

I have combined data from all samples.

yazhinia commented 1 year ago

Note that when I change (n * (n-1)) / 2.0 to math.comb(n, 2), the same warning (RuntimeWarning: overflow encountered in long_scalars) is raised at the following line instead:

temp = (bin_comb * mapping_comb / num_bp_comb) if num_bp_comb != 0 else .0

fernandomeyer commented 1 year ago

Your example above leads to a different error, caused by a division by zero, because none of the sequences in allsample_bins can be found in the gold standard binning_gs.tsv.
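A quick way to check this before running AMBER is to compare the sequence IDs of the two files. The sketch below is a hypothetical helper, not part of AMBER. It assumes header lines start with "@" and that the sequence ID is the first whitespace-separated field, as in the excerpts above, and it uses the sample lines from this thread as input.

```python
def read_sequence_ids(lines):
    """Collect the first column of every non-header, non-empty line."""
    ids = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("@"):
            continue  # skip blank lines and @ / @@ header lines
        ids.add(line.split()[0])
    return ids


# Sample lines copied from this thread (tab-separated here by assumption).
SAMPLE_BINS = """\
@Version:0.9.1
@SampleID:marine
@@SEQUENCEID\tBINID
S0C251206\t1
S0C377455\t1
S0C883820\t1
"""

SAMPLE_GOLD = """\
@Version:0.9.1
@SampleID:marine
@@SEQUENCEID\tBINID\tTAXID\t_LENGTH
S0C878739\tOtu255\t62322\t653
S0C700632\tOtu229.0\t300231\t879
S0C763456\tOtu1227\t129337\t263
"""

predicted = read_sequence_ids(SAMPLE_BINS.splitlines())
gold = read_sequence_ids(SAMPLE_GOLD.splitlines())

shared = predicted & gold
missing = predicted - gold
print(f"{len(shared)} shared IDs, {len(missing)} predicted IDs missing from the gold standard")
```

For real files, the same function can be fed an open file handle instead of the sample strings.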

yazhinia commented 1 year ago

Sorry, those are just sample lines. I rechecked my input and all contig IDs are present in the binning_gs.tsv file. I only see this error for a large dataset. I am not sure what the cause is.

Thanks again.

yazhinia commented 1 year ago

Hi, I found the reason: when I use assembled contigs from each sample, the sum of the genome lengths is above 10 billion. When the choose2(n) function (in binning_classes.py) is called with this value, a numerical overflow occurs. As a result, the variable num_bp_comb holds a wrong value, which affects the ARI calculation.

So, if I understand correctly, the CAMI-II binning assessment used contigs only from the pooled assembly, not from single-sample assemblies. For pooled data, I don't see the error, as the numbers are within the range that np.int64 can handle.
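The arithmetic behind this can be checked with plain Python integers. The sketch below assumes the overflow comes from the product n * (n - 1) being evaluated as a signed 64-bit integer, whose maximum is 2**63 - 1. The helper name product_fits_int64 is hypothetical.

```python
import math

INT64_MAX = 2**63 - 1


def product_fits_int64(n):
    """True if n * (n - 1) is representable as a signed 64-bit integer."""
    return n * (n - 1) <= INT64_MAX


# Largest n that still fits: n * (n - 1) <= M  is equivalent to
# (2n - 1)**2 <= 4M + 1, so n <= (1 + isqrt(4M + 1)) / 2.
limit = (1 + math.isqrt(4 * INT64_MAX + 1)) // 2

print("largest safe n:", limit)
print("10 billion fits:", product_fits_int64(10_000_000_000))
```

Under that assumption the limit is roughly 3 billion, so a total of more than 10 billion base pairs is well past it, while smaller totals stay within range.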

fernandomeyer commented 1 year ago

Yes, in CAMI II, genome binning is of contigs from pooled assemblies, as described.

Commit 1ffebfb0338c003728b74bba852c384ff5844850 in the dev branch fixes the overflow. It's not an ideal solution because the sums are not done with a numpy function, but it seems to work.
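As an illustration of what summing without numpy can look like, here is a minimal sketch of the standard adjusted Rand index computed with Python integers only. It is not the content of the commit. The function names and the contingency-table input are hypothetical, and the variables bin_comb, mapping_comb and num_bp_comb only borrow their names from the line quoted earlier in this thread.

```python
import math
from fractions import Fraction


def sum_choose2(counts):
    # Built-in sum over Python ints: exact for any magnitude.
    return sum(math.comb(int(c), 2) for c in counts)


def adjusted_rand_index(table):
    """table[i][j] = count (sequences or base pairs) shared by bin i and genome j."""
    bin_totals = [sum(row) for row in table]
    mapping_totals = [sum(col) for col in zip(*table)]

    index = sum_choose2(c for row in table for c in row)
    bin_comb = sum_choose2(bin_totals)
    mapping_comb = sum_choose2(mapping_totals)
    num_bp_comb = math.comb(sum(bin_totals), 2)

    if num_bp_comb == 0:
        return 0.0
    expected = Fraction(bin_comb * mapping_comb, num_bp_comb)
    max_index = Fraction(bin_comb + mapping_comb, 2)
    if max_index == expected:
        return 0.0  # degenerate case, e.g. everything in a single bin
    return float((index - expected) / (max_index - expected))


# Two perfectly separated bins of 5 billion base pairs each.
big = 5_000_000_000
print(adjusted_rand_index([[big, 0], [0, big]]))
```

Exact integers are slower than vectorized numpy sums, which matches the trade-off mentioned above. Fraction keeps the intermediate ratio exact, and the conversion to float happens only at the end.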

All python packages are being updated, so there will be further testing before the next AMBER release.

fernandomeyer commented 1 year ago

Solved in AMBER v2.0.4