Closed yazhinia closed 1 year ago
Thanks for reporting this. It happens when the number of sequences is very large, typically when binning reads, as the code cannot handle very large numbers. It is unlikely to happen when binning contigs. A quick fix is to replace
return (n * (n - 1)) / 2.0
by
return math.comb(n, 2)
in binning_classes.py, but then the error occurs in other lines of the code. I don't have a full solution to this yet, but I will keep this issue updated.
The good news is that this only affects the Rand index metric.
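To illustrate why the replacement helps, here is a minimal sketch (the value of n is hypothetical, of the magnitude reported later in this thread): numpy's int64 overflows in the intermediate product n * (n - 1), while math.comb works on arbitrary-precision Python ints.

```python
import math
import numpy as np

n = np.int64(10_000_000_000)  # hypothetical count, ~10 billion

# int64 tops out near 9.2e18, so n * (n - 1) wraps before the division:
with np.errstate(over='ignore'):
    wrapped = (n * (n - 1)) / 2.0

# math.comb uses arbitrary-precision Python ints and stays exact:
exact = math.comb(int(n), 2)

print(wrapped)  # a wrapped, meaningless value
print(exact)    # 49999999995000000000
```

Note the int(n) conversion: passing a numpy scalar that large directly into integer arithmetic is exactly what triggers the wrap-around.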
Thanks for the quick reply. In fact, I am working on binning contigs. Would my input cause this issue?
amber.py allsample_bins -g binning_gs.tsv -o .
allsample_bins looks like this
@Version:0.9.1
@SampleID:marine
@@SEQUENCEID BINID
S0C251206 1
S0C377455 1
S0C883820 1
S0C902373 1
S0C1092500 1
and
binning_gs.tsv (gold standard) looks like this
@Version:0.9.1
@SampleID:marine
@@SEQUENCEID BINID TAXID _LENGTH
S0C878739 Otu255 62322 653
S0C700632 Otu229.0 300231 879
S0C763456 Otu1227 129337 263
S0C367135 Otu405.0 1238 358
S0C429264 Otu937 40269 264
S0C261835 Otu1920 1094 547
S0C1156244 Otu889.0 52959 150
I have combined data from all samples.
Note that when I change (n * (n - 1)) / 2.0 to math.comb(n, 2), a warning is raised at the line below:
RuntimeWarning: overflow encountered in long_scalars
temp = (bin_comb * mapping_comb / num_bp_comb) if num_bp_comb != 0 else .0
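The warning at that line has the same root cause: each count fits in int64 on its own, but their product does not. A minimal sketch with made-up counts of roughly this dataset's magnitude:

```python
import numpy as np

# Hypothetical pair counts; each fits in int64, their product does not.
bin_comb = np.int64(5_000_000_000)
mapping_comb = np.int64(4_000_000_000)
num_bp_comb = np.int64(6_000_000_000)

with np.errstate(over='ignore'):
    wrapped = bin_comb * mapping_comb / num_bp_comb  # int64 product wraps

# Converting to Python ints first keeps the product exact:
temp = (int(bin_comb) * int(mapping_comb) / int(num_bp_comb)) if num_bp_comb != 0 else .0
```

The two results differ by an order of magnitude here, which is how a silently wrapped product corrupts the downstream metric.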
Your example above leads to another error, resulting from a division by zero, because none of the sequences in allsample_bins can be found in the gold standard binning_gs.tsv.
Sorry, those are just sample lines. I rechecked my input and all contig IDs are present in the binning_gs.tsv file. I only see this error for a large dataset; I'm not sure what the cause is.
Thanks again.
Hi,
I found the reason: when I use assembled contigs from each sample, the summed genome length is above 10 billion. When the choose2(n) function (in binning_classes.py) is called with such a value, a numerical overflow occurs. As a result, the variable num_bp_comb holds a wrong value, which affects the ARI calculation.
So, if I understand correctly, the CAMI II binning assessment used contigs from the pooled assembly only, not single-sample assemblies. For the pooled data, I don't see the error, as the numbers stay within the range np.int64 can handle.
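A choose2 that is safe regardless of input magnitude could look like this (a sketch of the idea, not the function as shipped in binning_classes.py):

```python
import math

def choose2(n):
    # Convert to a Python int first: Python ints have arbitrary precision,
    # so this stays exact even when n comes from a numpy sum of genome
    # lengths above 10 billion, where n * (n - 1) would overflow int64.
    return math.comb(int(n), 2)

print(choose2(10_000_000_000))  # 49999999995000000000
```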
Yes, in CAMI II, genome binning is of contigs from pooled assemblies, as described.
Commit 1ffebfb0338c003728b74bba852c384ff5844850 in the dev branch fixes the overflow. It's not an ideal solution because the sums are not done with a numpy function, but it seems to work.
All python packages are being updated, so there will be further testing before the next AMBER release.
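A sketch of the general idea behind avoiding numpy's fixed-width sums (not the actual commit): note that Python's built-in sum over numpy scalars still accumulates in int64, so each element has to be converted first.

```python
import numpy as np

lengths = np.array([4_000_000_000, 3_500_000_000, 3_000_000_000], dtype=np.int64)

# sum(lengths) would still accumulate in np.int64; converting each element
# to a Python int gives an arbitrary-precision result.
total = sum(int(x) for x in lengths)

print(total)        # 10500000000
print(type(total))  # <class 'int'>
```

The trade-off mentioned above is speed: a Python-level sum is slower than a vectorized numpy reduction, but it cannot overflow.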
Solved in AMBER v2.0.4
Hello, an AMBER run on my dataset produced a numerical overflow:
evaluating Gold standard (sample marine, genome binning)
~/.local/lib/python3.8/site-packages/src/binning_classes.py:306: RuntimeWarning: overflow encountered in long_scalars
  return (n * (n - 1)) / 2.0
This comes from the function compute_rand_index in binning_classes.py. What could be the reason for this error? Thanks.
Best, Yazhini