bxlab / bx-python

Tools for manipulating biological data, particularly multiple sequence alignments
MIT License
148 stars 54 forks source link

Exception: 2147483648 is larger than the maximum possible size (2147483647) #48

Open mr-c opened 5 years ago

mr-c commented 5 years ago

Hello,

Debian is not currently able to build bx-python on the armel, armhf, i386, mipsel architectures (amongst others) due to the following error:


======================================================================
ERROR: bx.interval_index_file_tests.test_interval_index_file
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.7_bx-python/build/bx/interval_index_file_tests.py", line 22, in test_interval_index_file
    ix.add( name, start, end, i, max=interval_index_file.MAX )
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.7_bx-python/build/bx/interval_index_file.py", line 276, in add
    self.indexes[name] = Index( max=max )
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.7_bx-python/build/bx/interval_index_file.py", line 346, in __init__
    self.new( min, max )
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.7_bx-python/build/bx/interval_index_file.py", line 364, in new
    self.offsets = offsets_for_max_size( max )
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.7_bx-python/build/bx/interval_index_file.py", line 136, in offsets_for_max_size
    raise Exception( "%d is larger than the maximum possible size (%d)" % ( max_size, BIN_OFFSETS_MAX[0] ) )
Exception: 2147483648 is larger than the maximum possible size (2147483647)

Example build log: https://buildd.debian.org/status/fetch.php?pkg=python-bx&arch=armel&ver=0.8.4-2&stamp=1564652009&raw=0

Should we skip this test for now?

jxtx commented 5 years ago

Interesting. Is sys.maxsize == (2**31 - 1) on these platforms? We are probably making a bad assumption here:

https://github.com/bxlab/bx-python/blob/master/lib/bx/interval_index_file.py#L126

That should probably be sys.maxsize. Sanity check @rsharris?

rsharris commented 5 years ago

--- Short answer ---

(1) I think line 120 is wrong, in two ways. First, it should be a power of two, and on my machine sys.maxsize is 2**63-1.

Second, assuming the index should be able to exist in a file, it should relate to the bytes allocated in the file format, not the system.

I'm pretty sure it should be 231. But ... if ">I" is a signed 32-bit int, then the max power of two it can hold is 230.

(2) I think the test at line 133 is probably wrong, and instead it should check for less-than-or-equal.

(3) At line 136, the text of the exception assumes BIN_OFFSETS_MAX[0] caused the rejection, but it should be BIN_OFFSETS_MAX[i].

(4) The comments describing the file layout are out of date with the code.

--- Long answer, less coherent ---

The original test was probably something I wrote. Not very fresh in my memory though (to say the least!).

I haven't been able to find my circa 2005 implementation of this. I did find the C implementation.

I think MAX is supposed to relate to the maximum interval end that will be storable in the file format. In my 2005 C implementation, it looks like this was called MAX_INTERVAL_END. I presume the file format has a fixed number of bytes in which to represent an interval end. Thus MAX should be independent of the system and instead should depend on the file format/layout/design.

Whether MAX should be 231, 231-1, or 2**32-1 ... I'd have to dig more into the code.

(digging)

It looks like at some point the file format has been enhanced beyond version 1 (line 105 and comment at line 124). But the comments describing the file format weren't completely updated, since they still claim version 1 (line 8 and lines 25 thru 81). That description matches my 2005 description.

At line 451 it writes an interval's start and end to the file without modification. So (assuming line 78 still accurately reflects the file format), we have 32 bits for the end value, and the maximum allowable end would be 2^31.

I don't understand why the test at line 133 is max_size<max instead of max_size<=max. In the comment at line 130, does "(0,max_size)" mean the interval is intended to be closed? The only similar reject message I find in my 2005 code rejects when something is > MAX_INTERVAL_END. But the test at line 133 rejects when something is >= MAX_INTERVAL_END.

SO ... I think MAX is correct, but that the test at line 133 is probably wrong.

It's also possible (probable) that line 120 is wrong. On my machine sys.maxsize is 2**63-1.

Bob H

On Sep 17, 2019, at 10:30 AM, James Taylor notifications@github.com wrote:

Interesting. Is sys.maxsize 2**31 - 1 on these platforms? We are probably making a bad assumption here: https://github.com/bxlab/bx-python/blob/master/lib/bx/interval_index_file.py#L126 https://github.com/bxlab/bx-python/blob/master/lib/bx/interval_index_file.py#L126 That should probably sys.maxsize. Sanity check @rsharris https://github.com/rsharris?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bxlab/bx-python/issues/48?email_source=notifications&email_token=AAFDOI55L2DY2IUQAE65JNDQKDSW5A5CNFSM4IWPAKJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD64XEOQ#issuecomment-532247098, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFDOI2J7CTWCHVRWBGPIY3QKDSW5ANCNFSM4IWPAKJA.

mr-c commented 5 years ago

@rsharris @jxtx This is keeping the new versions of the package from Debian's testing distribution. Shall we skip these tests, or is there a real bug on the architectures that needs fixing?

Re:

Is sys.maxsize == (2**31 - 1) on these platforms?

As @rsharris said, sys.maxsize is 2**63-1 on amd64, and 2**31-1 on i386, armel, armhf, and mipsel

rsharris commented 5 years ago

( Ignoring the confusion (which might only be mine) about whether BIN_OFFSETS_MAX[0] should be a power of 2. )

My two cents:

What I see is (a) a potential issue with a disagreement in field sizes, and (b) a potential issue of cross-platform portability.

What I mean by field sizes ... fields are written to the file as ">I", but the expectation during design may have been that these were 32 bits. If ">I" is a different size on some platforms there's a potential for bugs. Looking at the code it seems to calculate everything using calcsize(), so maybe this isn't a problem.

If, however, a file is written on a platform with one size ">I" and read on a platform with a different ">I", then obviously the whole thing will fail. I don't know whether those sizes vary on different platforms, but I expect that they would.

It could be there's no intent to support cross-platform portability. If so, it seems like the size of ">I" should be indicated in the file header so that readers could reject files with the wrong ">I" size.

mr-c commented 4 years ago

I made a PR making many of the changes suggested above at https://github.com/bxlab/bx-python/pull/54