Blosc / bcolz

A columnar data container that can be compressed.
http://bcolz.blosc.org

Will bcolz support numpy 1.9? #41

Closed talumbau closed 10 years ago

talumbau commented 10 years ago

We are interested in including bcolz in the upcoming Anaconda release, which will use numpy 1.9. Do you have plans to support this? I hope so, since it's a very useful package. Thanks!

esc commented 10 years ago

is there anything specific that isn't working right now?

ilanschnell commented 10 years ago

For example on a 32-bit Linux Centos 5 system, I get:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     0.7.1
NumPy version:     1.9.0
Blosc version:     1.4.1 ($Date:: 2014-07-08 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version:   not available (version >= 1.4.1 not detected)
Python version:    2.7.8 |Continuum Analytics, Inc.| (default, Aug 21 2014, 18:22:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform:          linux2-i686
Byte-ordering:     little
Detected cores:    1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing only a light (yet comprehensive) subset of the test suite.
If you want a more complete test, try passing the '-heavy' flag to this
script (or set the 'heavy' parameter in case you are using bcolz.test()
call).  The whole suite will take more than 30 seconds to complete on a
relatively modern CPU and around 300 MB of RAM and 500 MB of disk
[32-bit platforms will always run significantly more lightly].

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
......s......s......s......s.................................................\
.............................................................................\
.............................................................................\
...............................................................Segmentation fault
esc commented 10 years ago

meh :/

asmeurer commented 10 years ago

Note that if you guys can get this fixed by the end of the week, that would be the most helpful. Otherwise, we will have to ship a release of Anaconda that does not have bcolz, since we plan to include NumPy 1.9 in this release. We would also have to remove it as a hard dependency of blaze.

esc commented 10 years ago

I don't think that Blaze has a hard-dep on bcolz? IIRC it was removed with the recent Blaze release?

esc commented 10 years ago

I don't see bcolz in the Blaze deps:

https://github.com/ContinuumIO/blaze/blob/master/requirements.txt

esc commented 10 years ago

Also, I can't reproduce the segfault. Is there anything special you did?

zsh» python -c "import bcolz; bcolz.test()"
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     0.7.1
NumPy version:     1.9.0
Blosc version:     1.4.1 ($Date:: 2014-07-08 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version:   2.3.1
Python version:    2.7.8 |Anaconda 2.0.1 (64-bit)| (default, Aug 21 2014, 18:22:21) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform:          linux2-x86_64
Byte-ordering:     little
Detected cores:    4
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing only a light (yet comprehensive) subset of the test suite.
If you want a more complete test, try passing the '-heavy' flag to this
script (or set the 'heavy' parameter in case you are using bcolz.test()
call).  The whole suite will take more than 30 seconds to complete on a
relatively modern CPU and around 300 MB of RAM and 500 MB of disk
[32-bit platforms will always run significantly more lightly].

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
..........................................................................................................................................................................................ssssss
----------------------------------------------------------------------
Ran 806 tests in 3.813s

OK (skipped=6)

esc commented 10 years ago

Hmmm, maybe the 32-bit system?

ilanschnell commented 10 years ago

Yes, on 64-bit Linux it works.
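Since the failure only shows up on 32-bit builds, it can help to confirm the word size of the interpreter itself (not just the OS) before comparing test runs. A minimal sketch using only the standard library; the helper name `interpreter_bits` is made up for illustration:

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter in bits."""
    # struct.calcsize("P") is the size of a C pointer (void *) in bytes.
    return struct.calcsize("P") * 8


bits = interpreter_bits()
# sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
is_64bit = sys.maxsize > 2**32
print("pointer width: %d bits, 64-bit build: %s" % (bits, is_64bit))
```

A 32-bit Python installed on a 64-bit kernel reports 32 here, and that is the case that matters for the extension code.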

esc commented 10 years ago

Not sure if it is related to numpy 1.9, but I get the following on a Red Hat 32-bit AMI:

[ec2-user@ip-172-31-42-166 bcolz]$ python -c "import bcolz; bcolz.test()"  
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     0.7.2.dev
NumPy version:     1.9.0
Blosc version:     1.4.1 ($Date:: 2014-07-08 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version:   not available (version >= 1.4.1 not detected)
Python version:    2.7.8 |Continuum Analytics, Inc.| (default, Aug 21 2014, 18:22:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform:          linux2-i686
Byte-ordering:     little
Detected cores:    1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing only a light (yet comprehensive) subset of the test suite.
If you want a more complete test, try passing the '-heavy' flag to this
script (or set the 'heavy' parameter in case you are using bcolz.test()
call).  The whole suite will take more than 30 seconds to complete on a
relatively modern CPU and around 300 MB of RAM and 500 MB of disk
[32-bit platforms will always run significantly more lightly].

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
......s......s......s......s........................................ss..............................................................................................................................F...F.......................F.............................................................................................................................................................................................................................................................................................ssssss........../home/ec2-user/bcolz/bcolz/tests/test_carray.py:637: RuntimeWarning: overflow encountered in long_scalars
  self.assertEqual(sum(a[3:]), sum(u))
F..F.......................................................................................................................................................................................................................................................................................................................................................
======================================================================
FAIL: test02 (test_ctable.large_iterblocksDiskTest)
Testing `iterblocks` method with no stop
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/bcolz/bcolz/tests/test_ctable.py", line 1247, in test02
    self.assertEqual(s, (np.arange(blen-1, N, dtype='f8')*2).sum())
AssertionError: 20086674.0 != 100581168.0

======================================================================
FAIL: test02 (test_ctable.large_iterblocksMemoryTest)
Testing `iterblocks` method with no stop
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/bcolz/bcolz/tests/test_ctable.py", line 1247, in test02
    self.assertEqual(s, (np.arange(blen-1, N, dtype='f8')*2).sum())
AssertionError: 19966518.0 != 99980298.0

======================================================================
FAIL: test02 (test_ctable.small_iterblocksDiskTest)
Testing `iterblocks` method with no stop
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/bcolz/bcolz/tests/test_ctable.py", line 1247, in test02
    self.assertEqual(s, (np.arange(blen-1, N, dtype='f8')*2).sum())
AssertionError: 1070.0 != 4578.0

======================================================================
FAIL: test02 (test_carray.large_viewDiskTest)
Testing view() and iterators
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/bcolz/bcolz/tests/test_carray.py", line 637, in test02
    self.assertEqual(sum(a[3:]), sum(u))
AssertionError: 704982701 != 4999949997L

======================================================================
FAIL: test02 (test_carray.large_viewMemoryTest)
Testing view() and iterators
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/bcolz/bcolz/tests/test_carray.py", line 637, in test02
    self.assertEqual(sum(a[3:]), sum(u))
AssertionError: 704982701 != 4999949997L

----------------------------------------------------------------------
Ran 873 tests in 11.669s

FAILED (failures=5, skipped=12)

esc commented 10 years ago

With numpy 1.8 I get this, which has fewer failures:

[ec2-user@ip-172-31-42-166 bcolz]$ python -c "import bcolz; bcolz.test()"
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     0.7.2.dev
NumPy version:     1.8.2
Blosc version:     1.4.1 ($Date:: 2014-07-08 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version:   not available (version >= 1.4.1 not detected)
Python version:    2.7.8 |Continuum Analytics, Inc.| (default, Aug 21 2014, 18:22:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform:          linux2-i686
Byte-ordering:     little
Detected cores:    1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing only a light (yet comprehensive) subset of the test suite.
If you want a more complete test, try passing the '-heavy' flag to this
script (or set the 'heavy' parameter in case you are using bcolz.test()
call).  The whole suite will take more than 30 seconds to complete on a
relatively modern CPU and around 300 MB of RAM and 500 MB of disk
[32-bit platforms will always run significantly more lightly].

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
......s......s......s......s........................................ss........................................................................................................................................................................................................................................................................................................................................................................................................................................................ssssss........../home/ec2-user/bcolz/bcolz/tests/test_carray.py:637: RuntimeWarning: overflow encountered in long_scalars
  self.assertEqual(sum(a[3:]), sum(u))
F..F.......................................................................................................................................................................................................................................................................................................................................................
======================================================================
FAIL: test02 (test_carray.large_viewDiskTest)
Testing view() and iterators
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/bcolz/bcolz/tests/test_carray.py", line 637, in test02
    self.assertEqual(sum(a[3:]), sum(u))
AssertionError: 704982701 != 4999949997L

======================================================================
FAIL: test02 (test_carray.large_viewMemoryTest)
Testing view() and iterators
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/bcolz/bcolz/tests/test_carray.py", line 637, in test02
    self.assertEqual(sum(a[3:]), sum(u))
AssertionError: 704982701 != 4999949997L

----------------------------------------------------------------------
Ran 873 tests in 12.821s

FAILED (failures=2, skipped=12)

esc commented 10 years ago

Note that the two above are using current bcolz master.

esc commented 10 years ago

With 0.7.1 and numpy 1.8, everything seems fine on 32-bit.

esc commented 10 years ago

Okay great, I can reproduce the segmentation fault with Numpy 1.9 / Cython 0.20.2 from a miniconda install on a Red Hat 32-bit AMI:

[ec2-user@ip-172-31-42-166 bcolz]$ python -c "import bcolz; bcolz.test()"
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     0.7.1
NumPy version:     1.9.0
Blosc version:     1.4.1 ($Date:: 2014-07-08 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version:   not available (version >= 1.4.1 not detected)
Python version:    2.7.8 |Continuum Analytics, Inc.| (default, Aug 21 2014, 18:22:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform:          linux2-i686
Byte-ordering:     little
Detected cores:    1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing only a light (yet comprehensive) subset of the test suite.
If you want a more complete test, try passing the '-heavy' flag to this
script (or set the 'heavy' parameter in case you are using bcolz.test()
call).  The whole suite will take more than 30 seconds to complete on a
relatively modern CPU and around 300 MB of RAM and 500 MB of disk
[32-bit platforms will always run significantly more lightly].

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
......s......s......s......s......................................ss..................................................................................................................F...F.......................F.............................................................................................................................................................................Segmentation fault (core dumped)
esc commented 10 years ago

Using some hacks and then nosetests, the last test to be executed is:

Testing carray constructor with an object `dtype`. ... Segmentation fault (core dumped)

Note, however, that in my experience the last thing printed before a segfault may or may not indicate the source of the error. :grin:

asmeurer commented 10 years ago

That's usually because stdout might not be flushed. Try running sys.stdout.flush() after each print.

You could also try using https://pypi.python.org/pypi/faulthandler/ to get a traceback.
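Both suggestions can be combined in a few lines. This is a rough sketch, not something taken from the bcolz test suite: it is written for Python 3.3+, where `faulthandler` is part of the standard library (the PyPI package linked above is the backport for Python 2), and the log file name and the `report` helper are made up for illustration.

```python
import faulthandler
import os
import sys
import tempfile

# Hypothetical log location; any writable file that stays open will do.
log_path = os.path.join(tempfile.gettempdir(), "bcolz_crash_traceback.log")
crash_log = open(log_path, "w")

# On a fatal signal such as SIGSEGV, dump the Python traceback of every
# thread to the log file before the process dies.
faulthandler.enable(file=crash_log, all_threads=True)


def report(message):
    """Print a progress marker and flush it immediately."""
    print(message)
    sys.stdout.flush()  # so the marker is not lost in a buffer on a crash


report("about to run the suspect test")
```

On Python 3 the handler can also be switched on without touching the code, by running the interpreter with `-X faulthandler` or setting the `PYTHONFAULTHANDLER` environment variable.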

esc commented 10 years ago

So I just had another look, and it seems the segmentation fault no longer happens on current master. @ilanschnell could you perhaps try the current master and see if that fixes at least the segfault for you?

ilanschnell commented 10 years ago

I just tried running master on 32-bit Linux against Numpy 1.9. No more segfault! However, there are some failures:

FAIL: test02 (test_carray.large_viewDiskTest)
Testing view() and iterators
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/bcolz/tests/test_carray.py", line 654, in test02
    self.assertEqual(sum(a[3:]), sum(u))
AssertionError: 704982701 != 4999949997L

======================================================================
FAIL: test02 (test_carray.large_viewMemoryTest)
Testing view() and iterators
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/bcolz/tests/test_carray.py", line 654, in test02
    self.assertEqual(sum(a[3:]), sum(u))
AssertionError: 704982701 != 4999949997L

======================================================================
FAIL: test02 (test_ctable.large_iterblocksDiskTest)
Testing `iterblocks` method with no stop
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/bcolz/tests/test_ctable.py", line 1326, in test02
    self.assertEqual(s, (np.arange(blen - 1, N, dtype='f8') * 2).sum())
AssertionError: 20086674.0 != 100581168.0

======================================================================
FAIL: test02 (test_ctable.large_iterblocksMemoryTest)
Testing `iterblocks` method with no stop
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/bcolz/tests/test_ctable.py", line 1326, in test02
    self.assertEqual(s, (np.arange(blen - 1, N, dtype='f8') * 2).sum())
AssertionError: 19966518.0 != 99980298.0

======================================================================
FAIL: test02 (test_ctable.small_iterblocksDiskTest)
Testing `iterblocks` method with no stop
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/bcolz/tests/test_ctable.py", line 1326, in test02
    self.assertEqual(s, (np.arange(blen - 1, N, dtype='f8') * 2).sum())
AssertionError: 1070.0 != 4578.0

----------------------------------------------------------------------
Ran 873 tests in 4.798s

FAILED (failures=5, skipped=12)

esc commented 10 years ago

They all seem to be related to sum, perhaps an overflow problem?
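For the two `test_carray` failures, the numbers in the traceback fit that guess: the value reported on 32-bit is exactly the expected sum reduced to 32 bits. The sketch below checks the arithmetic in plain Python. It assumes the test sums the integers 3 through 99999, which matches the expected value in the traceback; `wrap_int32` is a hypothetical helper, and the snippet says nothing about where in bcolz or NumPy the 32-bit accumulation actually happens.

```python
def wrap_int32(value):
    """Reduce a Python integer to what a signed 32-bit accumulator would hold."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    if value >= 2**31:   # reinterpret the top bit as the sign
        value -= 2**32
    return value


# Assumption: the test sums 3 .. 99999, i.e. a[3:] of a 100000-element range.
exact = sum(range(3, 100000))
wrapped = wrap_int32(exact)
print(exact, wrapped)
```

This prints `4999949997 704982701`, the two sides of the `AssertionError` above. The `iterblocks` failures use `f8` values and do not follow this pattern.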

esc commented 10 years ago

I opened a PR at #54 to collect fixes for the failing tests.

esc commented 10 years ago

So #54 fixes two; there are three left. @FrancescAlted what are the chances of fixing them and doing a new release for tomorrow as suggested by @asmeurer ?

FrancescAlted commented 10 years ago

Hmm, my schedule is pretty tight lately, but if you have time, go ahead.

esc commented 10 years ago

You can fix the remaining three failures by forcing the dtype as in #56.

Curiously, it is a problem with numpy:

On 64 bit:

In [17]: ra = np.fromiter(((i, i * 2., i * 3) for i in xrange(120)), dtype='i4,f8,i8')

In [18]: ra[99:]['f1'].sum()
Out[18]: 4578.0

And on 32 bit:

In [2]: ra = np.fromiter(((i, i * 2., i * 3) for i in xrange(120)), dtype='i4,f8,i8')

In [3]: ra[99:]['f1'].sum()
Out[3]: 1070.0

Interestingly, no overflow warning is issued; perhaps this is to be expected?

sum lets you specify the output dtype, which then gives us the same result as on 64-bit:

In [4]: ra[99:]['f1'].sum(dtype='f16')
Out[4]: 4578.0

Anyone for an explanation?

esc commented 10 years ago

I forgot to mention that by default, sum will return a type equal to the type of self, so in this case that of the middle field, which is 'f8'.
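For a cross-check that does not depend on NumPy's reduction at all, the expected value can be computed in plain Python. A small sketch (Python 3 syntax, unlike the Python 2 session above; `expected_f1_sum` is a hypothetical helper, not part of bcolz or its test suite):

```python
import math


def expected_f1_sum(start, stop=120):
    """Sum of the 'f1' column (i * 2.0) for rows start .. stop - 1."""
    # math.fsum accumulates in full precision and does not involve NumPy.
    return math.fsum(i * 2.0 for i in range(start, stop))


reference = expected_f1_sum(99)
print(reference)
```

The result is 4578.0, which agrees with the 64-bit output and with the `dtype='f16'` result on 32-bit. It does not explain why the default reduction returns 1070.0 on 32-bit.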

esc commented 10 years ago

Hi, sorry for not getting back to you earlier. I believe we have fixed all the issues related to 32-bit platforms in the current master. Sorry also that we couldn't get a release out the door in time for the new Anaconda. Next time it would be good to give us all earlier notice, like 3 or 4 weeks, to get stuff prepped and ready and ideally have some time for users to play with the new release before it becomes enshrined in an Anaconda downloadable.

Closing for now, please open a new issue should new problems come to light.

asmeurer commented 10 years ago

The timing of this was bad because we had already planned the Anaconda release for the first of October, and NumPy 1.9.0 came out just a few weeks before. I would recommend keeping on top of new releases for the bcolz dependencies (perhaps even keep a Travis build against the master branches of each).

esc commented 10 years ago

FWIW: I bisected this and the segfault was fixed in fd8a77e4b78920c10a56e9f924de53aa48313696

esc commented 10 years ago

@asmeurer so 0.7.2 is out now! Where could I find the conda recipe for bcolz?