Blosc / bloscpack

Command line interface to and serialization format for Blosc
BSD 3-Clause "New" or "Revised" License

test_append.test_append_mix_shuffle error #76

Closed by toddrme2178 5 years ago

toddrme2178 commented 6 years ago

I am trying to package bloscpack for openSUSE and I am getting the following error:

======================================================================
FAIL: test_append.test_append_mix_shuffle
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/abuild/rpmbuild/BUILD/bloscpack-0.13.0/test/test_append.py", line 366, in test_append_mix_shuffle
    nt.assert_equal(blosc_header_last['flags'], 0)
AssertionError: 2 != 0

----------------------------------------------------------------------

This is with:

Any idea what might be going wrong?

esc commented 5 years ago

I am assuming that this means that the buffer is a pure-memory copy:

https://github.com/Blosc/c-blosc/blob/master/README_HEADER.rst

I.e. bit 1 is set - this might happen when things change internally in Blosc. The decompressed and compressed result should be untouched though. It might make sense to simply use a larger array in the test case or perhaps upgrade the version of blosc.
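To make the flag check concrete, here is a minimal sketch that unpacks a 16-byte Blosc 1 header with the standard library only. The layout and the two flag bits follow the README_HEADER.rst linked above; the helper name `parse_blosc_header` and the hand-built example header are assumptions for illustration (the values mirror the header reported later in this thread), and this is not bloscpack's own `decode_blosc_header`.

```python
import struct

# Flag bits as described in c-blosc's README_HEADER.rst (Blosc 1 format).
BYTE_SHUFFLE = 0x1  # bit 0: byte shuffle was applied
PURE_MEMCPY = 0x2   # bit 1: buffer was stored as a plain memory copy


def parse_blosc_header(header):
    """Unpack the first 16 bytes of a Blosc 1 buffer (hypothetical helper)."""
    fields = struct.unpack("<BBBBIII", header[:16])
    names = ("version", "versionlz", "flags", "typesize",
             "nbytes", "blocksize", "ctbytes")
    return dict(zip(names, fields))


# Hand-built header for illustration; no Blosc installation is needed.
example = struct.pack("<BBBBIII", 2, 1, 2, 8, 8000000, 524288, 8000016)
parsed = parse_blosc_header(example)

is_memcpy = bool(parsed["flags"] & PURE_MEMCPY)
is_shuffled = bool(parsed["flags"] & BYTE_SHUFFLE)
print(parsed["flags"], is_memcpy, is_shuffled)
```

A `flags` value of 2 therefore means the memory-copy bit is set and the shuffle bit is not, which is what the failing assertion reports.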

Maybe @FrancescAlted can shed some more light on this?

esc commented 5 years ago

I am now seeing this with the recently released blosc 1.6.1. Further testing confirmed that this is indeed a pure memory-copy issue.

FrancescAlted commented 5 years ago

Yes, it looks like a pure memory-copy to me. If the question is why the pure memory-copy bit is sometimes set and other times not, my guess is that the codecs change between C-Blosc releases; these changes typically affect the compression ratio (albeit slightly in general), so it may happen that the compression ratio is sometimes not good enough and Blosc falls back to a pure memory-copy instead.
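Since the memory-copy bit can flip between releases, one option for a test that only cares about the shuffle setting is to ignore that bit before comparing. The sketch below is an assumption about how such a check could look; `without_memcpy_bit` is a hypothetical helper, not part of bloscpack, and this is not necessarily how the issue was resolved.

```python
BYTE_SHUFFLE = 0x1  # bit 0 of the Blosc 1 flags byte
PURE_MEMCPY = 0x2   # bit 1 of the Blosc 1 flags byte


def without_memcpy_bit(flags):
    """Clear the pure memory-copy bit so only the other flag bits are compared."""
    return flags & ~PURE_MEMCPY


# Both the compressed (0) and the memory-copied (2) unshuffled cases
# compare equal to 0 once the memory-copy bit is cleared.
unshuffled_cases = [without_memcpy_bit(f) for f in (0, 2)]

# A shuffled buffer still reports the shuffle bit.
shuffled_case = without_memcpy_bit(BYTE_SHUFFLE)
print(unshuffled_cases, shuffled_case)
```

The trade-off is that such a test no longer notices whether the data was actually compressed, so it only fits tests whose purpose is to verify the shuffle setting.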

esc commented 5 years ago

O.K. I have been debugging this issue further and it turns out that this phenomenon only appears because the byte shuffle filter is deactivated. If I reactivate the byte shuffle, the data is no longer a pure memory-copy.

esc commented 5 years ago

So, I have managed to reproduce this behaviour in a standalone snippet, using python-blosc 1.6.1:

In [17]: import blosc                                                                                                                                                                                       

In [18]: import numpy as np                                                                                                                                                                                 

In [19]: from bloscpack.headers import decode_blosc_header, decode_blosc_flags                                                                                                                              

In [20]: a = np.linspace(0, 1, 1e6)                                                                                                                                                                         
/home/esc/git/bloscpack/.venv3/bin/ipython:1: DeprecationWarning: object of type <class 'float'> cannot be safely interpreted as an integer.
  #!/home/esc/git/bloscpack/.venv3/bin/python3

In [21]: b = a.tobytes()                                                                                                                                                                                    

In [22]: c = blosc.compress(b, shuffle=False)                                                                                                                                                               

In [23]: decode_blosc_header(c[:16])                                                                                                                                                                        
Out[23]: 
OrderedDict([('version', 2),
             ('versionlz', 1),
             ('flags', 2),
             ('typesize', 8),
             ('nbytes', 8000000),
             ('blocksize', 524288),
             ('ctbytes', 8000016)])

In [24]: c = blosc.compress(b, shuffle=True)                                                                                                                                                                

In [25]: decode_blosc_header(c[:16])                                                                                                                                                                        
Out[25]: 
OrderedDict([('version', 2),
             ('versionlz', 1),
             ('flags', 1),
             ('typesize', 8),
             ('nbytes', 8000000),
             ('blocksize', 524288),
             ('ctbytes', 969806)])

So, maybe linspace isn't the best input for this test.

esc commented 5 years ago

For the record, here is the same example, but with python-blosc 1.5.1:

In [1]: import blosc                                                                                                                                                                                        

In [2]: import numpy as np                                                                                                                                                                                  

In [3]: from bloscpack.headers import decode_blosc_header, decode_blosc_flags                                                                                                                               

In [4]: a = np.linspace(0, 1, 1e6)                                                                                                                                                                          
/home/esc/git/bloscpack/.venv3/bin/ipython:1: DeprecationWarning: object of type <class 'float'> cannot be safely interpreted as an integer.
  #!/home/esc/git/bloscpack/.venv3/bin/python3

In [5]: b = a.tobytes()                                                                                                                                                                                     

In [6]: c = blosc.compress(b, shuffle=False)                                                                                                                                                                

In [7]: decode_blosc_header(c[:16])                                                                                                                                                                         
Out[7]: 
OrderedDict([('version', 2),
             ('versionlz', 1),
             ('flags', 0),
             ('typesize', 8),
             ('nbytes', 8000000),
             ('blocksize', 524288),
             ('ctbytes', 7988999)])

In [8]: c = blosc.compress(b, shuffle=True)                                                                                                                                                                 

In [9]: decode_blosc_header(c[:16])                                                                                                                                                                         
Out[9]: 
OrderedDict([('version', 2),
             ('versionlz', 1),
             ('flags', 1),
             ('typesize', 8),
             ('nbytes', 8000000),
             ('blocksize', 524288),
             ('ctbytes', 585276)])

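So, as you can see, without shuffle the input just barely compresses with 1.5.1 (7988999 of 8000000 bytes), so my gut feeling is that the threshold for detecting incompressible data and falling back to a pure memory-copy has changed between these releases.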
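The arithmetic behind that observation can be checked with a few lines. The sizes are copied from the two sessions above; the dictionary layout and variable names are only for illustration.

```python
NBYTES = 8000000  # uncompressed size reported in every header above

# ctbytes from the sessions above, keyed by (python-blosc version, shuffle).
ctbytes = {
    ("1.5.1", False): 7988999,
    ("1.5.1", True): 585276,
    ("1.6.1", False): 8000016,
    ("1.6.1", True): 969806,
}

for (version, shuffle), size in sorted(ctbytes.items()):
    saved = NBYTES - size
    percent = 100 * saved / NBYTES
    print(f"python-blosc {version}, shuffle={shuffle}: "
          f"{size} bytes, saved {saved} ({percent:.2f}%)")

# Without shuffle, 1.5.1 saves only a sliver of the input.
saved_151_noshuffle = NBYTES - ctbytes[("1.5.1", False)]

# With 1.6.1 the unshuffled result is the input plus the 16-byte header.
memcpy_overhead = ctbytes[("1.6.1", False)] - NBYTES
```

Without shuffle, 1.5.1 saves 11001 bytes, roughly 0.14% of the input, while the 1.6.1 result is exactly 16 bytes larger than the input, which matches a 16-byte header followed by an uncompressed copy.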

Probably it makes most sense to use different input data in this case.

esc commented 5 years ago

There is a proposed fix here: #82

This will also solve the current build failures on master.

esc commented 5 years ago

This is fixed in master with #82