gallantlab / cottoncandy

sugar for s3
http://gallantlab.github.io/cottoncandy/
BSD 2-Clause "Simplified" License
33 stars 16 forks source link

Bug with storing raw arrays #76

Open TomDLT opened 4 years ago

TomDLT commented 4 years ago

With @gxlilyBerkeley, we identified a bug when storing raw arrays, which was quite annoying.

If we upload some arrays and download them immediately, they might be changed in the process. We narrowed down the bug to the following conditions:

Here is a reproducing example:

import sys
import numpy as np
import cottoncandy as cc

# the bug is only in python 2
print(sys.version_info)

# create some large sparse data to be compressed
array_in = np.zeros((100, 100, 30000)).T
mask = np.random.randn(*array_in.shape) > 3
array_in[mask] = 1

# there is no bug if the array is contiguous in memory
if False:
    array_in = np.ascontiguousarray(array_in)

# store and load the array, using raw arrays
cci = cc.get_interface('tomdlt_tmp', verbose=False)
cci.upload_raw_array("test_array", array_in)
array_out = cci.download_raw_array("test_array")

# array_in and array_out are not identical
np.testing.assert_array_equal(array_in, array_out)
anwarnunez commented 4 years ago

ooof, that is annoying. couple of things: is the "immediate" part relevant? if you store it, exit the session, and then download it in a separate session, does it work? and what is the difference in the data? is it totally corrupted? or is it an exactness/dtype problem?

TomDLT commented 4 years ago

The immediate part was just to sort out a possible corruption at the storage level, but we discovered the issue on an array stored for a long time.

The example above is made to reproduce with synthetic data, but on the original array where we discovered the bug, we got a pretty big difference in the data:

Compressed to 0.06% the size

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-11-53e61b81e11e> in <module>()
      6 cci.upload_raw_array("20200409_cc_test", array_in)
      7 array_out = cci.download_raw_array("20200409_cc_test")
----> 8 np.testing.assert_array_equal(array_in, array_out)
      9 print('success !!')

/home/jlg/tomdlt/miniconda3/envs/py27/lib/python2.7/site-packages/numpy/testing/_private/utils.pyc in assert_array_equal(x, y, err_msg, verbose)
    902     __tracebackhide__ = True  # Hide traceback for py.test
    903     assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
--> 904                          verbose=verbose, header='Arrays are not equal')
    905 
    906 

/home/jlg/tomdlt/miniconda3/envs/py27/lib/python2.7/site-packages/numpy/testing/_private/utils.pyc in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
    825                                 verbose=verbose, header=header,
    826                                 names=('x', 'y'), precision=precision)
--> 827             raise AssertionError(msg)
    828     except ValueError:
    829         import traceback

AssertionError: 
Arrays are not equal

Mismatch: 18%
Max absolute difference: 60.51625653
Max relative difference: 36.06609662
 x: array([[[64.188949, 61.733371, 62.085814, ..., 62.085814, 62.085814,
         63.139827],
        [60.672661, 57.098978, 57.459062, ..., 57.459062, 57.098978,...
 y: array([[[64.188949, 64.188949, 64.188949, ..., 64.188949, 64.188949,
         64.188949],
        [64.188949, 64.188949, 64.188949, ..., 64.188949, 64.188949,...
TomDLT commented 4 years ago

Interestingly, the histograms of values are identical, which means that it is probably just a reordering of values, coming from an incorrect handling of non-contiguous arrays.

hist_in = np.histogram(array_in.ravel(), bins=100)
hist_out = np.histogram(array_out.ravel(), bins=100)

np.testing.assert_array_equal(hist_in[0], hist_out[0])
np.testing.assert_array_equal(hist_in[1], hist_out[1])
TomDLT commented 4 years ago

This is confirmed by the following test, which passes (assuming that array_in is 2D and FORTRAN-ordered):

np.testing.assert_array_equal(array_in, array_out.reshape(array_out.shape[::-1]).T)

It seems that the array is stored with the correct data and shape, but the strides/ordering are lost in the process. An easy fix would be to use np.ascontiguousarray before the storing, with the inconvenience of a data copy.

anwarnunez commented 4 years ago

just to follow up from our conversation. this can be fixed in this code block: https://github.com/gallantlab/cottoncandy/blob/master/cottoncandy/interfaces.py#L603

A more solid solution is to remove PY2 support.

We also need to add a test for large arrays.