Open TomDLT opened 4 years ago
ooof, that is annoying. couple of things: is the "immediate" part relevant? if you store it, exit the session, and then download it in a separate session, does it work? and what is the difference in the data? is it totally corrupted? or is it an exactness/dtype problem?
The immediate part was just to rule out possible corruption at the storage level, but we discovered the issue on an array that had been stored for a long time.
The example above is made to reproduce with synthetic data, but on the original array where we discovered the bug, we got a pretty big difference in the data:
Compressed to 0.06% the size
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-11-53e61b81e11e> in <module>()
6 cci.upload_raw_array("20200409_cc_test", array_in)
7 array_out = cci.download_raw_array("20200409_cc_test")
----> 8 np.testing.assert_array_equal(array_in, array_out)
9 print('success !!')
/home/jlg/tomdlt/miniconda3/envs/py27/lib/python2.7/site-packages/numpy/testing/_private/utils.pyc in assert_array_equal(x, y, err_msg, verbose)
902 __tracebackhide__ = True # Hide traceback for py.test
903 assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
--> 904 verbose=verbose, header='Arrays are not equal')
905
906
/home/jlg/tomdlt/miniconda3/envs/py27/lib/python2.7/site-packages/numpy/testing/_private/utils.pyc in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
825 verbose=verbose, header=header,
826 names=('x', 'y'), precision=precision)
--> 827 raise AssertionError(msg)
828 except ValueError:
829 import traceback
AssertionError:
Arrays are not equal
Mismatch: 18%
Max absolute difference: 60.51625653
Max relative difference: 36.06609662
x: array([[[64.188949, 61.733371, 62.085814, ..., 62.085814, 62.085814,
63.139827],
[60.672661, 57.098978, 57.459062, ..., 57.459062, 57.098978,...
y: array([[[64.188949, 64.188949, 64.188949, ..., 64.188949, 64.188949,
64.188949],
[64.188949, 64.188949, 64.188949, ..., 64.188949, 64.188949,...
Interestingly, the histograms of values are identical, which suggests that it is just a reordering of values, probably coming from incorrect handling of arrays that are not C-contiguous.
hist_in = np.histogram(array_in.ravel(), bins=100)
hist_out = np.histogram(array_out.ravel(), bins=100)
np.testing.assert_array_equal(hist_in[0], hist_out[0])
np.testing.assert_array_equal(hist_in[1], hist_out[1])
This is confirmed by the following test, which passes (assuming that array_in is 2D and Fortran-ordered):
np.testing.assert_array_equal(array_in, array_out.reshape(array_out.shape[::-1]).T)
It seems that the array is stored with the correct data and shape, but the strides/ordering are lost in the process. An easy fix would be to call np.ascontiguousarray on the array before storing it, at the cost of a data copy.
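To make the suspected mechanism concrete, here is a minimal sketch that reproduces the same symptoms without any storage backend. The `naive_roundtrip` function is a hypothetical stand-in, not cottoncandy's actual code: it assumes the buffer is dumped in memory order and rebuilt as C-ordered from dtype and shape alone.

```python
import numpy as np


def naive_roundtrip(arr):
    """Hypothetical stand-in for upload + download (NOT cottoncandy's code).

    Dumps the buffer in memory order, keeps only dtype and shape, and
    rebuilds the array assuming C order.
    """
    # order="A" writes Fortran order when the array is F-contiguous.
    payload = arr.tobytes(order="A")
    return np.frombuffer(payload, dtype=arr.dtype).reshape(arr.shape)


# Small Fortran-ordered input, same role as array_in in the thread.
array_in = np.asfortranarray(np.arange(12, dtype=np.float64).reshape(3, 4))
array_out = naive_roundtrip(array_in)

# Values are shuffled, but the set of values is unchanged.
reordered = not np.array_equal(array_in, array_out)
same_values = np.array_equal(np.sort(array_in, axis=None),
                             np.sort(array_out, axis=None))

# The reshape/transpose trick from the thread undoes the shuffle.
undone = np.array_equal(array_in,
                        array_out.reshape(array_out.shape[::-1]).T)

# Forcing a C-contiguous copy before storing avoids the problem.
fixed_out = naive_roundtrip(np.ascontiguousarray(array_in))
fixed = np.array_equal(array_in, fixed_out)
```

If the real code path behaves like this stand-in, all four flags (`reordered`, `same_values`, `undone`, `fixed`) should be true, which matches what we observed on the real array.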
just to follow up from our conversation. this can be fixed in this code block: https://github.com/gallantlab/cottoncandy/blob/master/cottoncandy/interfaces.py#L603
A more solid solution is to remove PY2 support.
We also need to add a test for large arrays.
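A rough sketch of what such a test could look like, covering several memory layouts. `InMemoryStore` and `check_roundtrip` are hypothetical names used only so the example runs on its own; the real test would pass a cottoncandy interface instead, and the shape would need to be raised to whatever size triggers the bug.

```python
import numpy as np


class InMemoryStore:
    """Hypothetical stand-in for the remote store, only to run the test."""

    def __init__(self):
        self._objects = {}

    def upload_raw_array(self, name, arr):
        # Force C order before dumping so that shape alone is enough
        # to rebuild the array on download.
        arr = np.ascontiguousarray(arr)
        self._objects[name] = (arr.tobytes(), arr.dtype, arr.shape)

    def download_raw_array(self, name):
        payload, dtype, shape = self._objects[name]
        return np.frombuffer(payload, dtype=dtype).reshape(shape)


def check_roundtrip(store, shape=(200, 300)):
    """Upload and download the same data in several memory layouts."""
    rng = np.random.RandomState(0)
    base = rng.randn(*shape)
    layouts = {
        "C": np.ascontiguousarray(base),
        "F": np.asfortranarray(base),
        "strided": base[::2, ::3],
        "transposed": base.T,
    }
    for label, array_in in layouts.items():
        store.upload_raw_array("test_" + label, array_in)
        array_out = store.download_raw_array("test_" + label)
        np.testing.assert_array_equal(array_in, array_out, err_msg=label)
    return sorted(layouts)


checked = check_roundtrip(InMemoryStore())
```

Running the same check against an interface that drops the ordering should fail on the "F" layout, and likely on the other non-C-contiguous ones as well.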
With @gxlilyBerkeley, we identified a bug when storing raw arrays, which was quite annoying.
If we upload some arrays and download them immediately, they might be changed in the process. We narrowed down the bug to the following conditions:
Here is a reproducing example: