datajoint / datajoint-python

Relational data pipelines for the science lab
https://datajoint.com/docs
GNU Lesser General Public License v2.1

Cell array of arrays of doubles cannot be fetched in python, only in matlab #1098

Open renanmcosta opened 1 year ago

renanmcosta commented 1 year ago

Bug Report

Description

Fetching fails in Python when each entry of a given attribute (defined in MATLAB) is a cell array whose elements are arrays of doubles. Fetching in MATLAB works as expected.

Reproducibility

Windows, Python 3.9.13, DataJoint 0.13.8

Steps:

  1. Define and populate a table in MATLAB containing an attribute such as: epoch_pos_range=null : blob # list of y position ranges corresponding to n epochs in epoch_list, (e.g., {[y_on y_off],[y_on y_off]} for epoch_list {'epoch1','epoch2'})
  2. Fetch in MATLAB (works as intended)
  3. Attempt to fetch in Python (throws a reshaping error for the array)

Error stack:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in 
----> 1 VM['opto'].OptoSession.fetch('epoch_pos_range')

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\fetch.py in __call__(self, offset, limit, order_by, format, as_dict, squeeze, download_path, *attrs)
    227             attributes = [a for a in attrs if not is_key(a)]
    228             ret = self._expression.proj(*attributes)
--> 229             ret = ret.fetch(
    230                 offset=offset,
    231                 limit=limit,

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\fetch.py in __call__(self, offset, limit, order_by, format, as_dict, squeeze, download_path, *attrs)
    287                 for name in heading:
    288                     # unpack blobs and externals
--> 289                     ret[name] = list(map(partial(get, heading[name]), ret[name]))
    290                 if format == "frame":
    291                     ret = pandas.DataFrame(ret).set_index(heading.primary_key)

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\fetch.py in _get(connection, attr, data, squeeze, download_path)
    108         if attr.uuid
    109         else (
--> 110             blob.unpack(
    111                 extern.get(uuid.UUID(bytes=data)) if attr.is_external else data,
    112                 squeeze=squeeze,

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\blob.py in unpack(blob, squeeze)
    603         return blob
    604     if blob is not None:
--> 605         return Blob(squeeze=squeeze).unpack(blob)

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\blob.py in unpack(self, blob)
    127         blob_format = self.read_zero_terminated_string()
    128         if blob_format in ("mYm", "dj0"):
--> 129             return self.read_blob(n_bytes=len(self._blob) - self._pos)
    130 
    131     def read_blob(self, n_bytes=None):

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\blob.py in read_blob(self, n_bytes)
    161                 % data_structure_code
    162             )
--> 163         v = call()
    164         if n_bytes is not None and self._pos - start != n_bytes:
    165             raise DataJointError("Blob length check failed! Invalid blob")

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\blob.py in read_cell_array(self)
    493         return (
    494             self.squeeze(
--> 495                 np.array(result).reshape(shape, order="F"), convert_to_scalar=False
    496             )
    497         ).view(MatCell)

ValueError: cannot reshape array of size 4 into shape (1,2)
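
The final ValueError can be reproduced with NumPy alone. The following is a minimal sketch, assuming each cell element deserializes to a 1×2 array of doubles; the names `elements` and `cell_shape` are illustrative stand-ins and not part of DataJoint. `np.array` stacks the two equally shaped elements into one numeric array holding 4 values, which cannot then be reshaped to the cell's own 1×2 shape.

```python
import numpy as np

# Stand-ins for the two cell elements {[y_on y_off], [y_on y_off]} (assumed 1x2 each)
elements = [np.array([[0.0, 1.0]]), np.array([[2.0, 3.0]])]
cell_shape = (1, 2)  # shape of the MATLAB cell array itself

# np.array merges the equally shaped elements into one (2, 1, 2) numeric array
stacked = np.array(elements)

try:
    stacked.reshape(cell_shape, order="F")
    error_message = None
except ValueError as exc:
    # 4 values cannot fit into a 1x2 container
    error_message = str(exc)

print(stacked.shape, error_message)
```

The same mismatch would be expected whenever the cell elements are non-scalar and all share one shape, because the stacked array then has more values than the cell has slots.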

kabilar commented 1 year ago

Thanks for the report, @renanmcosta. Typically the MATLAB cell array gets properly packed and unpacked. We have not encountered the error that you reported. We will investigate further and get back to you.

renanmcosta commented 1 year ago

For now I've managed to fetch with the temporary fix below. I don't think it's very robust, but I'm copying it here in case it's informative.


def read_cell_array(self):
    """deserialize MATLAB cell array"""
    n_dims = self.read_value()
    shape = self.read_value(count=n_dims)
    n_elem = int(np.prod(shape))
    result = [self.read_blob(n_bytes=self.read_value()) for _ in range(n_elem)]
    # If the elements are not all scalars, let the first dimension absorb the
    # extra values. This is not expected to work for ragged arrays.
    if n_elem != len(np.ravel(result, order="F")):
        shape = (-1,) + tuple(shape[1:n_dims])
    return (
        self.squeeze(
            np.array(result).reshape(shape, order="F"), convert_to_scalar=False
        )
    ).view(MatCell)
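
To see what the fallback branch does to the result's shape, here is a small NumPy-only sketch of the same reshape logic. The function name `temp_fix_reshape` and the toy inputs are assumptions made for illustration; the blob reading is left out entirely. With non-scalar elements the first dimension becomes -1, so the element dimensions are flattened into it rather than preserved.

```python
import numpy as np

def temp_fix_reshape(elements, cell_shape):
    """Mimic the reshape step of the temporary fix (hypothetical helper)."""
    shape = tuple(cell_shape)
    n_elem = int(np.prod(shape))
    # Non-scalar elements contribute more values than the cell has slots
    if n_elem != len(np.ravel(elements, order="F")):
        shape = (-1,) + shape[1:]
    return np.array(elements).reshape(shape, order="F")

# A 1x2 cell holding two 2x3x2 arrays: 24 values end up as a (12, 2) array
collapsed = temp_fix_reshape([np.zeros((2, 3, 2)), np.zeros((2, 3, 2))], (1, 2))
# A 1x2 cell of scalars keeps the cell shape
scalars = temp_fix_reshape([1.0, 2.0], (1, 2))

print(collapsed.shape, scalars.shape)
```

By the same arithmetic, a 1×2 cell holding two 10×5370×10 arrays would contain 1,074,000 values and come back with shape (537000, 2).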

Paschas commented 8 months ago

Greetings,

I have just encountered the same problem, and the temporary fix seems to work (thanks a lot, @renanmcosta).

The temporary fix returns an array, but with shape = (537000, 2).
In MATLAB it's a 1×2 cell array {10×5370×10 single} {10×5370×10 single}.

type(temp_fixed) --> datajoint.blob.MatCell

Can I retrieve the original dimensions, or is this a robustness problem of the temporary fix?

Thanks in advance

dimitri-yatsenko commented 2 months ago

Hi @Paschas, could you update us on this? We are looking to resolve this.

renanmcosta commented 2 months ago

The temp fix is responsible for the shape differences there. Lately, I have been using a simpler fix, which shouldn't collapse any dimensions. This one should always work, though it can sometimes lead to awkward array nesting.

import datajoint as dj
import numpy as np


def fix_cell_array_fetch():
    """Fixes bug that prevents cell arrays from being fetched in python in certain
    cases. Replaces cell array unpacking method in the datajoint module with working
    version. Call once before fetching.
    """

    class Blob(dj.blob.Blob):
        def read_cell_array(self):
            """deserialize MATLAB cell array"""
            n_dims = self.read_value()
            shape = self.read_value(count=n_dims)
            n_elem = int(np.prod(shape))
            result = [self.read_blob(n_bytes=self.read_value()) for _ in range(n_elem)]
            # dtype="object" skips the reshape to the cell's shape entirely
            return (
                self.squeeze(np.array(result, dtype="object"), convert_to_scalar=False)
            ).view(dj.blob.MatCell)

    dj.blob.Blob = Blob

dimitri-yatsenko commented 2 months ago

Let's see if we can incorporate this in this coming release.

Paschas commented 2 months ago

Greetings @dimitri-yatsenko & @renanmcosta

Without @renanmcosta's fixes I used to get 2 types of error:

in Blob.read_cell_array(self)
    [493] n_elem = int(np.prod(shape))
    [494] result = [self.read_blob(n_bytes=self.read_value()) for _ in range(n_elem)]
    [495] return (
    [496]     self.squeeze(
    [497]         #np.array(result).reshape(shape, order="F"), convert_to_scalar=False
    [498]         #np.array(result).reshape(shape, order="C"), convert_to_scalar=False
--> [499]         np.array(result).reshape(shape, order="A"), convert_to_scalar=False
    [500]
    [501])
    [502].view(MatCell)

ValueError: cannot reshape array of size 2560 into shape (1,10)

or

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (4,) + inhomogeneous part.

fix_cell_array_fetch() works, but I would be cautious (thanks again, @renanmcosta).

On a different but similar occasion, the arrays had the correct shape but the data were shuffled. Eventually I changed the following:

# line 243 of blob.py
def read_array(self):
    .....
    # Changed nothing
    .....
    return self.squeeze(data.reshape(shape, order="C"))  # It was F
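
For context on what this switch changes, the sketch below reshapes the same flat buffer with both orders. The buffer is a made-up example, not data from a blob. MATLAB stores arrays column-major, which is what order="F" assumes, so switching to order="C" would presumably only be correct for blobs whose values were written row-major; it seems worth checking against data with known contents before relying on it.

```python
import numpy as np

# Six values as they might sit in a flat buffer
flat = np.arange(6)

# Column-major (MATLAB / Fortran convention): fill down the columns first
col_major = flat.reshape((2, 3), order="F")
# Row-major (C convention): fill along the rows first
row_major = flat.reshape((2, 3), order="C")

print(col_major.tolist())  # [[0, 2, 4], [1, 3, 5]]
print(row_major.tolist())  # [[0, 1, 2], [3, 4, 5]]
```

Both results have the same shape, which matches the symptom described: correct dimensions, but values in the wrong positions when the order does not match how the data were written.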

renanmcosta commented 1 week ago

We just found a new case where the latest approach I posted above still raises a ValueError, e.g.:

ValueError: could not broadcast input array from shape (3,) into shape (1,)

It happens when the first dimension of each entry is the same, and appears to be a limitation of numpy (discussion). Ultimately, the problem is that MATLAB cell arrays and numpy arrays are intended as different types of objects, and as a result MATLAB cell arrays can be ragged in ways that numpy is unwilling to support. Here's my current solution, which should hopefully retain the structure of each entry:

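import datajoint as dj
import numpy as np


class fixed_Blob(dj.blob.Blob):
    def read_cell_array(self):
        """deserialize MATLAB cell array"""
        n_dims = self.read_value()
        shape = self.read_value(count=n_dims)
        n_elem = int(np.prod(shape))
        result = [self.read_blob(n_bytes=self.read_value()) for _ in range(n_elem)]
        # Pre-allocate a 1-D object array so that each entry is stored as-is
        # instead of being merged with the others into one n-D array.
        arr = np.empty(n_elem, dtype="object")
        arr[:] = result
        return (self.squeeze(arr, convert_to_scalar=False)).view(dj.blob.MatCell)


dj.blob.Blob = fixed_Blob  # activate the patch, as in the earlier fix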
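
The idea behind this version can be tried without DataJoint. The sketch below is a variant under stated assumptions, not the code above: `to_object_array` is a hypothetical helper, it fills the object array one element at a time instead of using `arr[:] = result`, and the optional `shape` argument (restoring the cell's own shape in column-major order) is an addition that does not appear in the thread. The direct `np.array(..., dtype=object)` conversion is wrapped in try/except because whether it raises may depend on the NumPy version.

```python
import numpy as np

def to_object_array(items, shape=None):
    """Store each item unchanged in an object array (hypothetical helper)."""
    arr = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        arr[i] = item  # stores the item itself in slot i
    if shape is not None:
        # Assumption: restore the cell's own shape, column-major like MATLAB
        arr = arr.reshape(shape, order="F")
    return arr

# Entries that share a first dimension but differ in the second
same_first_dim = [np.zeros((1, 3)), np.zeros((1, 4))]
try:
    np.array(same_first_dim, dtype=object)
    direct_conversion_failed = False
except ValueError:
    direct_conversion_failed = True

cell = to_object_array(same_first_dim, shape=(1, 2))
identical = to_object_array([np.zeros((2, 3)), np.zeros((2, 3))])

print(direct_conversion_failed, cell.shape, identical.shape)
```

Filling the slots one by one keeps every entry's shape intact, including entries of identical shape, which `np.array(..., dtype=object)` would instead stack into a single multi-dimensional object array.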