NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

NWBData append and extend behavior depends on underlying type #727

Open NileGraddis opened 5 years ago

NileGraddis commented 5 years ago

Here is a script that replicates this issue (python 3.6, commit 32f3b037d4ce8771e68c0f091731462c65d3ae4c):

import pynwb
import numpy as np

def main():

    iters = 5
    iter_length = 3

    # case 1: numpy array store, append
    data = pynwb.core.NWBData(
        name='some_name',
        data=np.array([])
    )
    for ii in range(iters):
        data.append(np.arange(iter_length))
    print(data.data)

    # case 2: list store, append
    data = pynwb.core.NWBData(
        name='some_name',
        data=[]
    )
    for ii in range(iters):
        data.append(np.arange(iter_length))
    print(data.data)

    # case 3: numpy array store, extend
    data = pynwb.core.NWBData(
        name='some_name',
        data=np.array([])
    )
    for ii in range(iters):
        data.extend(np.arange(iter_length))
    print(data.data)

    # case 4: list store, extend
    data = pynwb.core.NWBData(
        name='some_name',
        data=[]
    )
    for ii in range(iters):
        data.extend(np.arange(iter_length))
    print(data.data)

if __name__ == '__main__':
    main()

This produces:

[0. 1. 2. 0. 1. 2. 0. 1. 2. 0. 1. 2. 0. 1. 2.]
[array([0, 1, 2]), array([0, 1, 2]), array([0, 1, 2]), array([0, 1, 2]), array([0, 1, 2])]
[0. 1. 2. 0. 1. 2. 0. 1. 2. 0. 1. 2. 0. 1. 2.]
[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]

This is odd because the second case is a conventional representation of a 2D array (a list of rows), while the first is a flat 1D array, yet the only distinction is whether the NWBData object was constructed with a list or a numpy array as its underlying data store. Similar examples can be constructed for higher-dimensional data: appending a 2D numpy array to a list adds a single element, while calling np.append flattens both arguments and returns a 1D array.
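The divergence traces back to how a plain Python list and np.append behave; a minimal standalone illustration (plain numpy and list, no pynwb involved):

```python
import numpy as np

# list.append nests: each appended array becomes one new element,
# so the result reads like a list of rows
rows = []
rows.append(np.arange(3))
rows.append(np.arange(3))
# rows -> [array([0, 1, 2]), array([0, 1, 2])]

# np.append flattens both inputs and returns a new 1D array each time
flat = np.append(np.array([]), np.arange(3))
flat = np.append(flat, np.arange(3))
# flat -> array([0., 1., 2., 0., 1., 2.])
```

Whichever of these two primitives backs the NWBData store determines which of the two result shapes you get.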

The core problem here is that instead of ETLing the data into a common internal representation, NWBData keeps whatever type it was given and dispatches further operations on that stored type. This effectively creates multiple implementations of the same operation with potentially divergent behavior.
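One possible shape for the "common internal representation" approach is sketched below. This is a hypothetical illustration, not the pynwb implementation: the class name and structure are invented here. The idea is to normalize whatever the caller passes into a plain Python list at construction time, so append and extend have exactly one code path regardless of the input type.

```python
import numpy as np

class NormalizedData:
    """Hypothetical sketch: normalize to one internal representation
    (a plain list), so append/extend behave identically no matter
    whether the caller supplied a list, tuple, or ndarray."""

    def __init__(self, name, data):
        self.name = name
        # ETL step: ndarray/tuple/list all become a plain list up front
        self.data = list(data)

    def append(self, value):
        # always nests: value becomes one new element
        self.data.append(value)

    def extend(self, values):
        # always concatenates element-wise
        self.data.extend(values)

d1 = NormalizedData('some_name', np.array([]))
d2 = NormalizedData('some_name', [])
for d in (d1, d2):
    d.append(np.arange(3))
# both stores now hold a single element, regardless of the original type
```

The trade-off, discussed below in this thread, is that normalizing an ndarray or tuple into a list (or vice versa) implies a copy.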

Another issue: NWBData is (to the best of my knowledge) the generic pynwb representation of an n-dimensional array. In this case it seems like append and extend require further definition. What ought I to get if I call append or extend on an Image, for instance?

bendichter commented 5 years ago

Thanks for this @NileGraddis. It is pretty unintuitive that initializing the data as either an np.ndarray or a list will change the behavior of pynwb.core.nwb_data.append. Just so we understand the urgency: are you able to work around this UI glitch? I think this is one of those situations where it could be a feature or a bug depending on how you look at it. The divergent behavior is a bit counter-intuitive, but could be useful in some cases. If we do want to make this uniform, I suppose we have two options:

1. Recast data as either a list or an np.ndarray
2. Create a NWBData.append method that respects these two different data types

I don't have a preference.

BTW, are you actually calling NWBData.append? If you are trying to store spike times, it might be more convenient to use nwbfile.add_unit(spike_times=).

oruebel commented 5 years ago
> 2. Create a `NWBData.append` method

As far as I can tell he is using the existing NWBData.append and NWBData.extend methods here. I have not looked closely at this, but at first glance this looks like a bug in NWBData.append.

> 1. Recast data as either a `list` or an `np.ndarray`

Currently we allow [np.ndarray, list, tuple, h5py.Dataset] as array data, which is really what causes this issue. Converting a list or tuple to numpy implicitly is problematic because it requires us to make copies of the data. If data must be copied from list to numpy, that should happen explicitly, i.e., the user should do the conversion themselves so they are aware of what is happening. The more appropriate solution at that point would be to simply disallow list/tuple as input and accept only [np.ndarray, h5py.Dataset] as types for input arrays.

This is ultimately a question of flexibility vs. consistency (allow lists for flexibility, or enforce use of numpy arrays for consistency). I'm not arguing for either approach right now; all I'm trying to say is that we should make this choice explicit rather than copying data under the hood, where the user does not directly see it and has limited control over the behavior.
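The copy concern can be seen with np.asarray: converting a list allocates a new buffer and copies the contents, while passing an existing ndarray through is free.

```python
import numpy as np

lst = [0, 1, 2]
arr = np.asarray(lst)     # new buffer: the list's contents are copied
arr[0] = 99
assert lst[0] == 0        # the original list is untouched

existing = np.arange(3)
view = np.asarray(existing)   # no copy: the same object comes back
assert view is existing
```

So an implicit list-to-numpy conversion inside NWBData would silently duplicate the user's data, which is exactly the hidden cost being argued against here.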