Added unit_columns items are saved as strings

luiztauffer commented 5 years ago

1) Bug

Added unit_columns items are saved as strings, when they're actually lists. In the code below, notice the printing values: the items should be lists, and are added correctly to the nwb object. But once saved, the become strings.

Steps to Reproduce

from pynwb import NWBFile, NWBHDF5IO
from datetime import datetime

# Make data source file
nwb = NWBFile('aa','aa', datetime.now().astimezone())
with pynwb.NWBHDF5IO('test.nwb', 'w') as io:
    nwb.add_unit()
    nwb.add_unit()
    nwb.add_unit()
    new_data = [['a'],['a','b'],['a','b','c']]
    nwb.add_unit_column(name='new_col', description='', data=new_data)
    print(nwb.units['new_col'][2])
    print(type(nwb.units['new_col'][2]))
    io.write(nwb)

io = pynwb.NWBHDF5IO('test.nwb', 'r') 
nwb = io.read()
print(nwb.units['new_col'][2])
print(type(nwb.units['new_col'][2]))
io.close()

This prints:

['a', 'b', 'c']
<class 'list'>
['a', 'b', 'c']
<class 'str'>

Checklist

[x] Have you ensured the feature or change was not already reported ?
[x] Have you included a brief and descriptive title?
[x] Have you included a clear description of the problem you are trying to solve?
[x] Have you included a minimal code snippet that reproduces the issue you are encountering?
[x] Have you checked our Contributing document?

bendichter commented 5 years ago

@ajtritt do you think you could help with this one?

ajtritt commented 5 years ago

At this point, I think the only solution to getting the behavior that you are expecting would be for us to inspect the data you pass in under the hood, and then make an assumption that the ragged nature of the passed in data means you want to index it. Not that I'm necessarily opposed to that, just mentioning what I think it would take. If we want to implement that feature, we should probably solicit feedback or hash this over on a Feature Request issue.

Fortunately for you @luiztauffer, there is a relatively painless work around. Try this:

from pynwb import NWBFile, NWBHDF5IO
from datetime import datetime

# Make data source file
nwb = NWBFile('aa','aa', datetime.now().astimezone())
with NWBHDF5IO('test.nwb', 'w') as io:
    nwb.add_unit()
    nwb.add_unit()
    nwb.add_unit()
################################
    # concatenate in a single list
    new_data = ['a','a','b','a','b','c']         
    # specify the index for the concatenated data
    nwb.add_unit_column(name='new_col', description='', data=new_data, index=[1, 3, 6])     
################################
    print(nwb.units['new_col'][2])
    print(type(nwb.units['new_col'][2]))
    io.write(nwb)

io = NWBHDF5IO('test.nwb', 'r')
nwb = io.read()

print(nwb.units['new_col'][2])
print(type(nwb.units['new_col'][2]))
io.close()

Should yield this:

['a', 'b', 'c']
<class 'list'>
['a' 'b' 'c']
<class 'numpy.ndarray'>

Let me know if that works

luiztauffer commented 5 years ago

thanks @ajtritt it works with that work around!

bendichter commented 5 years ago

@ajtriit thanks for the workaround! That's certainly lifts the immediate block.

I think ultimately though it is a bug for a list of strings to be automatically turned into a string of that list. Inspecting to see if an array is ragged and treating it accordingly seems like a good idea to me.

ajtritt commented 4 years ago

I think ultimately though it is a bug for a list of strings to be automatically turned into a string of that list.

@bendichter Yeah, this lies somewhere in the gray area between bug and undocumented gotcha. I think the original code only works because of how h5py handles strings. If you change the data to integers, you get a broadcast error.

Inspecting to see if an array is ragged and treating it accordingly seems like a good idea to me.

Full inspection is a costly operation, and I would like to avoid doing that every time only to support this edge case. If a user wants to pass in data like this, they will at least need to specify that it is irregularly shaped. Under the hood, the data can be unraveled and an index can be created. Something like this:

    new_data = [['a'],['a','b'],['a','b','c']]
    nwb.add_unit_column(name='new_col', description='', data=new_data, index=True)

luiztauffer commented 4 years ago

I think that ragged data will not be uncommon, the case in point is unit spike features (attributing a value for each spike from every unit), units will most of the times have different spike counts. I think that indexing as you suggested is quite straightforward for the user to deal with, but perhaps those particular cases should be better described at add_unit_column doc

oruebel commented 4 years ago

Full inspection is a costly operation, and I would like to avoid doing that every time only to support this edge case.

I agree. I think requiring that a user set index=True when adding a ragged column is fine.

perhaps those particular cases should be better described at add_unit_column doc

Agreed

NeurodataWithoutBorders / pynwb