auto-pi-lot / autopilot

Distributed behavioral experiments
https://docs.auto-pi-lot.com
Mozilla Public License 2.0

Improve handling of ContinuousData, such as strings #172

Open Rodgers-PAC-Lab opened 2 years ago

Rodgers-PAC-Lab commented 2 years ago

This is a PR to fix (I think) the way ContinuousData is handled.

I created an example Task called ex_ContData to illustrate the issue. This task returns both TrialData and ContinuousData.

When I first ran it, I got the following errors:

Error 1
type object 'ContinuousData' has no attribute 'keys'
Traceback (most recent call last):
  File "/home/mouse/dev/autopilot/autopilot/core/gui.py", line 234, in create_subject
    subject_obj.assign_protocol(protocol_file, int(protocol_vals['step']))
  File "/home/mouse/dev/autopilot/autopilot/core/subject.py", line 548, in assign_protocol
    data_names = tuple(task_class.ContinuousData.keys())
AttributeError: type object 'ContinuousData' has no attribute 'keys'

Error 2
Traceback (most recent call last):
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.7/site-packages/tables/attributeset.py", line 546, in __getitem__
    return self.__getattr__(name)
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.7/site-packages/tables/attributeset.py", line 283, in __getattr__
    raise AttributeError(f"Attribute {name!r} does not exist "
AttributeError: Attribute 'data' does not exist in node: '/data/ex_ContData/S00_ex_ContData/continuous_data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mouse/dev/autopilot/autopilot/core/subject.py", line 817, in data_thread
    cont_data = continuous_group._v_attrs['data']
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.7/site-packages/tables/attributeset.py", line 551, in __getitem__
    % (name, self._v__nodepath))
KeyError: "Attribute ('data') does not exist in node '/data/ex_ContData/S00_ex_ContData/continuous_data'"

I believe Error 1 comes from a simple mistake: task_class.ContinuousData.keys() does not work; it should be task_class.ContinuousData.columns.keys(). This is fixed by 920f3ed6bd285db07086fb019d257788464dd906
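In sketch form (assuming ContinuousData is declared as a tables.IsDescription subclass, like TrialData, so its column definitions live in .columns rather than on the class itself):

# broken: IsDescription subclasses have no .keys() method
data_names = tuple(task_class.ContinuousData.keys())
# fixed (920f3ed): the column definitions live under .columns
data_names = tuple(task_class.ContinuousData.columns.keys())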

After this, I got a new error:

Error 3
ERROR:core.subject.Subject.exContData2:exception in data thread: unknown type: 'str416'
Traceback (most recent call last):
  File "/home/mouse/dev/autopilot/autopilot/core/subject.py", line 843, in data_thread
    col_atom = tables.Atom.from_type(v.dtype.name, v.shape)
AttributeError: 'str' object has no attribute 'dtype'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mouse/dev/autopilot/autopilot/core/subject.py", line 846, in data_thread
    col_atom = tables.Atom.from_type(temp_array.dtype.name, temp_array.shape)
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.7/site-packages/tables/atom.py", line 395, in from_type
    raise ValueError(f"unknown type: {type!r}")
ValueError: unknown type: 'str416'

Basically, when creating the ContinuousData group, the code fails to detect the datatype of strings (when strings are returned as ContinuousData), because Python strings have no dtype attribute. And because the whole block is wrapped in a try/except, it fails silently and doesn't store any continuous data at all.

This also interacts super weirdly with what I think is a bug in pytables: tables.Atom.from_type simply does not work on string types, as far as I can tell. You can see this even in their own docstrings, which contain an error message; apparently the docstrings are generated by inserting the result of running the example code, and in this case that result is an error. https://www.pytables.org/_modules/tables/atom.html#Atom.from_type

The only workaround I could find is to detect whether the returned data is a string and, if so, use tables.StringAtom instead. The fix is in dbc585df2b53b179b5e9af0cf1ccf80629c80788
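The idea, as a minimal sketch (hypothetical helper name; the real change is in that commit):

import numpy as np
import tables

def atom_for_value(v):
    # str has no .dtype, and Atom.from_type() doesn't understand numpy's
    # 'strNNN' type names, so handle strings explicitly
    if isinstance(v, str):
        # caveat: sized from this particular string; see below
        return tables.StringAtom(itemsize=len(v))
    temp_array = np.array(v)
    return tables.Atom.from_type(temp_array.dtype.name, temp_array.shape)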

This actually works. I can now see ContinuousData!!

In [89]: fi.root.data['ex_ContData']['S00_ex_ContData']['continuous_data']['session_1']['string_data'][:]
Out[89]: 
array([(b'a long string', b'2022-03-31T14:27:34.048789'),
       (b'a long string', b'2022-03-31T14:27:43.087934')],
      dtype=[('string_data', 'S13'), ('timestamp', 'S26')])

In [90]: fi.root.data['ex_ContData']['S00_ex_ContData']['continuous_data']['session_1']['int_data'][:]
Out[90]: 
array([(3, b'2022-03-31T14:27:34.048789'),
       (3, b'2022-03-31T14:27:43.087934')],
      dtype=[('int_data', '<i8'), ('timestamp', 'S26')])     

A remaining issue -- I don't know how to get the intended length of the string from ContinuousData column. Instead, I just use the length of the string that is passed -- but this will fail if a subsequent string is longer. Probably this is a simple fix. In general, we should probably be creating these Atom using the ContinuousData description itself, not the datatype of the provided data.
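One possible fix, sketched under the assumption that the column is declared as a tables.StringCol in the ContinuousData description (the 'string_data' column name is taken from the example task above; task_class is the surrounding context's task class):

import tables

# size the atom from the declared column, not from the first value we see
col = task_class.ContinuousData.columns['string_data']  # e.g. a StringCol
col_atom = tables.StringAtom(itemsize=col.itemsize)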

Also, just some questions about how ContinuousData is meant to work (quoted in the replies below).

Rodgers-PAC-Lab commented 2 years ago

Oh, also is there a reason we're leaving strings as bytestrings (b"asdf") instead of decoding them to utf-8 ("asdf")?

sneakers-the-rat commented 2 years ago

related to some of what i was writing about here last night: https://github.com/wehr-lab/autopilot/issues/67

basically the subject class is in the middle of a complete overhaul here: https://github.com/wehr-lab/autopilot/tree/detour-datamodel

that i'm trying to finish today or tomorrow. I think it might be a good idea to revisit this after that gets merged, since changing the way data is declared in general is the goal and things will look pretty different internally (though the TrialData/ContinuousData syntax will be backwards compatible). Wrote a bit more about the details in that issue^^

In the meantime:

Basically, when creating the ContinuousData group, it fails to properly detect the datatype of strings (when strings are returned as ContinuousData). This is because strings do not have a dtype field.

I think that a good way to do this is to have a mapping between python types and tables types, so rather than relying on data.dtype we do type(data) and map from there. This is the approach i'm taking in the current rewrite, at least. That should be a fallback behavior, though -- explicit declarations should be favored, as you also mention, but ya.

The only way I could figure around this is to detect if the returned data is a string, and then use tables.StringAtom instead.

basically this^ but for all types ya.
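Something like this, roughly (a hypothetical sketch of that mapping, not the rewrite's actual code):

import tables

# fallback mapping from python types to tables atoms, used only when
# there's no explicit declaration to go on
PYTHON_TO_ATOM = {
    bool: tables.BoolAtom(),
    int: tables.Int64Atom(),
    float: tables.Float64Atom(),
}

def fallback_atom(value):
    if isinstance(value, str):
        return tables.StringAtom(itemsize=len(value))
    return PYTHON_TO_ATOM[type(value)]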

Because this whole thing is wrapped in a try/except, it simply passes silently, and doesn't store any continuous data.

In general no data should be dropped silently, totally agree this is a bug.

A remaining issue -- I don't know how to get the intended length of the string from ContinuousData column. Instead, I just use the length of the string that is passed -- but this will fail if a subsequent string is longer. Probably this is a simple fix. In general, we should probably be creating these Atom using the ContinuousData description itself, not the datatype of the provided data.

ContinuousData is a bit different than TrialData because you can't assume that each stream of continuous data will be sampled at the same rate, so each is written to its own array (rather than a table). In this case, we should use a VLUnicodeAtom, which lets you save variable-length strings, but those have some interesting caveats, like only being able to hold one of them per row. tables/hdf5 lets you natively encode timestamps for every row, so it should just be a bit of wrapping that functionality to be able to have timestamped continuous data for variable-length strings. Otherwise we can serialize json.dumps({'timestamp': datetime.now().isoformat(), 'value': data_string}) or something.
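For example, a minimal sketch of the VLUnicodeAtom approach (not autopilot's actual code), keeping timestamps in a parallel array since a VLUnicodeAtom row only holds the one string:

import datetime
import tables

h5 = tables.open_file('continuous.h5', 'w')
# one variable-length unicode string per row
strings = h5.create_vlarray(h5.root, 'string_data', tables.VLUnicodeAtom())
# parallel enlargeable array of isoformat timestamps (26 bytes each)
stamps = h5.create_earray(h5.root, 'timestamps',
                          tables.StringAtom(itemsize=26), shape=(0,))

strings.append('a long string')
stamps.append([datetime.datetime.now().isoformat().encode('ascii')])
h5.close()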

Is it meant to be possible to return a numpy.array as a form of ContinuousData? I don't see how it'd be feasible to serialize this over a networking message.

Serializing arrays is supported! When the message is serialized, it uses a _serialize_numpy method as its default (aka the method that's called when json doesn't know how to serialize an object): https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/networking/message.py#L265

which uses blosc to compress the array and then b64encode it: https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/networking/message.py#L167
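Roughly, the pattern looks like this (a simplified sketch of those two linked methods, not the exact code):

import base64
import blosc
import numpy as np

def serialize_array(arr: np.ndarray) -> str:
    # compress with blosc, then b64encode so it can ride inside a json string
    return base64.b64encode(blosc.pack_array(arr)).decode('ascii')

def deserialize_array(payload: str) -> np.ndarray:
    return blosc.unpack_array(base64.b64decode(payload))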

compression before send used to be faster on the raspi 3 because its network card was very limited, but we should do some perf testing to see if that's still the case (or at least make it optional / detect which is faster depending on the arrays being sent).

I also want to switch from using json to using something like msgpack because python builtin json is notoriously slow... but one thing at a time.

This would also imply that the ContinuousData could contain vlarray instead of simple static datatypes. I'm just wondering because the code as written seems to assume that the returned data might already be a numpy array.

Yeah, there's too much assumption about typing all the way through the data handling -- that's one of the major reasons i'm rewriting it to be more strictly and transparently typed.

Is it meant to be the case that if multiple variables of ContinuousData are returned in a single message, they are all stored in separate recarrays in the ContinuousData group? (As in the example above with "string_data" and "int_data".) I would have thought they might be stored in a single table with multiple columns. (As if the example above produced one table with columns "string_data", "int_data", and "timestamp".) So kind of like TrialData, but with each message as a row instead of each trial as a row.

This is sort of a tricky question: in my opinion all of this should be made abstract behind the Subject class, so the literal arrangement of data in the hdf5 file shouldn't really matter all that much, as long as APIs to declare, save, and load data are exposed. As it exists currently, the declaration of both trial and continuous data is flat -- effectively a list of variables and types -- but with the shift to using more explicit data models it should be possible to make recursive models like this:

# illustrative sketch -- Data and ContinuousDataModel come from the
# in-progress data model rewrite on the detour-datamodel branch
class Tetrode(Data):
    timestamp: datetime
    channel1: float
    channel2: float
    channel3: float
    channel4: float

class LickSensor(Data):
    timestamp: datetime
    state: bool

class ContinuousData(ContinuousDataModel):
    tetrode: Tetrode
    lick: LickSensor
    accelerometer: float

where the Tetrode data would be stored as multiple columns in a single array, but the accelerometer would be on its own, etc.

To me, storing several arrays with identical timestamps, each holding a single data stream, is a bit more flexible than trying to pack multiple streams of data into a single table, but i think it should be an option when the streams are truly coupled.

sneakers-the-rat commented 2 years ago

Oh, also is there a reason we're leaving strings as bytestrings (b"asdf") instead of decoding them to utf-8 ("asdf")?

this a pytables/hdf5 thing: https://www.pytables.org/MIGRATING_TO_3.x.html

Unicode all the strings!

In Python 3, all strings are natively in Unicode. This introduces some difficulties, as the native HDF5 string format is not Unicode-compatible. To minimize explicit conversion troubles when writing, especially when creating data sets from existing Python objects, string objects are implicitly cast to non-Unicode for HDF5 storage. To make you aware of this, a warning is raised when this happens.

This is certainly no true Unicode compatibility, but mainly for convenience with the pure-Unicode Python 3 string type. Any string that is not castable as ascii upon creation of your data set, will hence still raise an error. For true Unicode support, look into the VLUnicodeAtom class.

which i'm handling like this, as pandas has a pretty fast vectorized bytestring encoding method

https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/data/subject.py#L376-L379
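i.e. something along these lines (a guess at the shape of those linked lines, not verbatim):

import pandas as pd

df = pd.DataFrame({'string_data': ['a long string'],
                   'timestamp': ['2022-03-31T14:27:34.048789']})
# vectorized encode of string columns to bytestrings before writing to hdf5
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.encode('utf-8')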

sneakers-the-rat commented 2 years ago

sorry for the delay, but I really really really need to turn to writing my dissertation so I won't be able to get to this for a bit. :\ i want to pull in the changes to subject as v0.5.0 before resubmitting the autopilot manuscript though so it won't be indefinite.