Rodgers-PAC-Lab opened 2 years ago
> Oh, also is there a reason we're leaving strings as bytestrings (`b"asdf"`) instead of decoding them to utf-8 (`"asdf"`)?
related to some of what i was writing about here last night: https://github.com/wehr-lab/autopilot/issues/67
basically the subject class is in the middle of a complete overhaul here: https://github.com/wehr-lab/autopilot/tree/detour-datamodel
that i'm trying to finish today or tomorrow. I think it might be a good idea to revisit this after that gets merged, as changing the way that data is declared in general is the goal and things will look pretty different internally (though the `TrialData`/`ContinuousData` syntax will be backwards compatible). Wrote a bit more about the details in that issue^^
In the meantime:
> Basically, when creating the ContinuousData group, it fails to properly detect the datatype of strings (when strings are returned as ContinuousData). This is because strings do not have a `dtype` field.
I think that a good way to do this is to have a mapping between python types and tables types, so rather than relying on `data.dtype` we do `type(data)` and map from there. This is the approach i'm taking in the current rewrite at least. That should be a fallback behavior, instead favoring explicit declarations, as you also mention, but ya.
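Something like this, as a minimal sketch of that fallback (the names and atom choices here are illustrative, not Autopilot's actual API):

```python
import tables

# illustrative mapping from python types to tables Atoms; an explicit
# declaration in the data model should always take precedence over this
PYTHON_TO_TABLES = {
    int: tables.Int64Atom(),
    float: tables.Float64Atom(),
    bool: tables.BoolAtom(),
    str: tables.StringAtom(itemsize=128),  # fixed max length, see below
}

def atom_for(value):
    """Fallback: pick a tables.Atom from type(value) rather than value.dtype."""
    if type(value) in PYTHON_TO_TABLES:
        return PYTHON_TO_TABLES[type(value)]
    # numpy arrays and scalars do carry a dtype, so use it directly
    return tables.Atom.from_dtype(value.dtype)
```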
> The only way I could figure around this is to detect if the returned data is a string, and then use `tables.StringAtom` instead.
basically this^ but for all types ya.
> Because this whole thing is wrapped in a try/except, it simply passes silently, and doesn't store any continuous data.
In general no data should be dropped silently, totally agree this is a bug.
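e.g. even before the rewrite, the bare except could at least log the failure instead of passing; a hypothetical sketch of the pattern, not the current code:

```python
import logging
logger = logging.getLogger(__name__)

try:
    store_continuous_data(data)  # hypothetical stand-in for the storage code
except Exception:
    # never pass silently: keep the traceback and make the loss visible
    logger.exception('failed to store continuous data -- data was dropped!')
    raise
```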
> A remaining issue -- I don't know how to get the intended length of the string from the ContinuousData column. Instead, I just use the length of the string that is passed -- but this will fail if a subsequent string is longer. Probably this is a simple fix. In general, we should probably be creating these Atoms using the ContinuousData description itself, not the datatype of the provided data.
`ContinuousData` is a bit different than `TrialData` because you can't assume that each stream of continuous data will be sampled at the same rate, so each is written to its own array (rather than a table). In this case, we should use a `VLUnicodeAtom`, which lets you save variable-length strings, but they have some interesting caveats, like only being able to have one of them per row. tables/hdf5 lets you natively encode timestamps for every row, so it should just be a bit of wrapping that functionality to be able to have timestamped continuous data for variable-length strings. Otherwise we can serialize `json.dumps({'timestamp': datetime.now().isoformat(), 'value': data_string})` or something.
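A rough sketch of both options with plain pytables (the file layout and names here are made up for illustration):

```python
import json
from datetime import datetime
import tables

h5f = tables.open_file('continuous.h5', mode='w')

# a VLArray of VLUnicodeAtom stores one variable-length unicode string per row
strings = h5f.create_vlarray(h5f.root, 'string_stream', tables.VLUnicodeAtom())
strings.append('a variable-length string')

# or the json fallback: bundle the timestamp with the value in a single row
strings.append(json.dumps({'timestamp': datetime.now().isoformat(),
                           'value': 'another string'}))
h5f.close()
```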
> Is it meant to be possible to return a numpy.array as a form of ContinuousData? I don't see how it'd be feasible to serialize this over a networking message.
Serializing arrays is supported! when the message is serialized it uses a `_serialize_numpy` method as its default (aka the method that's called when json doesn't know how to serialize): https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/networking/message.py#L265 which uses `blosc` to compress the array and then `b64encode` it: https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/networking/message.py#L167
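The rough shape of that approach, paraphrased from the linked code rather than copied verbatim (so treat the details as approximate):

```python
import base64
import blosc
import numpy as np

def serialize_array(arr: np.ndarray) -> str:
    # compress the raw buffer, then base64 it so it fits in a json string
    compressed = blosc.compress(arr.tobytes(), typesize=arr.dtype.itemsize)
    return base64.b64encode(compressed).decode('ascii')

def deserialize_array(s: str, dtype, shape) -> np.ndarray:
    # the receiver needs dtype and shape sent alongside the payload
    raw = blosc.decompress(base64.b64decode(s))
    return np.frombuffer(raw, dtype=dtype).reshape(shape)
```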
compression before send used to be faster on the raspi3 because its network card was very limited, but we should do some perf testing to see if that's still the case (or at least make it optional / detect which is faster depending on the arrays being sent).
I also want to switch from using json to using something like msgpack because python builtin json is notoriously slow... but one thing at a time.
> This would also imply that the ContinuousData could contain `vlarray` instead of simple static datatypes. I'm just wondering because the code as written seems to assume that the returned data might already be a numpy array.
Yeah there's too much assumption about the typing all the way through data handling, one of the major reasons i'm rewriting it to be more strictly and transparently typed.
> Is it meant to be the case that if multiple variables of ContinuousData are returned in a single message, they are all stored in separate recarrays in the ContinuousData group? (As in the example above with "string_data" and "int_data".) I would have thought they might be stored in a single table with multiple columns. (As if the example above produced one table with columns "string_data", "int_data", and "timestamp".) So kind of like TrialData, but with each message as a row instead of each trial as a row.
This is sort of a tricky question: in my opinion all of this should be made abstract behind the Subject class, so the literal arrangement of data in the hdf5 file shouldn't really matter all that much, as long as an API to declare, save, and load data is exposed. As exists currently, the declaration of both trial and continuous data is flat, effectively a list of variables and types, but with the shift to using more explicit data models it should be possible to make recursive models like this:
```python
class Tetrode(Data):
    timestamp: datetime
    channel1: float
    channel2: float
    channel3: float
    channel4: float

class LickSensor(Data):
    timestamp: datetime
    state: bool

class ContinuousData(ContinuousDataModel):
    tetrode: Tetrode
    lick: LickSensor
    accelerometer: float
```
where the `Tetrode` data would be stored as multiple columns in a single array, but the `accelerometer` would be on its own, etc.
To me, storing several arrays with identical timestamps, each having a single data stream, is a bit more flexible than trying to pack multiple streams of data into a particular table, but i think it should be an option if the streams are truly coupled.
> Oh, also is there a reason we're leaving strings as bytestrings (`b"asdf"`) instead of decoding them to utf-8 (`"asdf"`)?
this is a pytables/hdf5 thing: https://www.pytables.org/MIGRATING_TO_3.x.html
> **Unicode all the strings!**
>
> In Python 3, all strings are natively in Unicode. This introduces some difficulties, as the native HDF5 string format is not Unicode-compatible. To minimize explicit conversion troubles when writing, especially when creating data sets from existing Python objects, string objects are implicitly cast to non-Unicode for HDF5 storage. To make you aware of this, a warning is raised when this happens.
>
> This is certainly no true Unicode compatibility, but mainly for convenience with the pure-Unicode Python 3 string type. Any string that is not castable as ascii upon creation of your data set, will hence still raise an error. For true Unicode support, look into the VLUnicodeAtom class.
which i'm handling like this, as pandas has a pretty fast vectorized bytestring encoding method
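for reference, the pandas conversion presumably looks something like this (my example, not the actual commit):

```python
import pandas as pd

bytestrings = pd.Series([b'asdf', b'hjkl'])
decoded = bytestrings.str.decode('utf-8')  # -> 'asdf', 'hjkl'
reencoded = decoded.str.encode('utf-8')    # -> back to b'asdf', b'hjkl'
```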
sorry for the delay, but I really really really need to turn to writing my dissertation so I won't be able to get to this for a bit. :\ i want to pull in the changes to subject as v0.5.0 before resubmitting the autopilot manuscript though so it won't be indefinite.
This is a PR to fix (I think) the way ContinuousData is handled.
I created an example Task called ex_ContData to illustrate the issue. This task returns both TrialData and ContinuousData.
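(Roughly, the declarations in question have this shape; an approximation of `ex_ContData` in the usual pytables-description style, not the actual file:)

```python
import tables

class TrialData(tables.IsDescription):
    # one row per trial; column names are illustrative
    trial_num = tables.Int32Col()

class ContinuousData(tables.IsDescription):
    # returned per message, as in the "string_data"/"int_data" example
    string_data = tables.StringCol(64)
    int_data = tables.Int64Col()
```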
When I first ran it, I got the following error, which I believe is a simple syntax error: `task_class.ContinuousData.keys()` does not work; it should be `task_class.ContinuousData.columns.keys()`. This is fixed by 920f3ed6bd285db07086fb019d257788464dd906.

After this, I got a new error:
Basically, when creating the ContinuousData group, it fails to properly detect the datatype of strings (when strings are returned as ContinuousData). This is because strings do not have a `dtype` field. Because this whole thing is wrapped in a try/except, it simply passes silently, and doesn't store any continuous data.

This also interacts super weirdly with what I think is a bug in pytables. `tables.Atom.from_type` simply does not work on strings, as far as I can tell. You can see this even in their own docstrings, which contain an error; I guess they are using some code that inserts the result of running itself into the docstrings (which in this case is an error). https://www.pytables.org/_modules/tables/atom.html#Atom.from_type
The only way I could figure around this is to detect if the returned data is a string, and then use `tables.StringAtom` instead. The fix is in dbc585df2b53b179b5e9af0cf1ccf80629c80788.
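The workaround is approximately this (a hedged sketch; see the commit for the actual change):

```python
import tables

def atom_from_returned(data):
    if isinstance(data, str):
        # strings have no .dtype and tables.Atom.from_type chokes on them,
        # so build a StringAtom sized from this sample (caveat below)
        return tables.StringAtom(itemsize=len(data))
    return tables.Atom.from_dtype(data.dtype)
```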
This actually works. I can now see ContinuousData!!
A remaining issue -- I don't know how to get the intended length of the string from the ContinuousData column. Instead, I just use the length of the string that is passed -- but this will fail if a subsequent string is longer. Probably this is a simple fix. In general, we should probably be creating these Atoms using the ContinuousData description itself, not the datatype of the provided data.
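For example, something like this might work (a guess, assuming ContinuousData is a `tables.IsDescription` whose columns carry their declared sizes):

```python
# a StringCol is itself an Atom subclass that knows its declared itemsize,
# so the atom can come from the declaration rather than the first sample
col = task_class.ContinuousData.columns['string_data']  # hypothetical column name
atom = tables.Atom.from_dtype(col.dtype)  # preserves the declared max length
```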
Also, just some questions about how ContinuousData is meant to work: