auto-pi-lot / autopilot

Distributed behavioral experiments
https://docs.auto-pi-lot.com
Mozilla Public License 2.0

reassign_protocol should check for changes to data description #67

Open mikewehr opened 3 years ago

mikewehr commented 3 years ago

If you change the TrialData format (e.g. by adding new columns) and then try to rebuild the data table by reassigning to the same protocol with the same name, reassign_protocol sees that the table already exists and silently doesn't do anything. Workaround for now is to rename the protocol to something like 'gap-laser2.json', but we should add logic that checks if additional columns need to be added here: https://github.com/wehr-lab/autopilot/blob/043fb6cf7c8642d4169be0823c773b358b2c4984/autopilot/core/subject.py#L538

slack thread: so it stashes the protocol description, i.e. the contents of that json file, and marks that down, but it doesn't move and rename the data. Typically I was thinking that if you assigned a protocol it would have a different name; I didn't account for changes in the data description that would require the table to be remade. I can instead add logic that checks if additional columns need to be added where there is that exception catch I linked above
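The check described in that thread can be sketched roughly like this: compare the columns declared by the new TrialData description against the columns of the existing HDF5 table, and surface anything that would otherwise be silently dropped. This is a minimal sketch assuming column descriptions are available as name-to-type dicts; none of these names are Autopilot's actual API:

```python
def missing_columns(existing_coldescrs: dict, new_coldescrs: dict) -> list:
    """Columns declared in the new description but absent from the
    existing table: exactly the data that would be silently dropped."""
    return sorted(set(new_coldescrs) - set(existing_coldescrs))

# e.g. the protocol gained a 'laser_power' column since the table was built
existing = {'trial_num': 'Int32Col', 'response': 'StringCol'}
new_desc = dict(existing, laser_power='Float64Col')

print(missing_columns(existing, new_desc))  # → ['laser_power']
```

A check like this could run at the exception catch linked above, before concluding that the existing table is up to date.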

Rodgers-PAC-Lab commented 2 years ago

Agree this is an issue, also discussed here: https://github.com/wehr-lab/autopilot/issues/148#issuecomment-1034394737

I'm not getting a NodeError, but rather a silent dropping of data that has been added to TrialData but not to the columns of HDF5. This commit logs the dropping: https://github.com/Rodgers-PAC-Lab/autopilot/commit/7bdd0e0b524da3f0ac52d589131ec79df81238b3

Will think about how to add a column on-the-fly (or possibly backup the old one, create a new one, and copy all the old data in).
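The backup-and-copy idea can be sketched without pytables specifics: archive the old rows, then rewrite them under the widened column set with a fill value for columns that didn't exist yet. With real pytables this would mean creating a new table from the expanded description, copying rows across, and keeping the old table under an archive name; the names and fill policy below are illustrative only:

```python
def expand_rows(rows, new_cols, fill=None):
    """Return (archived copy, rows rewritten under the expanded columns)."""
    archived = [dict(r) for r in rows]                           # backup of old data
    widened = [{c: r.get(c, fill) for c in new_cols} for r in rows]
    return archived, widened

rows = [{'trial_num': 0, 'response': 'L'},
        {'trial_num': 1, 'response': 'R'}]
backup, widened = expand_rows(rows, ['trial_num', 'response', 'laser_power'])
print(widened[0])  # → {'trial_num': 0, 'response': 'L', 'laser_power': None}
```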

sneakers-the-rat commented 2 years ago

working on this here: https://github.com/wehr-lab/autopilot/tree/detour-datamodel

specifically: https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/data/interfaces/tables.py#L131-L183

conflicting requirements: we want explicit data models, but we don't want to drop data. If we expand the table so as not to drop data, then we have violated our data model. We need a way to update the data model when this happens, but it's specified statically in the TrialData class attr. I haven't arrived at a solution that satisfies me, but I agree it's a problem.

sneakers-the-rat commented 2 years ago

Thinking more about the relationship between data models and storage, I feel like it's worth writing down more of what I've been thinking lately. I've been trying to generalize the representation of data so that it's possible to have multiple representations that map between each other, i.e. so it's possible to save data into an hdf5 table of arbitrary structure but also convert it into Neurodata Without Borders and send it to your datajoint models. So I've been working on this in the short term with an eye to the long term: start adapting the data model now, and leave breadcrumbs toward declaring data models with the plugin system and a data modeling system that lets people write those mappings themselves.

So for the short-term:

I have written bidirectional translation functions from pytables descriptions to a Table class derived from pydantic models:

pydantic -> pytables:

https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/data/interfaces/tables.py#L242-L260

pytables -> pydantic:

https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/data/interfaces/tables.py#L281-L308

that are currently lossy, and I'm not sure of a good way to overcome that because columns can't have metadata; I would need to write some metadata spec at the table level, which shouldn't be too hard but just needs to be done. I'm basing it on mappings between Python types and hdf5 types, so the lossiness comes from needing to e.g. translate between datetime objects and isoformatted strings: there isn't a good way to infer that something is a datetime from a string without just trying a ton of possible conversions and choosing the most specific one.
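The datetime case can be made concrete with a minimal standard-library sketch of the lossy round trip: storage flattens a datetime to an ISO string, and recovery has to guess by attempting the more specific conversion first. The function names here are invented, not the ones in tables.py:

```python
from datetime import datetime

def to_storage(value):
    """Flatten values to hdf5-friendly types; datetimes become ISO strings."""
    if isinstance(value, datetime):
        return value.isoformat()      # the type information is lost here
    return value

def from_storage(value):
    """Guess the original type: try the most specific conversion first."""
    if isinstance(value, str):
        try:
            return datetime.fromisoformat(value)
        except ValueError:
            return value              # not parseable as a date: keep the string
    return value

stamp = datetime(2022, 5, 1, 12, 30)
assert from_storage(to_storage(stamp)) == stamp     # recovered, but only by guessing
assert from_storage('left_port') == 'left_port'     # ordinary strings pass through
```

The failure mode is a plain string that happens to parse as a date, which is why table-level metadata would be the real fix.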

I've also started thinking about making specific coercion hierarchies, eg. https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/gui/widgets/model.py#L30-L34

so that it would be possible to resolve ambiguities when multiple conversion functions would work; that might be an alternative to "try everything".
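A coercion hierarchy like the one linked might look like an ordered list of converters, tried from most to least specific, so ties are broken deterministically instead of by "try everything". This is a sketch; the actual hierarchy in model.py differs:

```python
from datetime import datetime

# ordered from most to least specific; first successful conversion wins
COERCIONS = [
    ('datetime', datetime.fromisoformat),
    ('int', int),
    ('float', float),
    ('str', str),                      # always succeeds: the last resort
]

def coerce(value: str):
    """Return (type name, converted value) using the most specific converter."""
    for name, convert in COERCIONS:
        try:
            return name, convert(value)
        except (ValueError, TypeError):
            continue

print(coerce('2022-05-01T12:30:00')[0])  # → datetime
print(coerce('3')[0])                    # → int, although float('3') would also work
```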

Those get used in a few places, one that I was working on today was the data export method get_trial_data: https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/data/subject.py#L913

that tries to use a description of the protocol hdf5 structure:

https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/data/subject.py#L932-L933

that generates the structure from the task class a lot more explicitly than before:

https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/data/models/protocol.py#L132

but it's sort of a half-measure, because I'm trying to limit the scope of the changes since I need to write my dissertation, like, now. So the protocol generation stuff also needs to be changed to use this more formal data modeling system, but it's already paying dividends on the other side, where it's very easy to make a GUI to control these models:

https://github.com/wehr-lab/autopilot/blob/detour-datamodel/autopilot/gui/widgets/model.py

so there are some awkward workarounds. Most relevant here: if the data model can't be found (e.g. the plugin isn't present on whatever system you're trying to export from; in the future we'll store the plugin code in the file as well, or else be able to resolve a hash, though that has a lot of, uh, security implications, so holding off on that for now), we try to recover it from the column descriptor in the table:

https://github.com/wehr-lab/autopilot/blob/d1a0b74a8e987f98756ce5fd2e2b605dd72ebbf7/autopilot/data/subject.py#L973-L982

so that we can reconstruct our abstract models from just the representation already in the hdf5 table. It's challenging because the goal is to get to a point where we can have verifiable, strictly-typed schemas that are extensible and flexible without turning into unusable mud. The constraint space is hard, but for now I feel like giving a hard warning along the lines of: "hey, you're trying to save data that isn't in your model; you need to change your code in exactly this way to make things work right with predictable typing, because loose typing ultimately hurts you when it's time to share the data and your dates all come out as bytestrings or whatever; but we'll make sure it's saved for now."
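The recovery path can be illustrated with the standard library alone: rebuild a stand-in model class from the column descriptors stored in the table itself. Here a dataclass stands in for the pydantic-based Table models, and both the type map and the descriptor dict are invented for the example:

```python
from dataclasses import make_dataclass

# illustrative map from pytables-style column kinds to Python types
HDF5_TO_PY = {'Int32Col': int, 'Float64Col': float, 'StringCol': str}

def model_from_coldescrs(name: str, coldescrs: dict):
    """Build a model class whose fields mirror the table's columns."""
    fields = [(col, HDF5_TO_PY[kind]) for col, kind in coldescrs.items()]
    return make_dataclass(name, fields)

RecoveredTrialData = model_from_coldescrs(
    'RecoveredTrialData', {'trial_num': 'Int32Col', 'response': 'StringCol'})

row = RecoveredTrialData(trial_num=0, response='L')
print(row)  # → RecoveredTrialData(trial_num=0, response='L')
```

A model recovered this way carries only the storage types, not any metadata the original class declared, which is the lossiness discussed above.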

cxrodgers commented 2 years ago

I don't have a ton to say in response to that, you've clearly thought about it a lot already. The one thing I will contribute is that I am pretty sure I'll always value some flexibility to change the table structure halfway through an experiment. Sure, in a perfect world we'd plan out a mouse's entire training structure from before it starts, but in practice I always find myself throwing in a new trial condition, which means a new column in the hdf5 table (or whatever). Right now my workaround has been something like Mike's "gap-laser2.json", where I basically increment the version number of the mouse, and go from there. Which isn't necessarily a bad solution, at least for now.

The way I dealt with this in the pre-Autopilot world was just dumping logfiles and totally untyped "results.json" files from every session. So the code running any given session was oblivious to the animal's training history, except insofar as that affected the parameters for that session. And then I would deal with stitching together sessions with distinct data structures with some post-processing code that was decoupled from running the behavior, which ultimately generated something much like Autopilot's hdf5 files. That worked well for me, but also it only had to handle my tasks and my changes, whereas the Autopilot community is much bigger and doing more heterogeneous things. So I get that you're going for something much more unified and strongly typed, and that makes sense and is a compelling vision. :+1:

sneakers-the-rat commented 2 years ago

> I don't have a ton to say in response to that, you've clearly thought about it a lot already. The one thing I will contribute is that I am pretty sure I'll always value some flexibility to change the table structure halfway through an experiment.

I totally agree. The way this gets handled is that when a change in the table structure is detected, we try to adapt the table to the change; if we can't, then the old table is archived, a new one is created, and that change is logged. Same thing if a subject is reassigned to multiple tasks over its lifetime.
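That decision can be sketched as a small reconciliation step over column descriptions: purely additive changes are adapted in place, anything else triggers archive-and-recreate. The function name and three-way result are invented for illustration:

```python
def reconcile(old_cols: dict, new_cols: dict) -> str:
    """Decide what to do when the declared structure differs from the table.

    'unchanged': descriptions match, nothing to do
    'adapt':     purely additive change, widen the existing table in place
    'archive':   a column changed type or disappeared, so archive the old
                 table and create a fresh one (logging the change)
    """
    if old_cols == new_cols:
        return 'unchanged'
    if all(new_cols.get(c) == kind for c, kind in old_cols.items()):
        return 'adapt'
    return 'archive'

base = {'trial_num': 'Int32Col', 'response': 'StringCol'}
print(reconcile(base, dict(base, laser_power='Float64Col')))  # → adapt
print(reconcile(base, dict(base, response='Int32Col')))       # → archive
```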

> Sure, in a perfect world we'd plan out a mouse's entire training structure from before it starts, but in practice I always find myself throwing in a new trial condition, which means a new column in the hdf5 table (or whatever). Right now my workaround has been something like Mike's "gap-laser2.json", where I basically increment the version number of the mouse, and go from there. Which isn't necessarily a bad solution, at least for now.

I appreciate your patience; it's a pretty awkward fix in the short term, but again I totally agree that it should be possible to mutate data structures over time. The tricky bit is logging all those changes for the sake of provenance, and also to be able to make sense of changing data structures over time (e.g. we used to store responses as 'R' vs 'L' but now we're using an int for that... when did that happen? and what does it mean now?). Getting there!
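One lightweight way to keep that provenance trail is to append a timestamped record alongside every structural or semantic change, so "when did 'response' become an int, and what do the values mean?" stays answerable later. The record shape here is invented, not anything Autopilot does today:

```python
from datetime import datetime, timezone

schema_history = []

def log_schema_change(column, old_type, new_type, note=''):
    """Append one provenance record for a column whose type or meaning changed."""
    schema_history.append({
        'time': datetime.now(timezone.utc).isoformat(),
        'column': column,
        'old': old_type,
        'new': new_type,
        'note': note,
    })

log_schema_change('response', 'StringCol', 'Int32Col',
                  note="0 = left, 1 = right (was 'L'/'R')")
```

Persisted next to the data (e.g. as a table attribute or a sidecar file), a log like this is enough to reconstruct what a column meant at any point in a subject's history.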

> The way I dealt with this in the pre-Autopilot world was just dumping logfiles and totally untyped "results.json" files from every session. So the code running any given session was oblivious to the animal's training history, except insofar as that affected the parameters for that session. And then I would deal with stitching together sessions with distinct data structures with some post-processing code that was decoupled from running the behavior, which ultimately generated something much like Autopilot's hdf5 files. That worked well for me, but also it only had to handle my tasks and my changes, whereas the Autopilot community is much bigger and doing more heterogeneous things. So I get that you're going for something much more unified and strongly typed, and that makes sense and is a compelling vision. 👍

This is the norm for almost all experimental code that I've seen: save "some data" in "some way" and then clean it afterwards. It works, and it's the most flexible way of doing it, but as I'm sure you've experienced it has its own brittleness: you need to write an additional software library just to clean the data, which over time mutates into a lab-idiosyncratic storage system with its own long history of analysis stages built on top of it... ultimately making it pretty hard to share data and analysis code downstream. I'm trying to make a system where it's possible to explicitly declare data structure and have it mutate, but still be able to produce an intelligible schema that can be easily read and interpreted downstream. Hard problem!