**Open** · rly opened this issue 4 days ago
Just putting this here since it's tangential (though it doesn't cover as much as proposed); we are currently tracking the spyglass version in generated nwb files (#897, #900)
I'd like to state the problem concisely before exploring solutions:
Shared table use leaves (a) data provenance in question when users may have different versions of the package, (b) data integrity in question when users may update upstream rows.
Does that capture it? I'll refer to these as 'the version problem' and 'the update problem' as I think out loud below.
Row-wise insertion/update information could track everything
This would require `alter`s, which usually run smoothly but would cause headaches if tables were forgotten or mismatched across pre/post implementation. Data density could be mitigated by being selective in which tables had these fields. Merge tables or file tables (`Nwbfile`, `AnalysisNwbfile`) could be good candidates, offering less protection against the update problem but decent timestamp gathering on who added a row when.
As Sam mentioned, the spyglass version in nwb files already provides a sparse solution for the version problem, but it may not be dense enough to capture cases in between file saves. Is an additional tool needed to compare across all files from the same session? Is an additional field needed for user?
The mixin would allow us to intercept `insert` and/or `update` to add additional log rows for each. Likely a better case for the update problem than the version problem.
If row-wise is too dense for version tracking, can we keep track of when a user updates their environment, and whether their environment is editable, using snippets provided by Ryan in #1087?
A new 'user history' table loads whenever spyglass is imported. It checks the user's current spyglass version and/or conda environment against their last entry, and inserts a row if different from last time.
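The import-time check above could look roughly like the following. This is a hedged sketch: `history` stands in for the real 'user history' table, and `current_version` would come from the installed spyglass package; both names are illustrative.

```python
from datetime import datetime, timezone


def record_environment(history, user, current_version):
    """Append a row only if the version differs from the user's last entry."""
    # find the user's most recent entry, if any
    last = next((row for row in reversed(history) if row["user"] == user), None)
    if last is not None and last["version"] == current_version:
        return False  # unchanged since the last import; nothing to record
    history.append(
        {
            "user": user,
            "version": current_version,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    )
    return True
```

Keeping only change events (rather than one row per import) keeps this table small while still bracketing which version a user had at any point in time.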
A `BugHistory` lookup table could list versions and tables whose saved files were impacted. A `BugChecker` would look at which files were saved with that version, by that table, and return the list of files for suggested reprocessing, along with the session experimenter. This is easier if version is added to `AnalysisNwbfile` as a field.
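The `BugChecker` lookup could reduce to a simple join. In this sketch, `bug_history` lists affected (version, table) pairs and `analysis_files` plays the role of `AnalysisNwbfile` with a version field added; all field names here are assumptions.

```python
def files_to_reprocess(bug_history, analysis_files):
    """List files saved by an affected (version, table) pair, with experimenter."""
    affected = {(bug["version"], bug["table"]) for bug in bug_history}
    return [
        {"file": f["file"], "experimenter": f["experimenter"]}
        for f in analysis_files
        if (f["spyglass_version"], f["table"]) in affected
    ]
```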
If we're worried about 'update', we can disable it for non-admin either in SQL or in the python interface.
If we're worried about whether or not a given downstream key/file could be exactly replicated, we could instead focus on doing that, rather than tracking possible errors in the data.
Given a downstream key, we can already export a vertical slice of the database, with all paramset table pairings. I spent some time exploring such a tool in #1057 with the idea of running the same analysis on a different session. We could fully solve the version problem for a given analysis by instead rerunning the same session(s).
Hypothetical goal: the production database is a staging ground; finalizing analyses means a dockerized from-scratch rerun.
I propose adding a line to the builtin log for `update1` and some sort of tool on `Nwbfile` to check for updates across schemas for a given session.
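The proposed session-level check might aggregate per-schema update logs like this. The log structure here is hypothetical; a real version would query each schema's audit table restricted to the session's keys.

```python
def updates_for_session(schema_logs, nwb_file_name):
    """Collect update entries for one session across all schemas."""
    return [
        {"schema": schema, **entry}
        for schema, entries in schema_logs.items()
        for entry in entries
        if entry.get("nwb_file_name") == nwb_file_name
    ]
```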
I'm in favor of sparser methods of protecting against the version issue until we see definite red flags where we retroactively wish we had access to row-wise logging such as this. My gut says that adding the user history log, and adding a created-by user to analysis files, is enough to cross-reference who made which file when. Not a full solution to the version problem, but perhaps an adequate record to prevent full reprocessing.
For common tables, users may accidentally (or intentionally) modify a row created by another user. So it is useful to know who was the last person who modified a row and when. Similarly, it is useful to know who created the row in the first place.
A couple types of actions (declare table, drop table, delete from table) are tracked in the DataJoint hidden `~log` table that is part of each database (e.g., `common_nwbfile`), but those are limited to table-wide actions and deletions, and it may be hard to parse for everyday users.

There are many ways to do this, but since we care only about the latest state of each row, one common way seems to be to add columns to the data tables about the latest state:
- `created_by` or `creation_user` (user) (with or without IP address)
- `created_on` or `created_time` or `creation_timestamp` (datetime)
- `last_modified_by` (user)
- `last_modified_on` or `last_modified_time` or `last_modification_timestamp` (datetime)
- `last_modified_reason` (string, e.g., "database migration" or "fix typo") (not sure if useful)
- `last_modified_spyglass_version` (string) (not sure if useful)
- `spyglass_environment` - a text-based dump of the virtual environment, host name, and operating system, but I think that is overkill and would result in a bloated database. (We have thought about this for NWB and are settling on the version of the primary software package as being good enough for our initial attempt at tracking provenance.)

This is now doable in the latest version of DataJoint (not released yet?) without cluttering the common display of tables (i.e., the columns are "hidden" but can be queried/displayed).
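As a rough sketch, the columns above could be appended below the divider of a DataJoint-style table definition. Attribute names and types here are assumptions, not a final schema, and the unreleased hidden-column syntax is omitted.

```python
# Illustrative definition fragment with provenance columns appended.
definition = """
nwb_file_name: varchar(255)                       # existing primary key
---
created_by="": varchar(64)                        # user who inserted the row
creation_timestamp=CURRENT_TIMESTAMP: timestamp
last_modified_by="": varchar(64)                  # user of the last update
last_modified_timestamp=CURRENT_TIMESTAMP: timestamp
last_modified_spyglass_version="": varchar(16)    # e.g., "0.5.2"
"""
```

Defaults (empty strings, `CURRENT_TIMESTAMP`) let existing rows survive an `alter` without manual backfill.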
Alternative: we could create a history table for every data table and add a foreign key from the data table to the history table. I'm not sure if this separation adds any value now that we can have hidden columns.
Concern: Both approaches will increase the size of the database. Is it worthwhile?
It would be nice if this were built into DataJoint as MySQL triggers, but until then, we could add the values every time we call `populate`.

@CBroz1 noted:
Questions: