LorenFrankLab / spyglass

Neuroscience data analysis framework for reproducible research built by Loren Frank Lab at UCSF
https://lorenfranklab.github.io/spyglass/

Proposal for tracking row-based provenance #1113

Open · rly opened 4 days ago

rly commented 4 days ago

For common tables, users may accidentally (or intentionally) modify a row created by another user. So it is useful to know who last modified a row and when. Similarly, it is useful to know who created the row in the first place.

A few types of actions (declare table, drop table, delete from table) are tracked in the hidden DataJoint ~log table that is part of each database (e.g., common_nwbfile), but those are limited to table-wide actions and deletions, and the log may be hard for everyday users to parse.

There are many ways to do this, but since we care only about the latest state of each row, one common approach is to add columns to the data tables that record that latest state:
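For concreteness, a minimal sketch of what those columns could look like, assuming the in-development hidden-attribute support (the leading-underscore spelling, defaults, and example table are all illustrative, not confirmed syntax):

```python
import datajoint as dj

schema = dj.schema("common_nwbfile")  # example schema from above

@schema
class ExampleData(dj.Manual):
    definition = """
    example_id : int
    ---
    payload : varchar(64)
    # hypothetical hidden provenance columns
    _created_by = "" : varchar(64)               # user who inserted the row
    _created_at = CURRENT_TIMESTAMP : timestamp  # when the row was inserted
    _updated_by = "" : varchar(64)               # user who last modified the row
    _updated_at = CURRENT_TIMESTAMP : timestamp  # when the row was last modified
    """
```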

This is now doable in the latest version of DataJoint (not released yet?) without cluttering the common display of tables (i.e., the columns are "hidden" but can be queried/displayed).

Alternative: we could create a history table for every data table and add a foreign key from the data table to the history table. I'm not sure if this separation adds any value now that we can have hidden columns.

Concern: Both approaches will increase the size of the database. Is it worthwhile?

It would be nice if this were built into DataJoint as MySQL triggers, but until then, we could add the values every time we call populate.

@CBroz1 noted:

I think it would require the mixin intercepting the definition attr before table declaration and appending new lines. There would also be some migration effort of altering all existing tables
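As a rough illustration of that interception (a sketch only: `ProvenanceMixin` is hypothetical, and it assumes `declare` can be wrapped and that hidden attributes use a leading underscore):

```python
import datajoint as dj

# Hidden fields to append to every table definition (hypothetical syntax)
PROVENANCE_LINES = """
_created_by = "" : varchar(64)
_created_at = CURRENT_TIMESTAMP : timestamp
"""

class ProvenanceMixin:
    """Sketch: append hidden provenance fields before table declaration."""

    def declare(self, context=None):
        if "_created_by" not in self.definition:
            self.definition = self.definition.rstrip() + PROVENANCE_LINES
        super().declare(context)

    def insert(self, rows, **kwargs):
        # Fill in the creating user at insert/populate time; assumes dict rows
        user = dj.config["database.user"]
        rows = [dict(row, _created_by=user) for row in rows]
        super().insert(rows, **kwargs)

# Usage sketch: class MyTable(ProvenanceMixin, dj.Computed): ...
```

As noted, existing tables would also need an ALTER migration to gain these columns.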

Questions:

  1. Is this enhancement worth making? How often are tables modified in a way where tracking who/when/why a change was made would be useful?
  2. If yes, what do you think about the approach?
samuelbray32 commented 4 days ago

Just putting this here since it's tangential (though it doesn't cover as much as proposed): we are currently tracking the spyglass version in generated NWB files (#897, #900)

CBroz1 commented 3 days ago

Problem

I'd like to state the problem concisely before exploring solutions:

Shared table use leaves (a) data provenance in question when users may have different versions of the package, and (b) data integrity in question when users may update upstream rows.

Does that capture it? I'll refer to these as 'the version problem' and 'the update problem' as I think out loud below.

Solutions

Hidden fields

Row-wise insertion/update information could track everything.

The added data volume could be mitigated by being selective about which tables get these fields. Merge tables or file tables (Nwbfile, AnalysisNwbfile) could be good candidates, offering less protection against the update problem but decent tracking of who added a row and when.

File as sparse monitoring

As Sam mentioned, the spyglass version in NWB files already provides a sparse solution to the version problem, but it may be sufficiently dense to cover the cases in between file saves. Is an additional tool needed to compare across all files from the same session? Is an additional field needed for the user?

Expanding built-in log use

The mixin would allow us to intercept insert and/or update to add a log row for each. This is likely a better fit for the update problem than the version problem.
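A sketch of what that could look like, with a custom log table standing in for the built-in ~log (the `UsageLog` table, schema name, and mixin are all hypothetical):

```python
import uuid
from datetime import datetime

import datajoint as dj

schema = dj.schema("common_usage")  # assumed schema name

@schema
class UsageLog(dj.Manual):
    definition = """
    log_id : uuid
    ---
    event_time : timestamp
    dj_user : varchar(64)
    table_name : varchar(128)
    event : varchar(255)
    """

class LoggedUpdateMixin:
    """Sketch: add a log row for every update1 call."""

    def update1(self, row):
        super().update1(row)
        UsageLog.insert1(
            dict(
                log_id=uuid.uuid4(),
                event_time=datetime.utcnow(),
                dj_user=dj.config["database.user"],
                table_name=self.full_table_name,
                event="update1",
            )
        )
```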

Custom log table

If row-wise logging is too dense for version tracking, can we instead keep track of when a user updates their environment, and when their environment is editable, using the snippets provided by Ryan in #1087?

A new 'user history' table would load whenever spyglass is imported. It would check the user's current spyglass version and/or conda environment against their last entry, and insert a row if different from last time.
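A sketch of that import-time check (table, schema, and function names are illustrative; `spyglass-neuro` is assumed as the installed distribution name, and a conda environment hash per #1087 could be added the same way):

```python
from datetime import datetime
from importlib.metadata import version

import datajoint as dj

schema = dj.schema("common_usage")  # assumed schema name

@schema
class UserEnvironment(dj.Manual):
    definition = """
    dj_user : varchar(64)
    entry_time : timestamp
    ---
    spyglass_version : varchar(32)
    """

def record_environment():
    """Insert a row if the user's environment changed since their last entry."""
    user = dj.config["database.user"]
    current = version("spyglass-neuro")  # distribution name assumed
    last = (UserEnvironment & {"dj_user": user}).fetch(
        "spyglass_version", order_by="entry_time DESC", limit=1
    )
    if not len(last) or last[0] != current:
        UserEnvironment.insert1(
            dict(dj_user=user, entry_time=datetime.utcnow(), spyglass_version=current)
        )

# record_environment() would run once at import, e.g. in spyglass/__init__.py
```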

A BugHistory lookup table could list the versions and tables whose saved files were impacted. A BugChecker could find which files were saved with that version by that table, and return the list of files suggested for reprocessing, along with the session experimenter. This would be easier if the version were added to AnalysisNwbfile as a field.
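Sketched out, assuming AnalysisNwbfile gains a `spyglass_version` field as suggested (BugHistory and the checker function are hypothetical):

```python
import datajoint as dj
from spyglass.common import AnalysisNwbfile

schema = dj.schema("common_usage")  # assumed schema name

@schema
class BugHistory(dj.Lookup):
    definition = """
    bug_id : int
    ---
    affected_version : varchar(32)
    affected_table : varchar(128)
    description : varchar(255)
    """

def files_needing_reprocessing():
    """Return analysis files saved with a version listed in BugHistory."""
    flagged = []
    for bug in BugHistory.fetch(as_dict=True):
        # Filtering by the creating table would need a second new field on
        # AnalysisNwbfile; only the version filter is sketched here.
        files = (
            AnalysisNwbfile & {"spyglass_version": bug["affected_version"]}
        ).fetch("analysis_file_name")
        flagged.extend(files)
    return flagged
```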

Permissions changes

If we're worried about 'update', we can disable it for non-admin users, either in SQL or in the python interface.
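On the python side, that could be a guard in the mixin (a sketch; the admin set is a placeholder, and SQL GRANT/REVOKE would be the server-side equivalent):

```python
import datajoint as dj

ADMIN_USERS = {"admin"}  # placeholder; a real check might consult SQL grants

class NoUpdateMixin:
    """Sketch: block update1 for non-admin users in the python interface."""

    def update1(self, row):
        if dj.config["database.user"] not in ADMIN_USERS:
            raise PermissionError("update1 is restricted to admin users")
        super().update1(row)
```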

Replication tool

If we're worried about whether or not a given downstream key/file could be exactly replicated, we could instead focus on enabling that directly, rather than tracking possible errors.

Given a downstream key, we can already export a vertical slice of the database, with all paramset table pairings. I spent some time exploring such a tool in #1057 with the idea of running the same analysis on a different session. We could fully solve the version problem for a given analysis by instead rerunning the same session(s).

Hypothetical goal: the production database is a staging ground; finalizing an analysis means a dockerized from-scratch rerun.

General thoughts

I propose adding a line to the built-in log for update1, plus some sort of tool on Nwbfile to check for updates across schemas for a given session.

I'm in favor of sparser methods of protecting against the version issue until we see definite red flags where we retroactively wish we had row-wise logging like this. My gut says that adding the user history log, and adding a created-by user to analysis files, is enough to cross-reference who made which file and when. Not a full solution to the version problem, but perhaps an adequate record to prevent full reprocessing.