LorenFrankLab / spyglass

Neuroscience data analysis framework for reproducible research built by Loren Frank Lab at UCSF
https://lorenfranklab.github.io/spyglass/
MIT License

Export of selected table entries and associated files to a different database / spyglass install #1129

Open lfrank opened 1 month ago

lfrank commented 1 month ago

It seems possible (perhaps likely?) that different groups will have their own databases, but would like to be able to import a set of analyses / results from another group. This could be something like issue #861 but with a provision to transfer entries to a different database.

There are multiple complexities here, but if this were possible it could be really useful.

CBroz1 commented 1 month ago

Some questions come to mind regarding data integrity ...

  1. What if there are naming collisions in entries?
    • Case 1: LabA has 'subject1' and tries to load LabB's 'subject1', a different subject.
    • Case 2: LabA and LabB both have data from the same 'subject1', run with 'ParamsA', but this paramset was defined differently in each case.
    • How should a load handle conflicts? It could...
      • simply reject a load with overlapping names
      • assume a collision refers to the same entity (e.g., assume default paramsets have not been changed)
      • append some value to the loaded entry (e.g., 'subject1_imported{DATE}')
      • pairwise compare every collision, including data stored as blobs, which would be time intensive
  2. What if there are differences in table definitions?
    • Case 3: LabA has kept up with table alters (e.g., adding new fields), but LabB never ran these alters when updating Spyglass.
    • Case 4: LabA and LabB do not share the exact same definition of a downstream custom table.
    • How should a load handle these cases? It could...
      • reject the load
      • rename the custom tables (e.g., 'CustomTableImported{DATE}')
      • attempt to suggest changes to the imported file or alter existing tables
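To make the Case 2 trade-off concrete, here is a minimal sketch (not part of Spyglass; the `param_name` key and entry shape are assumptions) of combining two of the options above: compare colliding entries by a content hash, skip them when identical, and rename them otherwise.

```python
import hashlib
import json


def content_hash(entry: dict) -> str:
    """Hash an entry's contents so two paramsets sharing a primary key
    can be compared without field-by-field inspection. Blob-like values
    are rendered to a canonical string (here: repr) before hashing."""
    canonical = json.dumps(
        {k: repr(v) for k, v in sorted(entry.items())},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


def resolve_collision(local: dict, incoming: dict, date: str):
    """Return None if the colliding entries are identical (safe to skip
    the import), otherwise return the incoming entry under a renamed key
    (the 'subject1_imported{DATE}' strategy)."""
    if content_hash(local) == content_hash(incoming):
        return None  # same entity; nothing to import
    renamed = dict(incoming)
    # Assumes the primary key field is named 'param_name' (hypothetical).
    renamed["param_name"] = f"{incoming['param_name']}_imported{date}"
    return renamed
```

Even this simple version shows the cost: every blob in a colliding entry must be fetched and canonicalized before the hashes can be compared.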

Any monitoring of the ingestion process to resolve collisions is going to be a major lift of parsing SQL error messages, which DataJoint is better equipped for than Spyglass (maybe worth a feature request to them?). A skilled user could manage these decisions working with SQL directly, but I'm not confident in our ability to do it programmatically in Python. A full-featured approach might be an effort on par with expanding DataJoint by 30% to handle all possible error codes and revert on failure.
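To illustrate the scale of that lift, a sketch of the triage an ingestion loop would need, keyed on real MySQL error numbers (the mapping and decision strings are illustrative assumptions, not a Spyglass or DataJoint API):

```python
# MySQL server error codes an ingestion loop would have to distinguish.
ER_DUP_ENTRY = 1062          # primary-key collision
ER_NO_REFERENCED_ROW = 1452  # missing upstream (foreign-key) entry
ER_BAD_FIELD_ERROR = 1054    # unknown column: table definitions drifted


def classify_insert_error(errno: int) -> str:
    """Map a MySQL error number to a coarse ingestion decision.
    Real recovery (rename, re-order inserts, alter tables, roll back)
    would branch from here, and this covers only three of the many
    codes the server can emit."""
    return {
        ER_DUP_ENTRY: "collision: compare or rename the entry",
        ER_NO_REFERENCED_ROW: "defer: insert upstream entries first",
        ER_BAD_FIELD_ERROR: "schema drift: table definitions differ",
    }.get(errno, "unknown: abort and roll back the load")
```

Each branch here hides a full sub-problem (the collision options above, dependency ordering, schema reconciliation), which is why handling all of them programmatically looks like a DataJoint-scale effort.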

An alternate approach might look more like a 'replication tool' that exports a spec of paramsets to run and then applies them to a different database. This would require rerunning all computations, but it would allow DataJoint and/or the end user to handle collisions one by one.
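A minimal sketch of what that replication spec could look like (the JSON shape, `run_step` callable, and field names are all hypothetical): export only the recipe, not the results, and replay each step on the target so collisions surface one at a time at insert/populate time.

```python
import json


def export_spec(analyses: list) -> str:
    """Serialize the paramsets and pipeline steps to re-run,
    rather than the computed results themselves."""
    return json.dumps({"spec_version": 1, "analyses": analyses}, indent=2)


def apply_spec(spec_json: str, run_step) -> list:
    """Replay each step on the target database via a caller-supplied
    run_step callable (e.g., one that inserts a paramset and calls
    populate). Errors from any single step can be handled individually
    instead of parsing failures from a bulk load."""
    spec = json.loads(spec_json)
    return [run_step(step) for step in spec["analyses"]]
```

The design choice is that the target database's own integrity checks do the collision detection, entry by entry, instead of the exporter trying to anticipate every conflict up front.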

lfrank commented 1 month ago

Great points, and indeed the replication tool might be by far the best way to approach this given all the challenges. Let's discuss when you're back in town.