cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License

feat. req.: separate interface for study-level data #211

Status: Closed (turbomam closed 1 year ago)

turbomam commented 3 years ago

I have really been enjoying experimenting with DataHarmonizer and I have appreciated our meetings. I can't remember whether we discussed this yet...

Could there be some user interface component, separate from the existing HOT sheet for sample metadata, that captures study-level data? I can even imagine it as a static form. I don't imagine it requiring multiple rows.

Then, the user shouldn't have to enter any of that information in the existing sample metadata sheet, but upon save or export, those values would be pasted in as constant values in additional columns.
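
As a rough illustration of the "paste in as constants on export" idea, here is a minimal Python sketch. It is not DataHarmonizer code: the function name `apply_study_constants` and the field names (`study_id`, `principal_investigator`, `sample_id`, `collection_date`) are hypothetical, and rows are assumed to be plain dictionaries keyed by column name.

```python
def apply_study_constants(samples, study):
    """Return new sample rows with every study-level field appended as a constant column.

    Hypothetical helper: `samples` is a list of dicts (one per sheet row) and
    `study` is a single dict captured from a separate study-level form.
    """
    merged = []
    for row in samples:
        # Refuse to silently overwrite a sample column with a study value.
        clash = set(row) & set(study)
        if clash:
            raise ValueError(f"study fields collide with sample columns: {sorted(clash)}")
        merged.append({**row, **study})
    return merged


# Illustrative data only; these field names are assumptions.
study = {"study_id": "STUDY-1", "principal_investigator": "A. Researcher"}
samples = [
    {"sample_id": "S1", "collection_date": "2021-06-01"},
    {"sample_id": "S2", "collection_date": "2021-06-02"},
]
exported = apply_study_constants(samples, study)
```

The original rows are left untouched, so the sheet itself never has to carry the study columns; they only appear in the exported copy.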

I guess we could just start out with columns for those study parameters, and the user would just fill out one row and copy/paste across the other rows.

Another solution would be just inserting a "part_of" column into the existing sample metadata sheet and then populating the study ID. It would be cool if we could export this imaginary study metadata and the sample metadata into two JSON files. See #209
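
A sketch of the "part_of" alternative with a two-file export might look like the following. Everything here is an assumption for illustration: the file names `study.json` and `samples.json`, the `export_study_and_samples` helper, and the record fields are invented, and no existing DataHarmonizer export API is implied.

```python
import json
import tempfile
from pathlib import Path


def export_study_and_samples(study, samples, out_dir):
    """Write study and sample metadata to two JSON files linked by a part_of key.

    Hypothetical helper: each sample row gets a "part_of" value holding the
    study identifier, so the study fields are stored once rather than per row.
    """
    study_id = study["study_id"]
    linked = [{**row, "part_of": study_id} for row in samples]

    study_path = Path(out_dir) / "study.json"      # file names are assumptions
    samples_path = Path(out_dir) / "samples.json"
    study_path.write_text(json.dumps(study, indent=2))
    samples_path.write_text(json.dumps(linked, indent=2))
    return study_path, samples_path


# Illustrative data only.
out_dir = tempfile.mkdtemp()
study_path, samples_path = export_study_and_samples(
    {"study_id": "STUDY-1", "title": "Example study"},
    [{"sample_id": "S1"}, {"sample_id": "S2"}],
    out_dir,
)
```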

turbomam commented 3 years ago

I see that you are already capturing some of that project data in the CanCOGeN template. For NMDC, one of our prototypes has assumed that all samples submitted in a given batch will belong to a single study, project, or whatever.

Do you find that people use DataHarmonizer for capturing COVID data associated with multiple BioProjects in a single batch?

ddooley commented 3 years ago

We can add one or more columns to a table that hold key values joining it to other tables. Right now there's no functionality for managing those joins, i.e. checking whether the key value for a join is valid, and no nice interface to click on a join and be taken to another DataHarmonizer tab where the target template (in your case, project-level data) exists and can be edited.

dehays commented 3 years ago

@turbomam Managing references between attributes of different classes / sheets might be more easily supported by interfaces for some server-side persistence and validation. So you could have a study template and different flavors of biosample templates. Save (POST) the study. Then run a validation on each biosample instance to check that its study identifiers map to existing study records. This begins to define a workflow (create a study and then create biosamples that refer to that study) which is more application use case specific than what DataHarmonizer needs to do itself.
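
The reference check described above could be sketched roughly as follows. This is only an illustration of the validation step, not a server implementation: `find_unknown_study_refs`, the `part_of` key, and the identifiers are hypothetical, and the set of saved study IDs stands in for whatever a server-side store would return.

```python
def find_unknown_study_refs(biosamples, known_study_ids, key="part_of"):
    """Return (row_index, value) pairs for biosamples whose study reference is not known.

    Hypothetical validator: a missing key is reported with a value of None.
    """
    known = set(known_study_ids)
    return [
        (index, row.get(key))
        for index, row in enumerate(biosamples)
        if row.get(key) not in known
    ]


# Stand-in for study records that were already saved (POSTed).
saved_studies = {"STUDY-1", "STUDY-2"}

biosamples = [
    {"sample_id": "S1", "part_of": "STUDY-1"},  # valid reference
    {"sample_id": "S2", "part_of": "STUDY-9"},  # refers to a study that was never saved
    {"sample_id": "S3"},                         # no study reference at all
]
errors = find_unknown_study_refs(biosamples, saved_studies)
```

Here `errors` identifies the second and third rows, which is the kind of result an application could surface back to the user before accepting the biosamples.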

ddooley commented 2 years ago

We're returning to scoping the design of this functionality now, due to some pressing needs for an AMR project. Emma has asked for:

the ability to handle one-to-many relationships between samples, isolates, sequences and assay profiles using sample, isolate, sequence identifiers as keys. For example, a sample may be collected on a certain date, in a certain place, and be associated with a sample type. If several colonies are isolated from the sample, the identical sample information must be entered over and over again for the different isolates. If different sequences are produced from the same isolate, the identical isolate and sample information must be entered over and over again. Modularization and auto-fill functions will greatly help to reduce the need for repetitive data entry.

Now that DataHarmonizer runs via the browser API, it should be easier to present users with components where a main DH table can link records to a view/edit form for one-to-many records. The design question that follows is: what data format can we store such edits in? @dehays as you say, it could be server-side, esp. if that's where the one-to-many relations are stored/managed. But we're also interested in maintaining a stand-alone browser version, so we're wondering about a single-file, multi-table data format, and thought right away of SQLite (though I'm not sure if it can be loaded without a webserver). Are there any other formats we should consider for stand-alone operation?
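
To make the multi-table idea concrete, here is a small sketch of a sample-to-isolate one-to-many layout in SQLite, using Python's standard `sqlite3` module. The table and column names are assumptions chosen to mirror the example above, and the sketch says nothing about whether such a file can be loaded in a browser without a webserver; it only illustrates how sample details could be entered once and filled in for each isolate through a join.

```python
import sqlite3

# In-memory database for illustration; a file path would give a single-file store.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Hypothetical schema: one sample row can be referenced by many isolate rows.
conn.executescript(
    """
    CREATE TABLE sample (
        sample_id       TEXT PRIMARY KEY,
        collection_date TEXT,
        sample_type     TEXT
    );
    CREATE TABLE isolate (
        isolate_id TEXT PRIMARY KEY,
        sample_id  TEXT NOT NULL REFERENCES sample(sample_id)
    );
    """
)

# Sample information is entered once...
conn.execute("INSERT INTO sample VALUES (?, ?, ?)", ("S1", "2022-03-01", "swab"))
# ...and each isolate only carries the sample key.
conn.executemany("INSERT INTO isolate VALUES (?, ?)", [("I1", "S1"), ("I2", "S1")])

# "Auto-fill" on export: a join repeats the sample fields for every isolate.
rows = conn.execute(
    """
    SELECT i.isolate_id, s.sample_id, s.collection_date, s.sample_type
    FROM isolate AS i
    JOIN sample AS s ON i.sample_id = s.sample_id
    ORDER BY i.isolate_id
    """
).fetchall()

# An isolate pointing at a non-existent sample is rejected by the foreign key.
try:
    conn.execute("INSERT INTO isolate VALUES (?, ?)", ("I3", "MISSING"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The same pattern would extend to sequences keyed on isolates, with the foreign key constraint doing the job of checking that a join key is valid.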