Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

Script to extract common data from Study doc and alert curators to new data that needs checking #1049

Open hepcat72 opened 3 months ago

hepcat72 commented 3 months ago

FEATURE REQUEST

Inspiration

The new submission interface allows researchers (if they wish) to add compounds, tissues, treatments, LC Protocols, tracers, and infusates. Some or all of these require manual curation by the curators who are loading the data. These data should be extracted from a submission and a consolidated doc should be updated. Curators should be alerted to the new data and scrutinize it. If there are any problems, they should reach out to the researcher who submitted the data to coordinate a resolution.

Description

Data that is common to and shared across all studies should be monitored for changes and manually scrutinized by curators. We should create a script that:

  1. Outputs the original xlsx study doc with the common data sheets removed (e.g. the Compounds sheet)
  2. Outputs a "consolidated_data.xlsx" doc with tabs for only the common data, with warnings attached to new/changed data
  3. Issues warnings for changes (new or modified data) so the curator can scrutinize them, and applies those warnings as comments to the affected rows
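
As a rough illustration of the split described in items 1 and 2, here is a minimal sketch that works on an in-memory representation of a workbook (a dict of sheet name to rows) rather than on a real xlsx file. The function name `split_workbook`, the `COMMON_SHEETS` set, and the exact sheet names are assumptions made for this example, not part of the existing codebase.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Workbook = Dict[str, List[Row]]  # sheet name -> list of rows

# Assumed sheet names for the data shared across studies (hypothetical).
COMMON_SHEETS = frozenset(
    {"Compounds", "Tissues", "Treatments", "LC Protocols", "Tracers", "Infusates"}
)


def split_workbook(
    workbook: Workbook, common_sheets=COMMON_SHEETS
) -> Tuple[Workbook, Workbook]:
    """Split a workbook into (study_specific, consolidated) sheet dicts."""
    study_specific: Workbook = {}
    consolidated: Workbook = {}
    for name, rows in workbook.items():
        # Route each sheet to exactly one of the two outputs.
        target = consolidated if name in common_sheets else study_specific
        target[name] = rows
    return study_specific, consolidated
```

For example, a workbook with a `Samples` sheet and a `Compounds` sheet would end up with `Samples` in the study-specific output and `Compounds` in the consolidated output.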

Things to note:

A doc describing the load process from a curator's perspective should accompany this effort.

Alternatives

None

Dependencies

None

Comment

None


ISSUE OWNER SECTION

Assumptions

None

Limitations

None

Affected Components

Requirements

DESIGN

Interface Change description

curate_study.py new script

Purpose

The intent of this script is to split the data into study-specific data and study-common/consolidated data, and in doing so, it highlights new/changed data to prompt the curator to scrutinize that data even if it has no errors.
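
One possible way to highlight new/changed data is to compare each submitted row against the already-known records and produce one warning per row. The sketch below assumes rows are plain dicts and that one column (here passed as `key`) identifies a record; the function name `flag_rows` and the warning wording are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]


def flag_rows(submitted: List[Row], known: List[Row], key: str) -> List[Optional[str]]:
    """Return one warning string (or None if unchanged) per submitted row."""
    known_by_key = {row[key]: row for row in known}
    warnings: List[Optional[str]] = []
    for row in submitted:
        existing = known_by_key.get(row[key])
        if existing is None:
            # No record shares this key: treat the row as new.
            warnings.append(f"NEW: no existing record with {key}={row[key]!r}")
        elif existing != row:
            # Same key, different values: list the columns that differ.
            changed = sorted(
                col for col in set(existing) | set(row)
                if existing.get(col) != row.get(col)
            )
            warnings.append(f"CHANGED: {', '.join(changed)} differ from the existing record")
        else:
            warnings.append(None)
    return warnings
```

The returned list lines up with the submitted rows, so each warning could later be attached as a comment on the corresponding row of the consolidated doc.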

Inputs

load_curated.py new script

Purpose

This script will only load consolidated data from a consolidated-data-<timestamp>.xlsx file produced by curate_study.py. If the file contains study-specific data, it will be ignored.

Inputs

load_study.py changed script

Purpose

This script will now only load study-specific data from a <infile name>-study-specific-data.xlsx file produced by either curate_study.py or by the build-a-submission interface. If the file contains consolidated ("study-common") data, it will be ignored.
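
The two file-name patterns mentioned for these scripts could be generated along the following lines. This is only a sketch: the helper name `output_names`, the timestamp format, and the choice to drop the input file's extension for `<infile name>` are all assumptions.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Tuple


def output_names(infile: str, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Build (study-specific name, consolidated name) for a given input file."""
    now = now or datetime.now()
    stem = Path(infile).stem  # assumption: "<infile name>" excludes the extension
    study_specific = f"{stem}-study-specific-data.xlsx"
    # Assumed timestamp format: YYYYMMDD-HHMMSS
    consolidated = f"consolidated-data-{now:%Y%m%d-%H%M%S}.xlsx"
    return study_specific, consolidated
```

Passing an explicit `now` keeps the result reproducible in tests.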

Note that it uses study_loader.py, which can process all data in validate mode. It just doesn't set the validate flag. That is supplied by the DataValidationView when it runs the StudyLoader.

Inputs

Code Change Description

study_loader.py

Tests

Unit test every new class/method

mneinast commented 2 months ago

Here's a relevant vignette from testing phase 1 of the submission process that might highlight a way for users to update compound synonyms:

I tried uploading data containing compound names that do not exist as synonyms in tracebase (but the primary compound does exist). The validate page caught this. I then checked the Compounds tab in the study doc and found these problematic compounds: they were missing an HMDB ID and the synonyms field was empty: [image]

I decided to modify the Columns page in my Study Doc: for each of these new synonyms, I added the correct HMDB ID: image

From the user's perspective this is an easy way to identify the compound unambiguously. But I can see that this would create problems for the database (unless something is created that specifically handles this). The validator found these and threw an error:

PASSED: Samples Check
PASSED: Peak Annotation Files Check
PASSED: Tissues Check
PASSED: Treatments Check
PASSED: Compounds Check
FAILED: col013b_study doc_240812.xlsx

ConflictingValueErrors
Conflicting values encountered during loading:
    During the processing of sheet [Compounds] in col013b_study doc_240812.xlsx...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'Glucose 6-phosphate', 'formula': 'C6H13O9P', 'hmdb_id': 'HMDB0001401'} (on row(s): 3)
        Database record: {'name': 'glucose-6-phosphate', 'formula': 'C6H13O9P', 'hmdb_id': 'HMDB0001401'}
            [name] values differ:
            - database: [glucose-6-phosphate]
            - file:     [Glucose 6-phosphate]
        File record:     {'name': 'L-Lactic acid', 'formula': 'C3H6O3', 'hmdb_id': 'HMDB0000190'} (on row(s): 4)
        Database record: {'name': 'lactate', 'formula': 'C3H6O3', 'hmdb_id': 'HMDB0000190'}
            [name] values differ:
            - database: [lactate]
            - file:     [L-Lactic acid]

hepcat72 commented 2 months ago

BTW, I think your comment should probably be in a different issue, but ignoring that for now...

@mneinast - So I think I can accurately guess what's going on here. Here's what I expect happened.

I think that the solution here is to auto-populate all existing compounds and hope that the user will see that an autofilled row is essentially a compound synonym of another row, delete that row, and add the synonym to the other record.

I'll create an issue for this.

hepcat72 commented 2 months ago

#1153 created

mneinast commented 2 months ago

Yes, I agree that's what happened. Your proposed solution sounds good to me.