hepcat72 commented 3 months ago

FEATURE REQUEST

Inspiration

The new submission interface allows researchers (if they wish) to add compounds, tissues, treatments, LC Protocols, tracers, and infusates. Some or all of these require manual curation by the curators who are loading the data. These data should be extracted from a submission, and a consolidated doc should be updated. Curators should be alerted to this new data and scrutinize it. If there are any problems, they should reach out to the researcher who submitted the data to coordinate a resolution.

Description

Data that is common to and shared across all studies should be monitored for changes and manually scrutinized by curators. We should create a script that:

Outputs the original xlsx study doc with the common data sheets removed (e.g. the Compounds sheet)
Outputs a "consolidated_data.xlsx" doc with tabs for only the common data, with warnings attached to new/changed data
Issues warnings for changes (new or modified data) so the curator can scrutinize them, and applies those warnings as comments to the affected rows

Things to note:

Deletion of rows (by the researcher) of any of the study doc sheets will not delete data in the database, nor will there be an error (unless they replace that row with a new row that has conflicts with the deleted one).
Modifications to rows would manifest as ConflictingValueErrors if a load attempt were made in dry-run mode
Checks for new data could either be a dry-run and an evaluation of the "created" stat or a Model.objects.get() call with all available fields supplied
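
The lookup-based check in the last note can be sketched without Django. This is a minimal sketch, assuming a plain dict keyed by a unique field stands in for the database table; `classify_row` and the three status strings are hypothetical names, not part of TraceBase. The example values come from the ConflictingValueErrors output quoted later in this thread.

```python
def classify_row(row, db_by_key, unique_field="hmdb_id"):
    """Classify a submitted row as 'existing', 'new', or 'conflict'.

    db_by_key is a stand-in for the database: it maps the value of the
    unique field to the stored record (a dict).
    """
    stored = db_by_key.get(row[unique_field])
    if stored is None:
        # No record with this unique value: the curator should review it
        return "new", {}
    # Collect every field whose value differs from the stored record,
    # as (database value, file value) pairs
    diffs = {
        field: (stored.get(field), value)
        for field, value in row.items()
        if stored.get(field) != value
    }
    return ("conflict", diffs) if diffs else ("existing", {})


db = {
    "HMDB0000190": {
        "name": "lactate",
        "formula": "C3H6O3",
        "hmdb_id": "HMDB0000190",
    }
}
submitted = {"name": "L-Lactic acid", "formula": "C3H6O3", "hmdb_id": "HMDB0000190"}
status, diffs = classify_row(submitted, db)
```

Here `status` is `"conflict"` and `diffs` holds the differing `name` values. A dry-run load that inspects the "created" stat would reach the same classification by a different route.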

A doc describing the load process from a curator's perspective should accompany this effort.

Alternatives

None

Dependencies

None

Comment

None

ISSUE OWNER SECTION

Assumptions

None

Limitations

None

Affected Components

change: study_loader.py
change: load_study.py
add: curate_study.py
add: study_curater.py
add: load_curated.py
change: load_table.py
change: table_loader.py

Requirements

[ ] 1. A study doc is output that contains only consolidated data
[ ] 2. A study doc is output containing only study-specific data
[ ] 3. New consolidated data is highlighted in the consolidated data study doc
[ ] 4. Erroneous or warning data is highlighted in the consolidated study doc
[ ] 5. The consolidated doc contains ALL consolidated data (i.e. if a user removed rows, they are added back in)
[ ] 6. When splitting consolidated from study-specific data, nothing is loaded, despite no errors or warnings

DESIGN

Interface Change description

`curate_study.py` new script

Purpose

The intent of this script is to split the data into study-specific data and study-common/consolidated data. In doing so, it highlights new/changed data to prompt the curator to scrutinize that data even if it has no errors.

Inputs

--infile study.xlsx (excel file): A submitted study
--outfile-study study-specific-data.xlsx (excel file) [<infile name>-study-specific-data.xlsx]: Where the study-specific data goes (i.e. no consolidated data is in this file)
--outfile-consolidated consolidated.xlsx (excel file) [consolidated-data-<timestamp>.xlsx]: Where the cross-study common data goes (i.e. all the consolidated data is in this file - no study-specific data)

Outputs

A study-specific xlsx file
A consolidated data xlsx file
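
The three options and their bracketed defaults could be parsed roughly as follows. This is only a sketch under stated assumptions: plain `argparse` is used so it runs on its own (the issue does not say how the real script registers its options), `<infile name>` is read as the file name without its directory or extension, and the timestamp format is a guess. `parse_args` and its `now` parameter are hypothetical.

```python
import argparse
from datetime import datetime
from pathlib import Path


def parse_args(argv=None, now=None):
    parser = argparse.ArgumentParser(description="Split a study doc for curation")
    parser.add_argument("--infile", required=True, help="A submitted study (xlsx)")
    parser.add_argument("--outfile-study", help="Study-specific output (xlsx)")
    parser.add_argument("--outfile-consolidated", help="Consolidated output (xlsx)")
    args = parser.parse_args(argv)

    if args.outfile_study is None:
        # Default: <infile name>-study-specific-data.xlsx
        # (assumption: the infile's stem, without its directory)
        args.outfile_study = f"{Path(args.infile).stem}-study-specific-data.xlsx"
    if args.outfile_consolidated is None:
        # Default: consolidated-data-<timestamp>.xlsx
        # (assumption: the timestamp format is not given in the issue)
        stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
        args.outfile_consolidated = f"consolidated-data-{stamp}.xlsx"
    return args


args = parse_args(["--infile", "study.xlsx"], now=datetime(2024, 8, 12, 9, 30, 0))
```

Passing `now` explicitly keeps the default file name reproducible in tests; in normal use it would be left out.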

`load_curated.py` new script

Purpose

This script will only load consolidated data from a consolidated-data-<timestamp>.xlsx file produced by curate_study.py. Any study-specific data in the file will be ignored.

Inputs

--infile consolidated-data-<timestamp>.xlsx (excel): A file produced by curate_study.py

Outputs

None (loads the database)

`load_study.py` changed script

Purpose

This script will now only load study-specific data from a <infile name>-study-specific-data.xlsx file produced by either curate_study.py or the build-a-submission interface. Any consolidated ("study-common") data in the file will be ignored.

Note that it uses study_loader.py, which can process all data in validate mode. It just doesn't set the validate flag. That is supplied by the DataValidationView when it runs the StudyLoader.

Inputs

--infile <infile name>-study-specific-data.xlsx (excel): A file containing study-specific data (e.g. samples)

Outputs

None (loads the database)

Code Change Description

`study_loader.py`

The StudyLoader class will have a new class attribute that identifies loaders as either curated/consolidated or not (default: not curated/consolidated).
The StudyLoader constructor will take new keyword arguments:
- curate, which causes it to pass the curate option to specific loaders, e.g. CompoundsLoader
- load_curated (default: False), which causes it to call ONLY loaders that are identified as curated/consolidated (the default is to load only study-specific data)
- I might add a load_all option to load both curated/consolidated and study-specific data, but that's optional in this design.
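
One way the class attribute and the load_curated argument could work together is sketched below. The attribute name `curated_loader_classes`, the method `selected_loader_classes`, and the `SamplesLoader` stand-in are all hypothetical; the curate argument is left out because the issue does not spell out how it interacts with loader selection.

```python
class CompoundsLoader:
    """Stand-in for a loader of consolidated (cross-study) data."""


class SamplesLoader:
    """Stand-in for a loader of study-specific data."""


class StudyLoader:
    loader_classes = [CompoundsLoader, SamplesLoader]
    # Hypothetical name for the new class attribute: loaders listed here are
    # curated/consolidated; any loader not listed defaults to study-specific.
    curated_loader_classes = {CompoundsLoader}

    def __init__(self, load_curated=False):
        self.load_curated = load_curated

    def selected_loader_classes(self):
        """Return only the loaders that match the load_curated setting."""
        return [
            cls
            for cls in self.loader_classes
            if (cls in self.curated_loader_classes) == self.load_curated
        ]


default_names = [c.__name__ for c in StudyLoader().selected_loader_classes()]
curated_names = [
    c.__name__ for c in StudyLoader(load_curated=True).selected_loader_classes()
]
```

With the default, only `SamplesLoader` is selected; with `load_curated=True`, only `CompoundsLoader` is. A load_all option would simply skip the filter.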

`table_loader.py`

The TableLoader constructor will take a new keyword argument, curate, which causes it to:
- Buffer fatal warnings whenever any data in any of its models is created
- Disable mass autoupdate (same as validate mode)
- Defer rollback (same as validate mode)
- Raise a CurationStatus exception (regardless of success or failure) that always triggers a rollback in the load_data wrapper
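
The "always roll back" behavior of the last point can be illustrated with the standard-library `sqlite3` module in place of Django's transaction handling. This is a sketch, not the TableLoader implementation: the decorator name, the `compound` table, and the `CurationStatus` attributes are assumptions made for the example.

```python
import sqlite3
from functools import wraps


class CurationStatus(Exception):
    """Raised at the end of a curate-mode load, whether or not it succeeded."""

    def __init__(self, created, warnings):
        super().__init__(f"{created} record(s) would be created")
        self.created = created
        self.warnings = warnings


def rollback_on_curation_status(load_method):
    """Stand-in for the load_data wrapper: commit a normal load, but roll
    back whenever a CurationStatus is raised."""

    @wraps(load_method)
    def wrapper(conn, rows, curate=False):
        try:
            load_method(conn, rows, curate=curate)
            conn.commit()
            return None
        except CurationStatus as status:
            conn.rollback()
            return status

    return wrapper


@rollback_on_curation_status
def load_data(conn, rows, curate=False):
    warnings = []
    for name in rows:
        conn.execute("INSERT INTO compound (name) VALUES (?)", (name,))
        if curate:
            # Buffer a warning for every created record so a curator reviews it
            warnings.append(f"New compound: {name}")
    if curate:
        raise CurationStatus(created=len(rows), warnings=warnings)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compound (name TEXT UNIQUE)")
conn.commit()
status = load_data(conn, ["lactate"], curate=True)
count_after_curate = conn.execute("SELECT COUNT(*) FROM compound").fetchone()[0]
load_data(conn, ["lactate"])
count_after_load = conn.execute("SELECT COUNT(*) FROM compound").fetchone()[0]
```

The curate-mode call reports what would have been created but leaves the table empty; the same call without curate commits the row.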

`study_curater.py`

Any code/methods in common with DataValidationView may be pulled out into a separate class for re-use by this class.
This will be similar to the DataValidationView, in that it will:
- Add missing data from the database
- Output 2 versions of the study doc: a study-specific xlsx file and a study-common/consolidated xlsx file
- Apply error comments to cells and create an Errors sheet
- Color errors and warnings
It will also:
- Color new rows
- Fill in rows that may have been removed by the user, or that were introduced between submission and loading (by another study load)
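
The split and the row fill-in can be sketched on plain dicts before any xlsx handling is involved. This is a minimal sketch, assuming each sheet is a list of row dicts matched on a `name` column; `split_study_doc`, the `status` values, and the sheet names other than "Compounds" are hypothetical, and the tissue names are illustrative only.

```python
# Assumed sheet names; only "Compounds" is named as a sheet in this issue
CONSOLIDATED_SHEETS = ("Compounds", "Tissues", "Treatments")


def split_study_doc(sheets, db_rows, key="name"):
    """Split sheets into (study_specific, consolidated) dicts.

    sheets:  {sheet name: [row dict, ...]} as submitted
    db_rows: {sheet name: [row dict, ...]} already in the database
    Each consolidated row gets a "status" of "new", "existing", or "restored".
    """
    study_specific = {
        name: rows for name, rows in sheets.items() if name not in CONSOLIDATED_SHEETS
    }
    consolidated = {}
    for name in CONSOLIDATED_SHEETS:
        in_db = db_rows.get(name, [])
        submitted = sheets.get(name, [])
        known = {row[key] for row in in_db}
        seen = {row[key] for row in submitted}
        # Flag submitted rows the database does not have yet
        out = [
            {**row, "status": "existing" if row[key] in known else "new"}
            for row in submitted
        ]
        # Add back database rows that the submission no longer contains
        out += [{**row, "status": "restored"} for row in in_db if row[key] not in seen]
        consolidated[name] = out
    return study_specific, consolidated


sheets = {
    "Samples": [{"name": "s1"}],
    "Tissues": [{"name": "liver"}, {"name": "spleen"}],
}
db_rows = {"Tissues": [{"name": "liver"}, {"name": "brain"}]}
study, consolidated = split_study_doc(sheets, db_rows)
```

In this example "spleen" is flagged as new, "brain" is restored from the database, and the Samples sheet ends up only in the study-specific output. The status column is what a coloring step could key on.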

`load_study.py`

This script should actually NOT change, but I'm including it here to explain its outcome: when curate and validate are both False (the StudyLoader default), it will never load the consolidated data.
I might add a --all option to load both curated/consolidated and study-specific data, but that's optional in this design.

`curate_study.py`

This will be exactly the same as load_study.py, except that it will set the curate option of the derived StudyLoader constructor to True.

`load_curated.py`

This will be exactly the same as load_study.py, except that it will set the load_curated option of the derived StudyLoader constructor to True.

`load_table.py`

This file should actually NOT change, but I'm including it here to explain its outcome: when curate and validate are both False (the StudyLoader default), it will never load the consolidated data.

Tests

Unit test every new class/method

mneinast commented 2 months ago

Here's a relevant vignette from testing phase 1 of submission process that might highlight a way for users to update compound synonyms:

I tried uploading data containing compound names that do not exist as synonyms in tracebase (but the primary compound does exist). The validate page caught this. I then checked the Compounds tab in the study doc and found these problematic compounds: they were missing an HMDB ID and the synonyms field was empty.

I decided to modify the Columns page in my Study Doc: for each of these new synonyms, I added the correct HMDB ID.

From the user's perspective this is an easy way to identify the compound unambiguously. But I can see that this would create problems for the database (unless something is created that specifically handles this). The validator found these and threw an error:

PASSED: Samples Check
PASSED: Peak Annotation Files Check
PASSED: Tissues Check
PASSED: Treatments Check
PASSED: Compounds Check
FAILED: col013b_study doc_240812.xlsx

ConflictingValueErrors
Conflicting values encountered during loading:
    During the processing of sheet [Compounds] in col013b_study doc_240812.xlsx...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'Glucose 6-phosphate', 'formula': 'C6H13O9P', 'hmdb_id': 'HMDB0001401'} (on row(s): 3)
        Database record: {'name': 'glucose-6-phosphate', 'formula': 'C6H13O9P', 'hmdb_id': 'HMDB0001401'}
            [name] values differ:
            - database: [glucose-6-phosphate]
            - file:     [Glucose 6-phosphate]
        File record:     {'name': 'L-Lactic acid', 'formula': 'C3H6O3', 'hmdb_id': 'HMDB0000190'} (on row(s): 4)
        Database record: {'name': 'lactate', 'formula': 'C3H6O3', 'hmdb_id': 'HMDB0000190'}
            [name] values differ:
            - database: [lactate]
            - file:     [L-Lactic acid]

hepcat72 commented 2 months ago

BTW, I think your comment should probably be in a different issue, but ignoring that for now...

@mneinast - So I think I can accurately guess what's going on here. Here's what I expect happened.

When you ran the start page to autofill, the compound synonym didn't exist.
The compound existed, but it wasn't autofilled in the compounds sheet because the synonym search failed to find it.
So the autofill added a partial record based on the information it got from the peak annotation file (the synonym and the formula). It created new rows for that data and could fill in neither the HMDB ID nor the synonyms.
Then, when you went to validate, having added the HMDB ID (which is required to be unique), you ended up with a conflicting value error.
The solution is to add the synonyms from the Peak Annotations file to the existing compound records as synonyms instead of as separate compound records.
The problem is that the autofill doesn't provide that compound record for you to edit.

I think that the solution here is to auto-populate all existing compounds and hope that the user will see that the autofilled row is essentially a compound synonym of another row, delete that row, and add the synonym to the other record.

I'll create an issue for this.

hepcat72 commented 2 months ago

1153 created

mneinast commented 2 months ago

Yes, I agree that's what happened. Your proposed solution sounds good to me.

Princeton-LSI-ResearchComputing / tracebase

Script to extract common data from Study doc and alert curators to new data that needs checking #1049
