Update loading code to populate `MSRunSequence` and `MSRunSample` directly

lparsons commented 1 year ago

FEATURE REQUEST

Inspiration

Migration to new model

Description

Update the loading code to populate MSRunSequence and MSRunSample. Typically, we will collect one set of data for each peak annotation file, but it would be ideal if when loading we could expand that to a table with:

AccuCor file
Sample (can do prefix matching)
mzXML filename
Researcher
Date
Instrument
LC Method

Most of the time, we can use the submission input to generate a table, but this would give us the flexibility to manually generate/edit the table for more complex submissions where one peak annotation file has samples that use different LC methods, etc.
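As a rough illustration of the table described above, here is a minimal sketch of one row and of the sample prefix matching. The class name `LCMSRow`, its field names, and the helper `match_row` are hypothetical and are not taken from the TraceBase code; the longest-prefix rule is an assumption about how ambiguous prefixes might be resolved.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class LCMSRow:
    """One row of the proposed table (field names are assumptions)."""

    peak_annotation_file: str  # e.g. the AccuCor file
    sample: str  # sample name, used as a prefix of the mzXML base name
    mzxml_filename: Optional[str]
    researcher: str
    run_date: date
    instrument: str
    lc_method: str


def match_row(mzxml_basename: str, rows: List[LCMSRow]) -> Optional[LCMSRow]:
    """Return the row whose sample name is the longest prefix of the file name."""
    candidates = [r for r in rows if mzxml_basename.startswith(r.sample)]
    # Prefer the most specific (longest) sample name; None if nothing matches.
    return max(candidates, key=lambda r: len(r.sample), default=None)
```

Generating such rows from the submission input would cover the common case, while hand-editing them would cover files whose samples differ in LC method or instrument.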

Remove all references to test tag broken_until_issue712.

Alternatives

Dependencies

This issue cannot be started until the completion of the following issue(s):

709 (PR submitted)
710 (PR submitted)
711 (merged)

Comment

Branch: load_msrun_sample_sequence

ISSUE OWNER SECTION

Note, there is significant overlap between this issue and the already implemented issue #706, which implements the "table" described in the issue description. All that's needed is to populate the correct Models, remove the broken tags (broken_until_issue712), and:

Add the missing fields, like instrument, ms_data_file, and ms_raw_file (the last 2 as ArchiveFile records, saving the files)
Test for instrument, ms_data_file, and ms_raw_file in test_lcms_metadata_loading.py (see 4 TODO comments)
Set PeakGroup.msrun_sample.null = False (see requirement 3)
Uncomment code indicated in TODO in test_models.py
Refine ResearcherNotNew to take a list of researchers (see TODO in accucor_data_loader.py)
[x] Create separate issue to: Update the validation interface for the new input files

Assumptions

None

Requirements

[x] 1. None of the load scripts result in MSRun being loaded
[x] 2. Any load script that loaded MSRun must load the same data in MSRunSample and MSRunSequence instead
[x] 3. PeakGroup.msrun_sample.null must be set to False
[x] 4. Add migration for PeakGroup.msrun_sample change
[x] 5. All broken_until_issue712 test tags must be removed
[x] 6. ResearcherNotNew must take a list of researchers (see TODO in accucor_data_loader.py)
[x] 7. Polarity value processing changes decided on in the planning meeting on 12/13/2023
[x] 7.1. Add a polarity choices value: "unknown"
[x] 7.2. Parse polarity from the mzXML (if supplied) (in addition to the existing command line default and LCMS metadata column)
[x] 7.3. Polarity value precedence: mzXML file value > LCMS metadata file value > command line value > static "unknown" value
[x] 7.4. A command line default polarity value is no longer required
[x] 7.5. A default polarity should be removed from the study submission form.
[x] 7.6. Raise exception if LCMS metadata polarity value differs from what's parsed from the mzXML file (if it was supplied)
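Requirements 7.1 to 7.6 could be sketched roughly as follows. This is not the loader's actual code: `PolarityConflict`, `polarity_from_mzxml`, and `resolve_polarity` are hypothetical names, and the stored spellings "positive"/"negative" are an assumption. The only format detail relied on is that mzXML `scan` elements carry a `polarity` attribute of `+` or `-`. For simplicity the sketch reads only the first scan that declares a polarity.

```python
import xml.etree.ElementTree as ET
from typing import Optional

UNKNOWN = "unknown"  # requirement 7.1
# Assumption: the stored choices are spelled "positive" and "negative".
MZXML_POLARITY = {"+": "positive", "-": "negative"}


class PolarityConflict(Exception):
    """Hypothetical exception for requirement 7.6."""


def polarity_from_mzxml(xml_text: str) -> Optional[str]:
    """Return the polarity of the first scan that declares one, else None."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Tags may be namespaced ("{uri}scan"), so compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "scan":
            symbol = elem.get("polarity")
            if symbol in MZXML_POLARITY:
                return MZXML_POLARITY[symbol]
    return None


def resolve_polarity(
    mzxml: Optional[str] = None,
    lcms: Optional[str] = None,
    cli: Optional[str] = None,
) -> str:
    """Apply the precedence mzXML > LCMS metadata > command line > "unknown"."""
    if mzxml is not None and lcms is not None and mzxml != lcms:
        raise PolarityConflict(
            f"LCMS metadata polarity {lcms!r} differs from mzXML polarity {mzxml!r}"
        )
    for value in (mzxml, lcms, cli):
        if value is not None:
            return value
    return UNKNOWN
```

Note that in this sketch a command line value that disagrees with the mzXML value is silently overridden; only the LCMS metadata value is checked against the mzXML value, as in 7.6.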

Limitations

This issue will not cover updating the validation interface, which may require adding new broken tags to some tests.

Affected Components

A tentative list of anticipated repository items that will be changed, labeled with "add", "delete", or "change". One item per line. (Mostly, this will be a list of files.)

change: DataRepo/tests/...
change: DataRepo/utils/accucor_data_loader.py
change: DataRepo/utils/lcms_metadata_parser.py
change: DataRepo/utils/exceptions.py

DESIGN

Interface Change description

No outward interface changes compared to what was already implemented in #774.

Code Change Description

The changes should be pretty simple, and similar to the type of changes implemented already in the DataRepo/migrations/0027_msrun_to_msrunsample_msrunsequence.py file in #804. The loader will do a get_or_create on MSRunSequence and MSRunSample, as the migration does, and will additionally load the instrument and (if provided) the files as ArchiveFile records.
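To make the get-or-create flow concrete without needing a configured Django project, here is a framework-free sketch. `Registry` is a hypothetical in-memory stand-in whose `get_or_create` mimics the `(object, created)` return shape of Django's `QuerySet.get_or_create`; the field names and `load_msrun_sample` are assumptions based on the fields listed in this issue, not the real model definitions.

```python
from typing import Dict, Optional, Tuple


class Registry:
    """In-memory stand-in for a model manager (illustration only)."""

    def __init__(self) -> None:
        self.records: Dict[tuple, dict] = {}

    def get_or_create(self, **fields) -> Tuple[dict, bool]:
        # The full set of field values acts as the unique key.
        key = tuple(sorted(fields.items()))
        created = key not in self.records
        if created:
            self.records[key] = {"id": len(self.records) + 1, **fields}
        return self.records[key], created


sequences = Registry()  # stands in for MSRunSequence
archive_files = Registry()  # stands in for ArchiveFile
samples = Registry()  # stands in for MSRunSample


def load_msrun_sample(
    sample: str,
    researcher: str,
    date: str,
    instrument: str,
    lc_method: str,
    ms_data_file: Optional[str] = None,
    ms_raw_file: Optional[str] = None,
) -> Tuple[dict, bool]:
    """Get or create the sequence, the optional files, then the sample record."""
    sequence, _ = sequences.get_or_create(
        researcher=researcher, date=date, instrument=instrument, lc_method=lc_method
    )

    def archive_id(filename: Optional[str]) -> Optional[int]:
        # Files are optional; only supplied ones become ArchiveFile stand-ins.
        if filename is None:
            return None
        record, _ = archive_files.get_or_create(filename=filename)
        return record["id"]

    return samples.get_or_create(
        sample=sample,
        msrun_sequence_id=sequence["id"],
        ms_data_file_id=archive_id(ms_data_file),
        ms_raw_file_id=archive_id(ms_raw_file),
    )
```

Loading the same row twice returns the existing record the second time, and samples that share researcher, date, instrument, and LC method share one sequence record.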

Tests

A test for each requirement

hepcat72 commented 8 months ago

merged

hepcat72 commented 8 months ago

@lparsons - I filled in a design and added the design:needs-review tag. I'm going to proceed with this issue (branched off branch migrate_msrun from PR #804) because it should be pretty straightforward, as most of this was already done in #774. If you see any design issues, please indicate the specific design item and highlight the specific issue and add the design:changes-requested tag and a comment.

lparsons commented 8 months ago

@hepcat72 Looks good to me. I would suggest that you create an issue to update the validation interface, since we know that will be needed. Might be a useful place to stash things you notice as you work on this.

hepcat72 commented 8 months ago

hepcat72 commented 7 months ago

@lparsons - I added requirement 7. (and sub-items) based on the discussion about the polarity value in the meeting. I think I was true to what was decided in the meeting, but let me know if you see anything untoward.

lparsons commented 7 months ago

It still seems useful to have the command line option to supply polarity. We want options for the curators. We just want to simplify the requests of the researchers.

When loading data, the polarity can be determined as follows:

If an mzXML file was supplied, parse the value from that file
If a value is supplied from the LCMS metadata file, read that value
If a value was supplied on the command line, read that value
Check that all values found match; error if they do not

When asking the researchers, do not explicitly ask for polarity. Curators may decide to supply polarity values even if we don't have an mzXML file, but we shouldn't start by asking.

So, my attempt at putting that into your format above:

Polarity can be supplied in three ways: parsed from mzXML, read from LCMS metadata file, supplied on the command line
    a. If multiple values are supplied for a given MSRun, ensure they match, error if not
    b. If no value is supplied, the default should be unknown
The study submission process should not ask for the polarity explicitly
    a. Remove polarity from the form
    b. Keep the column in the LCMS data file, but we won't explicitly ask for it, and it should be optional
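The "all supplied values must match" rule proposed in this comment could look something like the sketch below. The function name `consistent_polarity` and the use of `ValueError` are placeholders, not existing TraceBase code.

```python
from typing import Optional


def consistent_polarity(
    mzxml: Optional[str] = None,
    lcms: Optional[str] = None,
    cli: Optional[str] = None,
) -> str:
    """Return the single agreed polarity, "unknown" if none was supplied."""
    supplied = {"mzXML": mzxml, "LCMS metadata": lcms, "command line": cli}
    # Keep only the sources that actually provided a value.
    found = {source: value for source, value in supplied.items() if value is not None}
    distinct = set(found.values())
    if len(distinct) > 1:
        raise ValueError(f"Conflicting polarity values: {found}")
    return distinct.pop() if distinct else "unknown"
```

Unlike a strict precedence order, this treats every source equally, so a mismatched command line value is reported as an error instead of being overridden.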

Princeton-LSI-ResearchComputing / tracebase