Update loading code to populate `LCMethod` instead of `Protocol`

lparsons commented 1 year ago

FEATURE REQUEST

Inspiration

Migration to new model

Description

Update the loading code to populate LCMethod. Typically, we will collect one set of data for each peak annotation file, but it would be ideal if when loading we could expand that to a table with:

AccuCor file
Sample (can do prefix matching)
mzXML filename
Researcher
Date
Instrument
lc method

Most of the time, we can use the submission input to generate a table, but this would give us the flexibility to manually generate/edit the table for more complex submissions where one peak annotation file has samples that use different LC methods, etc.

Alternatives

Dependencies

This issue cannot be started until the completion of the following issue(s):

703
704
705

Comment

ISSUE OWNER SECTION

Proposal 1 (Rob)

The following section delineates my proposal for handling loading of the LCMethod data. It is based on the following observations. A lab member can import any variety of mzXML files into EL-Maven and run accucor/isocorr on the peaks picked from that process. Those mzXML files can be the product of different chromatography methods and different mass spec modes, e.g. neg/pos ion modes. (I'm not sure whether the same sample can be included from different modes, but to be safe, I will assume that it can, and that the names of those samples will have suffixes appended, like "_pos", to make them unique.) Hence, the LCMethod and mass spec mode are specific to the individual header representation of each sample. I.e. there is not one mode per accucor/isocorr file, nor is there one mode per "sample". There is one mode per "header representation of each sample", because each one is related to a single mzXML file.

Assumptions

The sample table load will provide a mapping of samples to the corresponding sample headers in all of the accucor/isocorr files, and that mapping will be loaded into the database in an MSRun* table.
The MS mode will eventually be preserved in the database, even though that is not currently planned.
The format and other unmentioned existing options will persist.

Requirements

[x] 1. Entry methods of LCMS Metadata
[x] 1.1. A command line option must be available to specify a default:
[x] 1.1.1. LCMethod name (when a method per sample header representation is not supplied)
[x] 1.1.2. Peak Annotation Filename (already exists) (use for checks)
[x] 1.1.3. Researcher (already exists)
[x] 1.1.4. Date (already exists)
[x] 1.1.5. Instrument (no place to load yet - just record the value in the code and mark with TODO comment)
[x] 1.1.6. MS Mode (e.g. positive ion mode)
[x] 1.2. A command line option for an LCMS metadata file in either CSV or XLSX format with the following columns:
[x] 1.2.1. LCMethod Type
[x] 1.2.2. LCMethod Run Length
[x] 1.2.3. LCMethod Run Description (optional, if the method already exists)
[x] 1.2.4. Peak Annotation Filename (This seemed redundant when the LCMS file is supplied alongside a peak annotation file, but the LCMS file can also be supplied with the sample table for completeness checks, so including this column makes sense and I restored it.)
[x] 1.2.5. Researcher
[x] 1.2.6. Date
[x] 1.2.7. MS Mode (e.g. positive ion mode)
[x] 1.2.8. Sample Name
[x] 1.2.9. Sample Data Header
[x] 1.2.10. mzXML Filename
[x] 1.2.11. Instrument
[x] 1.3. LCMS options' requiredness
[x] 1.3.1. LCMS metadata Options specified under 1.1. and 1.2. must be conditionally required (only necessary if not every metadata value is specified for every sample)
[x] 1.3.2. Option 1.2. must be required if the same sample has multiple sample data headers in the peak annotation files included with a sample table / single study load
[x] 2. A command line option must be available to specify all of the mzXML files
[x] 3. LCMethod records must be created by the accucor data loader
[x] 4. MSRun records must link to the newly created LCMethod records
[x] 5. msrun_protocol Protocol records must be retained by the end of this implementation (until the MS modes have migrated into a database field)
[x] 6. Exceptions are buffered in a way consistent/compatible with the validation interface
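
The conditional requiredness in 1.3.1 could be checked roughly as follows. This is only a sketch under stated assumptions: the function name `missing_lcms_defaults`, the key names in `REQUIRED_KEYS`, and the dict layouts are hypothetical and are not taken from the loader.

```python
# Hypothetical key names; the real loader may name these differently.
REQUIRED_KEYS = ("lc_method", "researcher", "date", "instrument", "ms_mode")


def missing_lcms_defaults(sample_headers, lcms_metadata, defaults):
    """Return the sorted names of defaults that still need to be supplied.

    sample_headers: headers found in the peak annotation file
    lcms_metadata:  {header: {key: value}} parsed from the LCMS metadata file
    defaults:       {key: value} from the command line options
    """
    needed = set()
    for header in sample_headers:
        row = lcms_metadata.get(header, {})
        for key in REQUIRED_KEYS:
            # A default is only needed when the file has no value for this header
            if not row.get(key) and not defaults.get(key):
                needed.add(key)
    return sorted(needed)


headers = ["sample1", "sample1_pos"]
metadata = {
    "sample1": {
        "lc_method": "polar-HILIC-25-min",
        "researcher": "Michael Nienast",
        "date": "1972-11-24",
        "instrument": "default instrument",
        "ms_mode": "Default",
    }
}
# sample1_pos is absent from the file, so each default not supplied is reported
missing = missing_lcms_defaults(headers, metadata, {"researcher": "Michael Nienast"})
print(missing)
```

When every header has a complete row in the metadata file, the function returns an empty list, which corresponds to the case where no defaults are required.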

Limitations

This work will not parse the mzXML files to extract LCMS metadata

Affected Components

change: all broken tests
change: ms_run.py
change: load_accucor_msruns.py
change: accucor_data_loader.py

DESIGN

Interface Change description

New options will be added to the loader. Example of new options:

# New options ONLY (--ms-protocol-name is just a rename of --protocol, for consistency)
$ python manage.py load_accucor_msruns \
    --ms-protocol-name "Default" \
    --lc-protocol-name "polar-HILIC-25-min" \
    --instrument "default instrument" \
    --lcms-file sample_metadata.xlsx \
    --mzxml-files mymzxmlfiles/*.xml
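
Since Django management commands register options on a standard `argparse` parser, the options above might be declared along these lines. This is a standalone sketch: `add_lcms_arguments` is a hypothetical helper, and making every option optional at the parser level (with requiredness checked later) is an assumption.

```python
import argparse


def add_lcms_arguments(parser):
    """Register the proposed options on an argparse parser (all optional here)."""
    parser.add_argument("--ms-protocol-name", default=None)
    parser.add_argument("--lc-protocol-name", default=None)
    parser.add_argument("--instrument", default=None)
    parser.add_argument("--lcms-file", default=None)
    # nargs="*" collects a shell-expanded glob into a list of paths
    parser.add_argument("--mzxml-files", nargs="*", default=[])
    return parser


parser = add_lcms_arguments(argparse.ArgumentParser(prog="load_accucor_msruns"))
args = parser.parse_args(
    ["--lc-protocol-name", "polar-HILIC-25-min", "--mzxml-files", "a.xml", "b.xml"]
)
print(args.lc_protocol_name, args.mzxml_files)
```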

A new LCMS metadata file will be able to be submitted. Example file:

$ head sample_lcms_metadata.csv
tracebase sample name  sample data header  mzxml filename   ms mode  instrument          operator         date        lc method          lc run length  lc description
sample1                sample1             sample1.xml      Default  default instrument  Michael Nienast  1972-11-24  polar-HILIC        25
sample1                sample1_pos         sample1_pos.xml  Default  default instrument  Michael Nienast  1972-11-24  polar-HILIC        25
sample2                sample2             sample2.xml      Default  default instrument  Michael Nienast  1972-11-24  mynewlcmethodtype  25             mynewlcmethodtype description, needed since it's new
sample3                sample3_neg         sample3_neg.xml  Default  default instrument  Michael Nienast  1972-11-24  polar-HILIC        30             longer polar-HILIC description, needed since it's a different run length
sample4                sample4             sample4.xml      Default  default instrument  Michael Nienast  1972-11-24  polar-HILIC        25
sample5                sample5             sample5.xml      Default  default instrument  Michael Nienast  1972-11-24  polar-HILIC        25
sample6                sample6             sample6.xml      Default  default instrument  Michael Nienast  1972-11-24  polar-HILIC        25
sample7                sample7             sample7.xml      Default  default instrument  Michael Nienast  1972-11-24  polar-HILIC        25
sample8                sample8             sample8.xml      Default  default instrument  Michael Nienast  1972-11-24  polar-HILIC        25
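
A file in the layout shown above could be indexed by sample data header with something like the following. This is a sketch, not the actual parser: `lcms_rows_by_header` is a hypothetical name, the tab delimiter is an assumption, and only a subset of the columns is included to keep the example short.

```python
import csv
import io

# Two rows in the layout of the example file, tab-delimited (assumed delimiter)
EXAMPLE = (
    "tracebase sample name\tsample data header\tmzxml filename\tlc method\tlc run length\n"
    "sample1\tsample1\tsample1.xml\tpolar-HILIC\t25\n"
    "sample1\tsample1_pos\tsample1_pos.xml\tpolar-HILIC\t25\n"
)


def lcms_rows_by_header(handle):
    """Index rows by sample data header and collect any duplicated headers."""
    rows, duplicates = {}, []
    for row in csv.DictReader(handle, delimiter="\t"):
        header = row["sample data header"]
        if header in rows:
            # Keep the first occurrence; report the repeat to the caller
            duplicates.append(header)
            continue
        rows[header] = row
    return rows, duplicates


rows, duplicates = lcms_rows_by_header(io.StringIO(EXAMPLE))
print(sorted(rows), duplicates)
```

Keying on the sample data header rather than the sample name reflects the observation that one tracebase sample (here `sample1`) can appear under several headers.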

New error types will be presented in the validation interface.

Code Change Description

New code to process the LCMS metadata csv/xls file will be added to load_accucor_msruns.py and passed to the accucor data loader in a manner similar to the processing of the accucor/isocorr files themselves. The metadata will be tracked during processing of the accucor/isocorr file and errors about missing or unused metadata will be buffered and raised en masse. If all LCMethod metadata is supplied, the name will be constructed and records will be created using get_or_create, supplying all data. If only the name is available or no description is provided, the method record will only be retrieved and an error will be buffered/raised if not found. In order to continue with processing, the unknown LCMethod record will be used until the completion of the full load (same as / consistent with the existing loading mechanisms).
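
The create-or-retrieve behavior described above can be sketched without the ORM as follows. Everything here is illustrative: the `MissingLCMethod` exception and the dict standing in for the LCMethod table are hypothetical, and the `type-length-min` naming scheme is only inferred from the `polar-HILIC-25-min` example.

```python
class MissingLCMethod(Exception):
    """Hypothetical error for a method that is neither found nor creatable."""


def create_name(lc_type, run_length):
    # Assumed naming scheme, inferred from "polar-HILIC-25-min" in the example
    return f"{lc_type}-{run_length}-min"


def get_or_create_lc_method(store, errors, lc_type, run_length, description=None):
    """Return a method name, falling back to "unknown" and buffering an error.

    store stands in for the LCMethod table: {name: description}.
    """
    name = create_name(lc_type, run_length)
    if name in store:
        return name
    if description:
        store[name] = description  # all metadata supplied, so create the record
        return name
    errors.append(MissingLCMethod(name))  # buffer the error; do not raise yet
    return "unknown"


store = {"polar-HILIC-25-min": "existing method", "unknown": "placeholder"}
errors = []
a = get_or_create_lc_method(store, errors, "polar-HILIC", 25)
b = get_or_create_lc_method(store, errors, "mynewlcmethodtype", 25, "new method")
c = get_or_create_lc_method(store, errors, "polar-HILIC", 30)
print(a, b, c, len(errors))
```

Returning the placeholder instead of raising lets the load continue so that all problems can be reported together at the end.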

Tests

Test that these methods in lc_method do what they're supposed to:

[x] create_name
[x] get_name

Test that these methods in exceptions do what they're supposed to:

[x] exception_type_exists

Test that these methods in the accucor data loader do what they're supposed to:

[x] sample_header_to_default_mzxml
[x] check_mzxml
[x] validate_mzxmls
[x] get_missing_required_lcms_defaults
[x] lcms_defaults_supplied
[x] get_or_create_ms_protocol
[x] get_or_create_lc_protocol

Test that these methods in the sample table loader do what they're supposed to:

[x] check_lcms_samples

Test that these methods in the lcms metadata parser do what they're supposed to:

[x] lcms_df_to_dict
[x] lcms_metadata_to_samples
[x] extract_dataframes_from_lcms_xlsx
[x] lcms_headers_are_valid
[x] extract_dataframes_from_lcms_tsv

Requirements tests

1. Entry methods of LCMS Metadata
1.1. Tests
- [x] 1. Test that these options/arguments exist for the accucor load:
- LCMethod name
- Peak Annotation Filename
- Researcher
- Date
- Instrument
- MS Mode
- [x] 2. LCMS metadata file option accepts either TSV or XLSX
- [x] 3. The columns in the LCMS metadata file include
- LCMethod Type
- LCMethod Run Length
- LCMethod Run Description
- Peak Annotation Filename
- Researcher
- Date
- MS Mode (e.g. positive ion mode)
- Sample Name
- Sample Data Header
- mzXML Filename
- Instrument
[x] 1.2. Test that values missing in the LCMS metadata fall back to the defaults from 1.1.
1.3. LCMS options' requiredness
1.3.1. Tests
- [x] 1. Any missing sample header in the LCMS metadata file causes an error if not all required defaults are specified
- [x] 2. Any missing column value in the LCMS metadata file causes an error about either needing a value or supplying a default
- [x] 3. Duplicate sample data headers (assumed to belong to the same sample) cause an error
[x] 1.3.2. Test that the LCMS sample column must correspond to a unique sample in the sample table loader
[x] 2. Test that an option/arg exists for multiple mzXML files
[x] 3. Test that LCMethod records are created
[x] 4. Test that MSRun records link to LCMethod records
[x] 5. Test msrun_protocol Protocol records are created
6. Tests
- [x] 1. Test that the accucor data loader processes every row despite exceptions
- [x] 2. Test that no exceptions are repeated
- [x] 3. Test that there are no exceptions aside from the expected ones
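
The behavior exercised by the tests under item 6 (every row is processed, identical exceptions are not repeated) might look like this in miniature. The names `load_rows` and `process_row` are hypothetical, and using `ValueError` as the buffered exception type is an assumption made for the sake of a short example.

```python
def load_rows(rows, process_row):
    """Process every row, buffering each distinct error instead of stopping."""
    buffered, seen = [], set()
    processed = 0
    for row in rows:
        try:
            process_row(row)
        except ValueError as err:
            key = (type(err).__name__, str(err))
            if key not in seen:  # skip an exception identical to a buffered one
                seen.add(key)
                buffered.append(err)
        processed += 1
    return processed, buffered


def process_row(row):
    if not row.get("lc method"):
        raise ValueError("lc method is missing")


rows = [{"lc method": "polar-HILIC"}, {"lc method": ""}, {"lc method": ""}]
processed, buffered = load_rows(rows, process_row)
print(processed, [str(e) for e in buffered])
```

The caller can then raise a single aggregate error after the loop if `buffered` is non-empty.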

hepcat72 commented 1 year ago

I assume "table" means an excel spreadsheet? Does the previous issue (#705) assume this? I worked some on 705 today and I did not infer this. I suspect that another sheet in the existing Excel template would be ideal.

lparsons commented 1 year ago

I think this makes sense @hepcat72, but could you flesh out the proposal by mocking up the proposed new option/options to the load_accucor_msruns command as well as the columns in the proposed new file? I think that would help clarify this idea for me, since it's still a bit vague atm.

hepcat72 commented 1 year ago

Sure.

hepcat72 commented 1 year ago

OK @lparsons, I added examples in the Interface Change description.

lparsons commented 1 year ago

Thanks, that helps a lot. Here are a few questions to consider:

What does --ms-protocol-name refer to? I don't think there is any place this will be stored in the database.
Are the lc-protocol-name, instrument, and mzxml-files parameters optional? I'm guessing those would be used when all of the samples share the same value, correct?
That would make the lcms-file optional?
Did you intend to require an xlsx file for lcms-file or a tsv file, or both?

hepcat72 commented 1 year ago

> Thanks, that helps a lot. Here are a few questions to consider:
>
> What does --ms-protocol-name refer to? I don't think there is any place this will be stored in the database.

--ms-protocol-name is a rename of --protocol. As you may recall, when I started this design, I was confused about the exclusion of the MS mode (e.g. negative/positive ion mode). I did not update the design after we discussed it on Slack, where the outcome was that we would wait and see what the search usage would be. [Incidentally, I remain unconvinced that the effort saved by not retaining that data is worth the loss of its searchability, but be that as it may, I am aware that this option is on the outs. I just haven't removed it yet.]

> Are the lc-protocol-name, instrument, and mzxml-files parameters optional? I'm guessing those would be used when all of the samples share the same value, correct?

All of the options (which I will henceforth refer to as "defaults") that correspond to the columns in the LCMS metadata file (including lc-protocol-name, instrument, and mzxml-files) are conditionally required (/optional). Either the user provides an LCMS metadata file or they set those options. One or both are required. The defaults will be required if the LCMS metadata file only has a subset of samples (/sample data headers). If the LCMS metadata file has every sample in it, the "defaults" are not required.

I wanted the LCMS metadata file to be required only in order to map multiple different sample data headers to a single sample record in tracebase. You only need to include headers whose names differ from the sample names. If all headers are the same as in the sample table file, the LCMS metadata file can be omitted and everything works as it already does.

> That would make the lcms-file optional?

Yes. See my explanation above. The lcms file is conditionally required, along with the "default" options.

> Did you intend to require an xlsx file for lcms-file or a tsv file, or both?

It can be xlsx or csv, same as the sample/accucor files.

lparsons commented 1 year ago

OK, that sounds great, thanks for the clarification.

hepcat72 commented 1 year ago

TODO:

[x] Implement missed tests:
- [x] test_get_lcms_metadata_dict_from_file
- [x] test_check_peak_annotation_files
[x] Remove msrun protocol references
[x] Make the keys consistent in lcms_defaults and lcms_metadata
- [x] Loop on the keys in initialize_sample_names
- [x] Remove the optional variable from get_missing_required_lcms_defaults
[x] Take "mzxml_files" out of lcms_defaults as a separate member variable
[x] Add a docstring to validate_mzxmls
[x] Move exceptions to the exceptions file

Princeton-LSI-ResearchComputing / tracebase