Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

Update loading code to populate `LCMethod` instead of `Protocol` #706

Closed lparsons closed 8 months ago

lparsons commented 1 year ago

FEATURE REQUEST

Inspiration

Migration to new model

Description

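Update the loading code to populate LCMethod. Typically, we will collect one set of data for each peak annotation file, but it would be ideal if when loading we could expand that to a table with:

  • AccuCor file
  • Sample (can do prefix matching)
  • mzXML filename
  • Researcher
  • Date
  • Instrument
  • lc method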

Most of the time, we can use the submission input to generate a table, but this would give us the flexibility to manually generate/edit the table for more complex submissions where one peak annotation file has samples that use different LC methods, etc.

Alternatives

Dependencies

This issue cannot be started until the completion of the following issue(s):

Comment


ISSUE OWNER SECTION

Proposal 1 (Rob)

The following section delineates my proposal for handling the loading of the LCMethod data. It is based on the following observations. A lab member can import any variety of mzXML files into EL-Maven and run accucor/isocorr on the peaks picked from that process. Those mzXML files can be the product of different chromatography methods and different mass spec modes, e.g. neg/pos ion modes. (I'm not sure if the same sample can be included from different modes, but to be safe, I will assume that it can, and that the names of those samples will have suffixes appended, like "_pos", to make their names unique.) Hence, the LCMethod and mass spec modes are specific to the individual header representations of each sample. I.e. there's not one mode per accucor/isocorr file, nor is there one mode per "sample". There is one mode per "header representation of each sample", because each one is related to a single mzXML file.

Assumptions

Requirements

Limitations

Affected Components

DESIGN

Interface Change description

New options will be added to the loader. Example of new options:

# New options ONLY; --ms-protocol-name is just a rename of --protocol, for consistency
$ python manage.py load_accucor_msruns \
    --ms-protocol-name "Default" \
    --lc-protocol-name "polar-HILIC-25-min" \
    --instrument "default instrument" \
    --lcms-file sample_metadata.xlsx \
    --mzxml-files mymzxmlfiles/*.xml

A new LCMS metadata file will be able to be submitted. Example file:

$ head sample_lcms_metadata.csv
tracebase sample name,sample data header,mzxml filename,ms mode,instrument,operator,date,lc method,lc run length,lc description
sample1,sample1,sample1.xml,Default,default instrument,Michael Nienast,1972-11-24,polar-HILIC,25,
sample1,sample1_pos,sample1_pos.xml,Default,default instrument,Michael Nienast,1972-11-24,polar-HILIC,25,
sample2,sample2,sample2.xml,Default,default instrument,Michael Nienast,1972-11-24,mynewlcmethodtype,25,"mynewlcmethodtype description, needed since it's new"
sample3,sample3_neg,sample3_neg.xml,Default,default instrument,Michael Nienast,1972-11-24,polar-HILIC,30,"longer polar-HILIC description, needed since it's a different run length"
sample4,sample4,sample4.xml,Default,default instrument,Michael Nienast,1972-11-24,polar-HILIC,25,
sample5,sample5,sample5.xml,Default,default instrument,Michael Nienast,1972-11-24,polar-HILIC,25,
sample6,sample6,sample6.xml,Default,default instrument,Michael Nienast,1972-11-24,polar-HILIC,25,
sample7,sample7,sample7.xml,Default,default instrument,Michael Nienast,1972-11-24,polar-HILIC,25,
sample8,sample8,sample8.xml,Default,default instrument,Michael Nienast,1972-11-24,polar-HILIC,25,
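As a rough illustration of how such a file could be read, here is a minimal sketch that indexes the rows by "sample data header" and groups headers under their tracebase sample name. The function name `parse_lcms_metadata` is hypothetical and not part of the existing loader. The sketch assumes delimited text with the column names from the example header; the delimiter is a parameter because the real file may be tab- or comma-separated, and xlsx input is not handled here.

```python
import csv
import io
from collections import defaultdict

# Column names copied from the example header; treat them as assumptions.
SAMPLE_COL = "tracebase sample name"
HEADER_COL = "sample data header"


def parse_lcms_metadata(handle, delimiter=","):
    """Return ({sample data header: row}, {tracebase sample name: [headers]}).

    Hypothetical helper: one row per sample data header, since each header
    corresponds to a single mzXML file.
    """
    by_header = {}
    headers_by_sample = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        header = row[HEADER_COL].strip()
        if header in by_header:
            raise ValueError(f"Duplicate sample data header: {header!r}")
        by_header[header] = row
        headers_by_sample[row[SAMPLE_COL].strip()].append(header)
    return by_header, dict(headers_by_sample)


# Two headers (default and "_pos") that map to the same tracebase sample.
example = io.StringIO(
    "tracebase sample name,sample data header,mzxml filename,lc method,lc run length\n"
    "sample1,sample1,sample1.xml,polar-HILIC,25\n"
    "sample1,sample1_pos,sample1_pos.xml,polar-HILIC,25\n"
)
by_header, headers_by_sample = parse_lcms_metadata(example)
```

Keying on the sample data header rather than the sample name reflects the observation that the LC method and MS mode belong to each header representation of a sample.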

New error types will be presented in the validation interface.

Code Change Description

New code to process the LCMS metadata csv/xls file will be added to load_accucor_msruns.py and passed to the accucor data loader in a manner similar to the processing of the accucor/isocorr files themselves. The metadata will be tracked during processing of the accucor/isocorr file and errors about missing or unused metadata will be buffered and raised en masse. If all LCMethod metadata is supplied, the name will be constructed and records will be created using get_or_create, supplying all data. If only the name is available or no description is provided, the method record will only be retrieved and an error will be buffered/raised if not found. In order to continue with processing, the unknown LCMethod record will be used until the completion of the full load (same as / consistent with the existing loading mechanisms).
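The retrieval rules described above might look something like the following sketch. It uses an in-memory stand-in instead of the real Django `LCMethod` model, so `LCMethodStore`, `resolve_lc_method`, and `BufferedLoadErrors` are all hypothetical names. The naming scheme (`<method>-<run length>-min`) is an assumption inferred from the `polar-HILIC-25-min` example, not a confirmed convention.

```python
class BufferedLoadErrors(Exception):
    """Raised once, after the full load, carrying every buffered message."""

    def __init__(self, messages):
        super().__init__("; ".join(messages))
        self.messages = list(messages)


class LCMethodStore:
    """In-memory stand-in for the LCMethod table (NOT the real Django model)."""

    def __init__(self, records=None):
        # The placeholder record used to keep processing after a failed lookup.
        self.records = {"unknown": {"name": "unknown"}}
        self.records.update(records or {})

    def get(self, name):
        return self.records[name]  # raises KeyError when absent

    def get_or_create(self, name, **fields):
        created = name not in self.records
        if created:
            self.records[name] = {"name": name, **fields}
        return self.records[name], created


def resolve_lc_method(store, errors, lc_type, run_length, description=""):
    # Assumed naming scheme, inferred from the "polar-HILIC-25-min" example.
    name = f"{lc_type}-{run_length}-min"
    if description:
        # All metadata supplied: get or create the record.
        record, _ = store.get_or_create(
            name, type=lc_type, run_length=run_length, description=description
        )
        return record
    try:
        # No description: retrieve only.
        return store.get(name)
    except KeyError:
        errors.append(f"LCMethod {name!r} not found and no description given")
        return store.get("unknown")  # continue the load with the placeholder


store = LCMethodStore({"polar-HILIC-25-min": {"name": "polar-HILIC-25-min"}})
errors = []
existing = resolve_lc_method(store, errors, "polar-HILIC", 25)
created = resolve_lc_method(
    store, errors, "mynewlcmethodtype", 25, "mynewlcmethodtype description"
)
fallback = resolve_lc_method(store, errors, "polar-HILIC", 30)
# At the end of the full load: if errors: raise BufferedLoadErrors(errors)
```

Appending to a shared `errors` list and raising a single exception at the end is one way to let a single run report every missing method at once.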

Tests

Test that these methods in lc_method do what they're supposed to:

Test that these methods in exceptions do what they're supposed to:

Test that these methods in the accucor data loader do what they're supposed to:

Test that these methods in the sample table loader do what they're supposed to:

Test that these methods in the lcms metadata parser do what they're supposed to:

Requirements tests

hepcat72 commented 10 months ago

I assume "table" means an excel spreadsheet? Does the previous issue (#705) assume this? I worked some on 705 today and I did not infer this. I suspect that another sheet in the existing Excel template would be ideal.

Update the loading code to populate LCMethod. Typically, we will collect one set of data for each peak annotation file, but it would be ideal if when loading we could expand that to a table with:

  • AccuCor file
  • Sample (can do prefix matching)
  • mzXML filename
  • Researcher
  • Date
  • Instrument
  • lc method

Most of the time, we can use the submission input to generate a table, but this would give us the flexibility to manually generate/edit the table for more complex submissions where one peak annotation file has samples that use different LC methods, etc.

lparsons commented 10 months ago

I think this makes sense @hepcat72, but could you flesh out the proposal by mocking up the proposed new option/options to the load_accucor_msruns command as well as the columns in the proposed new file? I think that would help clarify this idea for me, since it's still a bit vague atm.

hepcat72 commented 10 months ago

Sure.

hepcat72 commented 10 months ago

OK @lparsons, I added examples in the Interface Change description.

lparsons commented 10 months ago

Thanks, that helps a lot. Here are a few questions to consider:

hepcat72 commented 10 months ago

Thanks, that helps a lot. Here are a few questions to consider:

  • What does the ms-protocol-name refer to? I don't think there is any place this will be stored in the database.

--ms-protocol-name is a rename of --protocol, and as you may recall, at the time that I started this design, I was confused about the exclusion of the MS mode (e.g. negative/positive ion mode). I did not update this design after we had the opportunity to discuss it on slack. The result of that discussion was that we would wait and see what the search usage would be. [Incidentally, I remain unconvinced that the effort saved by not retaining that data is worth the loss of its searchability, but be that as it may, I am aware that this option is on the outs. I just haven't done it.]

  • Are the lc-protocol-name, instrument, and mzxml-files parameters optional? I'm guessing those would be used when all of the samples share the same value, correct?

All of the options (which I will henceforth refer to as "defaults") that correspond to the columns in the LCMS metadata file (including lc-protocol-name, instrument, and mzxml-files) are conditionally required (/optional). Either the user provides an LCMS metadata file or they set those options. One or both are required. The defaults will be required if the LCMS metadata file only has a subset of samples (/sample data headers). If the LCMS metadata file has every sample in it, the "defaults" are not required.

I wanted the LCMS metadata file to only be required in order to map multiple different sample data headers to a single sample record in tracebase. You only need to include the headers whose names differ from the sample names. If all headers are the same as in the sample table file, the LCMS metadata file can be omitted and everything would work like it already does.
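To make the "conditionally required" rule concrete, here is a minimal sketch of the check. The function `missing_lcms_inputs` and the option keys are hypothetical; only the three options named above are checked, on the assumption that the other defaults would follow the same pattern.

```python
# Hypothetical keys mirroring --lc-protocol-name, --instrument and --mzxml-files.
DEFAULT_OPTIONS = ("lc_protocol_name", "instrument", "mzxml_files")


def missing_lcms_inputs(data_headers, lcms_headers, defaults):
    """Return error messages; an empty list means the inputs are sufficient.

    data_headers: sample data headers found in the accucor/isocorr file
    lcms_headers: headers listed in the LCMS metadata file (empty if omitted)
    defaults:     dict of the command-line "default" options that were set
    """
    uncovered = sorted(set(data_headers) - set(lcms_headers))
    if not uncovered:
        # Every header is described by the LCMS file: no defaults needed.
        return []
    unset = [opt for opt in DEFAULT_OPTIONS if not defaults.get(opt)]
    return [
        f"--{opt.replace('_', '-')} is required for headers "
        f"missing from the LCMS file: {uncovered}"
        for opt in unset
    ]


all_defaults = {
    "lc_protocol_name": "polar-HILIC-25-min",
    "instrument": "default instrument",
    "mzxml_files": ["sample1.xml", "sample2.xml"],
}

# LCMS file covers every header, so no defaults are needed.
covered = missing_lcms_inputs(["sample1", "sample1_pos"], ["sample1", "sample1_pos"], {})
# No LCMS file at all, but every default is set.
defaults_only = missing_lcms_inputs(["sample1", "sample2"], [], all_defaults)
# Partial LCMS file and a missing --instrument default.
partial = missing_lcms_inputs(
    ["sample1", "sample2"],
    ["sample1"],
    {"lc_protocol_name": "polar-HILIC-25-min", "mzxml_files": ["sample2.xml"]},
)
```

Returning messages instead of raising immediately fits the buffered-error approach described in the design.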

  • That would make the lcms-file optional?

Yes. See my explanation above. The lcms file is conditionally required with the "default" options.

  • Did you intend to require an xlsx file for lcms-file or a tsv file, or both?

It can be xlsx or csv, same as the sample/accucor files.

lparsons commented 10 months ago

OK, that sounds great, thanks for the clarification.

hepcat72 commented 9 months ago

TODO: