Closed hepcat72 closed 2 months ago
MSRuns Loader branch Tests
DataRepo/loaders/msruns_loader.py (`MSRunsLoader`)
DataRepo/tests/views/upload/test_validation.py
DataRepo/management/commands/load_msruns.py
DataRepo/management/commands/load_table.py
DataRepo/models/archive_file.py
DataRepo/models/peak_group.py
DataRepo/utils/exceptions.py
RequiredOptions
DataRepo/models/utilities.py
update_rec
FEATURE REQUEST
Inspiration
The accucor data loader is bloated. Breaking it up will simplify and streamline both the study submission process and the loading code. Consolidating all of the LCMS metadata into the new Study Excel doc will also make submission easier for users.
Description
Add an LCMS metadata details tab to the study excel doc and modify the accucor data loader to use it, the sequences tab, and the defaults tab.
Alternatives
None
Dependencies
Comment
I created an example version of the Study Excel doc:
~animal_sample_table.xlsx~
study.xlsx
Note, the following should either be incorporated into this issue or split into another issue...
It was debugged on `tracebase` branch `raw_abund_and_tme_errors` using rabinowitz data containing the issue in a branch by the same name. It can be used to further test and debug with the following command:
From a slack message:
ISSUE OWNER SECTION
Assumptions
None
Limitations
None
Affected Components
load_accucor_msruns.py
accucor_data_loader.py
Requirements
1. Should be able to load mzXML files independently. See #906.
2. Every mzXML file must be associated with a (database) `Sample` and an `MSRunSequence`
3. ~`polarity`, `mz_min`, and `mz_max` are not required to be specified manually, but are parsed from the file. (Note, polarity and the mz scan range only exist to differentiate `MSRunSample` records when the `mzXML` files are not supplied.)~
3. Only 1 placeholder record will be allowed for any sample/sequence combo
3.1. `polarity`, `mz_min`, and `mz_max` will not be specified in a file or via command line option. They will only be parsed from the mzXML file(s).
3.2. Placeholder records will not have these values defined since multiple peak groups that link to them can be composed of data coming from multiple mzXML files that can have different such values.
The requirements from #753:
4.3. Be split up into:
4.3.1. SequencesLoader (taking the "Sequences" sheet)
4.3.2. MSRunsLoader (taking the "Peak Annotation Details" sheet and/or a list of mzXML files)
4.3.3. PeakAnnotationsLoader (taking the "Peak Annotation Files" and "Peak Annotation Details" sheet)
DESIGN
Interface Change description
Note that I no longer think a Peak Annotation Files tab is necessary. The sample prefix is (or should be) embedded in the sample name column of the Peak Annotation Details tab, the accucor/isocorr format can now be detected automatically, and the peak annotation file names themselves will be in the Peak Annotation Details tab anyway, so no other associated data for those is necessary.
The current accucor loader will be broken up into the following separate loaders:
Sequences Loader
Already Done. See #824.
peak annotations loader
The (currently named) accucor loader will take either the study doc containing a "Peak Annotation Details" sheet or a tsv file (currently referred to in the codebase as an "LCMS metadata file") and a peak annotations file. The accucor loader will no longer take a series of mzXML files (as it currently does). The accucor data loader will no longer load the `MSRunSample` model (as it currently does). It will only link created peak groups to existing `MSRunSample` records.
Users will record whether samples are blanks in the Peak Annotation Details sheet/file using an optional extra "Skip" column. Such samples will not need to be in the Samples sheet - only in the Peak Annotation Details sheet. The sample name column is required (even though blanks won't go into the database). The peak annotation file column will be required, as will the sample data header and sequence (same as is required for every other row). All of this will be auto-populated using the "Build a Submission" page.
There will be a Peak Annotation Files sheet (that is not loaded into the database). Although the format should be able to be automatically determined, there will be a file type column (accucor/isocorr/isoautocorr). An additional Skip column will be added (if for any reason, a file included in the study is not ready to be loaded). And there is no need for a sample prefix, ~though I could add a blank samples column (comma-delimited) as an alternative to including them in the Peak Annotation Details sheet~.
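A minimal sketch of how the optional "Skip" column might be honored when reading Peak Annotation Details rows. The column header strings, the set of cell values treated as "skip", and the function names are all assumptions made for illustration; none of them are taken from the TraceBase codebase.

```python
# Hypothetical column headers; the real sheet headers may differ.
REQUIRED_COLUMNS = (
    "Sample Name",
    "Sample Data Header",
    "Peak Annotation File Name",
    "Sequence Name",
)
SKIP_COLUMN = "Skip"
# Assumed spellings of a filled-in Skip cell.
TRUTHY = {"true", "yes", "y", "1", "skip", "x"}


def is_skipped(row):
    """Return True when the optional Skip cell holds a truthy value."""
    value = row.get(SKIP_COLUMN)
    return str(value if value is not None else "").strip().lower() in TRUTHY


def partition_rows(rows):
    """Split rows into (to_load, skipped, errors).

    Required columns are checked on every row, including skipped blanks,
    because the sheet requires them even for rows that are never loaded.
    """
    to_load, skipped, errors = [], [], []
    for rownum, row in enumerate(rows, start=2):  # row 1 is the header row
        missing = [
            col for col in REQUIRED_COLUMNS if not str(row.get(col) or "").strip()
        ]
        if missing:
            errors.append((rownum, missing))
        elif is_skipped(row):
            skipped.append(row)
        else:
            to_load.append(row)
    return to_load, skipped, errors


def make_row(sample, header, skip=None):
    """Build an example row (illustrative data only)."""
    return {
        "Sample Name": sample,
        "Sample Data Header": header,
        "Peak Annotation File Name": "example_accucor.xlsx",
        "Sequence Name": "example sequence",
        "Skip": skip,
    }


rows = [make_row("s1", "s1"), make_row("blank1", "blank1", "yes"), make_row("s2", "")]
to_load, skipped, errors = partition_rows(rows)
print([r["Sample Name"] for r in to_load], [r["Sample Name"] for r in skipped], errors)
```

Skipped rows are returned rather than discarded so that a caller could still report which blanks were ignored.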
mzXML / MSRunSample loader
A separate loader (tentatively named `MSRunsLoader`/`msruns_loader.py`/`load_msruns.py`), specifically for the `MSRunSample` (and `ArchiveFile`) models will be created that takes mzXML files and creates `MSRunSample` records for them (if necessary). `Sample` and `MSRunSequence` records must already exist (see the sequences and sample table loaders). This loader will largely ignore the peak annotation file names. A "Peak Annotation Details" (sheet in `study.xlsx` or separate tsv file) will be optional, as the relationship between mzXML and Sample can often be automatically determined. The script will use either the sample header (whose name typically matches the peak annotation header) or the mzXML name itself to match to a database sample record. The sample name column will be assumed to contain any prefix assigned to the sample. And the sheet will contain a `Sequence Name` column (as opposed to the `Sequence Number` referred to in requirement 8.10.1.6. which will be composed of the composite key (comma-delimited) that will be used to identify the associated sequence record). This follows with established design patterns for all other loaders. The features established in #948 and those described in #753 and #829 will be used to help the user fill in the Sequence Name column (e.g. a dropdown in excel, populated by the Sequence Name column in the sequences sheet as of #948 will be used to populate the column of the same name in the Peak Annotation Details sheet). In the case of a tsv, it will have to be manually populated.
Code Change Description
Sequences Loader
Already Done. See #824.
peak annotations loader
`PeakAnnotationsLoader`
`load_data` method called. Instead, a method will be added called `get_data` that will return a dict of dicts keyed on annotation file. The dict it returns will be supplied to this loader's constructor.
A `get_data` method will also be added to the `msruns_loader`. The dict it returns will be supplied to this loader's constructor.
The `load_data` method will cycle through all peak annotation files from the dict from the Peak Annotation Files sheet and through all the samples from the dict from the Peak Annotation Details sheet to create peak group and peak data records.
The derived classes should be very small. They will pretty much do what the current accucor loader does, only using the overall loading design established by the other loaders.
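As a rough illustration of the `get_data` hand-off described above, the sketch below groups sheet rows into a dict of dicts keyed on annotation file and passes the result to a loader's constructor. The class name `PeakAnnotationsLoaderSketch`, the row keys, the `skip` flag, and the tuples returned by `load_data` are hypothetical stand-ins; a real loader would create peak group and peak data records instead.

```python
from collections import defaultdict


def get_details_data(rows):
    """Group details rows into {annotation_file: {sample_header: info}}."""
    data = defaultdict(dict)
    for row in rows:
        data[row["file"]][row["header"]] = {
            "sample": row["sample"],
            "sequence": row["sequence"],
        }
    return dict(data)


class PeakAnnotationsLoaderSketch:
    """Hypothetical stand-in for a loader that receives pre-parsed sheet data."""

    def __init__(self, files_data, details_data):
        self.files_data = files_data  # {annotation_file: {"format": ..., "skip": ...}}
        self.details_data = details_data  # output of get_details_data()

    def load_data(self):
        """Cycle through every annotation file, then every sample listed for it."""
        planned = []
        for annot_file, meta in self.files_data.items():
            if meta.get("skip"):
                continue  # file is marked as not ready to be loaded
            for header, info in self.details_data.get(annot_file, {}).items():
                # A real loader would create peak group/peak data records here.
                planned.append((annot_file, header, info["sample"], info["sequence"]))
        return planned


details_rows = [
    {"file": "a.xlsx", "header": "s1_pos", "sample": "s1", "sequence": "seq1"},
    {"file": "a.xlsx", "header": "s2_pos", "sample": "s2", "sequence": "seq1"},
    {"file": "b.xlsx", "header": "s3", "sample": "s3", "sequence": "seq2"},
]
files_data = {
    "a.xlsx": {"format": "accucor"},
    "b.xlsx": {"format": "isocorr", "skip": True},
}
loader = PeakAnnotationsLoaderSketch(files_data, get_details_data(details_rows))
planned = loader.load_data()
print(planned)
```

Keeping the sheet parsing in `get_data`-style functions means the loader itself only iterates over plain dicts, which is one way the derived classes could stay small.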
Skeleton Example:
mzXML / MSRunSample loader
The loader class will inherit from `TableLoader` and define concrete class attributes for the abstract ones defined in `TableLoader`. The following are a few of the settings that will be configured in those attributes:
The load script will inherit from `LoadTableCommand` and add options for the following inputs:
~`--researcher`, `--date`, `--lc-protocol-name`, `--instrument`~
Rob, 4/19/2024: I realized that, in keeping with the established curator patterns, the options specified by the curator should be recorded in the study itself, so instead of command line options, we will use either the existing `--defaults-file` or `--defaults-sheet` to convey these default values.
Option requirements:
`--infile` (an option added by `LoadTableCommand`, will be made to be conditionally required) containing/using
`--defaults-sheet` option) and
`--defaults-sheet` option)
~`--researcher`, `--date`, `--lc-protocol-name`, `--instrument`~ (see note: Rob, 4/19/2024, above)
`--defaults-file` specifying defaults for the Operator, Instrument, LC Protocol Name, and Date.
If both `--infile` and the ~above options (any of `--researcher`, `--date`, `--lc-protocol-name`, `--instrument`)~ defaults are supplied, the ~option~ default values are treated as defaults, as is already supported by `LoadTableCommand`. This also means that those ~option~ defaults are supported using the `Defaults` sheet of the `study.xlsx` doc.
The concrete implementation of the abstract `load_data` method will:
1. `ArchiveFile` records. Extract data from the mzXML files and store in a 3D dict like: `{mzXML_name: {archive_file_record_id: {"record": record, "polarity": parsed_polarity, "mz_min": parsed_mz_min, "mz_max": parsed_mz_max}}}`
2. `infile` and create `MSRunSample` records, associating them with the `mzXML` `ArchiveFile` records and `Sample` and `MSRunSequence` records along the way (keeping track of which `mzXML`s have been added to `MSRunSample` records)
3. `mzXML`/`ArchiveFile` records unassociated with those processed in step 2, using: `--researcher`, `--date`, `--lc-protocol-name`, `--instrument` supplied (if any were not supplied, error).
Tests
A test for each requirement
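As one possible shape for such a test, the sketch below covers requirement 3.1 (`polarity`, `mz_min`, and `mz_max` are parsed from the mzXML file). `parse_mzxml_metadata` is a hypothetical helper written for this example, not the project's parser, and the inline XML is a tiny hand-written stand-in for a real file. It assumes the values can be read from the `polarity`, `lowMz`, and `highMz` attributes of `scan` elements, and treating mixed polarities within one file as an error is likewise an assumption.

```python
import unittest
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for an mzXML file (real files are namespaced and far larger).
SAMPLE_MZXML = """<mzXML>
  <msRun>
    <scan num="1" msLevel="1" polarity="+" lowMz="70.0" highMz="1000.0"/>
    <scan num="2" msLevel="1" polarity="+" lowMz="75.5" highMz="900.0"/>
  </msRun>
</mzXML>"""


def parse_mzxml_metadata(xml_text):
    """Hypothetical helper: derive polarity, mz_min, and mz_max from scan attributes."""
    polarities, lows, highs = set(), [], []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag.split("}")[-1] != "scan":  # ignore any XML namespace prefix
            continue
        if "polarity" in elem.attrib:
            polarities.add(elem.attrib["polarity"])
        if "lowMz" in elem.attrib:
            lows.append(float(elem.attrib["lowMz"]))
        if "highMz" in elem.attrib:
            highs.append(float(elem.attrib["highMz"]))
    if len(polarities) > 1:
        # Assumption: a single file is expected to have a single polarity.
        raise ValueError(f"Multiple polarities found: {sorted(polarities)}")
    return {
        "polarity": polarities.pop() if polarities else None,
        "mz_min": min(lows) if lows else None,
        "mz_max": max(highs) if highs else None,
    }


class MzxmlMetadataTests(unittest.TestCase):
    def test_values_are_parsed_from_file(self):
        self.assertEqual(
            parse_mzxml_metadata(SAMPLE_MZXML),
            {"polarity": "+", "mz_min": 70.0, "mz_max": 1000.0},
        )

    def test_mixed_polarity_is_an_error(self):
        mixed = SAMPLE_MZXML.replace(
            'num="2" msLevel="1" polarity="+"', 'num="2" msLevel="1" polarity="-"'
        )
        with self.assertRaises(ValueError):
            parse_mzxml_metadata(mixed)
```

Saved as a module, the test case can be run with `python -m unittest`; in the actual project it would presumably live alongside the existing Django tests and exercise the real loader instead of this stand-in.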