Modify the accucor loader to use the study doc tabs instead of the LCMS metadata file

FEATURE REQUEST

Inspiration

The accucor data loader is bloated. Breaking it up will simplify the loading code, and consolidating all of the LCMS metadata into the new Study excel doc will streamline the study submission process for users.

Description

Add an LCMS metadata details tab to the study excel doc and modify the accucor data loader to use it, the sequences tab, and the defaults tab.

Alternatives

None

Dependencies

Parent: #753
Required: #824

Comment

I created an example version of the Study Excel doc:

~animal_sample_table.xlsx~

study.xlsx

Note, the following should either be incorporated into this issue or split into another issue...

It was debugged on tracebase branch raw_abund_and_tme_errors using rabinowitz data containing the issue in a branch by the same name. It can be used to further test and debug with the following command:

python manage.py load_study ../tracebase-rabinowitz-data/exp027f_long_KLI_timecourse/loading.yaml

From a slack message:

FWIW, I'm just "live-tweeting" at this point, but the transaction management errors are in fact coming from subsequent rows of the same sample, i.e. they are coming from different iterations of the inner loop. You can tell that from the status prints...
Reading Accucor file: /Users/rleach/PROJECT-local/TRACEBASE/tracebase-rabinowitz-data/exp027f_long_KLI_timecourse/220909_exp027f4_free_plasma_and_tissues/exp027f4_free_plasma_and_tissues_pos_low_mz_corrected.xlsx

...

      Inserting peak data for urea:label-0 for sample exp027f4_free_M02_plasma

...
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase2/DataRepo/utils/accucor_data_loader.py", line 1831, in load_data
    peak_data.save()
...
EXCEPTION2(ERROR): ValueError: Field 'raw_abundance' expected a number but got ''.

      Inserting peak data for urea:label-1 for sample exp027f4_free_M02_plasma

...
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase2/DataRepo/utils/accucor_data_loader.py", line 1830, in load_data
    peak_data.full_clean()
...
EXCEPTION3(ERROR): TransactionManagementError: An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block.

...
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase2/DataRepo/utils/accucor_data_loader.py", line 1753, in load_data
    peak_group_label_rec = peak_group.labels.first()
...
EXCEPTION4(ERROR): TransactionManagementError: An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block.
So the progression is:

urea:label-0 for sample exp027f4_free_M02_plasma hits the raw_abundance ValueError (in the "Original" sheet) and fails the save

urea:label-1 has the same empty value issue, but that manifests as a TME when it tries to clean (before saving)

The last TME is the next iteration as well (but before the status print) and has an exception when it tries to retrieve a label

It's interesting because the raw abundance field is null=True. I also think that this is an issue that has been around since the first iteration of TraceBase, only it manifests a little differently given the attempt to keep going to find more issues. I bet that since the exception was raised during the save, subsequent attempts to clean subsequent records produce the TMEs.

So I think there is technically something broken (regardless of whether we fix this issue by skipping that sample). I think what should happen is that an empty cell in the raw_abundance column should evaluate to None instead of an empty string, because it's a numeric field with nulls allowed. Then everything should be able to proceed.
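A minimal sketch of that conversion, assuming cell values arrive as raw strings or floats read from a sheet. The helper name `empty_to_none` is hypothetical and is not the `get_row_val` method from the PR:

```python
import math


def empty_to_none(value):
    """Return None for empty cells; return any other value unchanged.

    Hypothetical helper for illustration only. Empty strings,
    whitespace-only strings, and NaN are all treated as empty cells.
    """
    if value is None:
        return None
    if isinstance(value, str) and value.strip() == "":
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


# An empty raw_abundance cell becomes None, which a numeric field
# declared with null=True can store.
row = {"compound": "urea", "raw_abundance": ""}
raw_abundance = empty_to_none(row["raw_abundance"])
```

Note that a legitimate zero is preserved, since only empty strings, NaN, and None are mapped to None.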

There are multiple pieces of code in my current PR that, when applied to the accucor loading code, will fix this issue.

The get_row_val method always converts empty strings to None.

The new required column value feature will address similar issues (but not this one, since null=True).

The exception handling in the new PR provides file context, so you can find the offending data.

The new code automatically skips data that produced an error.

I implemented a way to define the column types in the loader class so that if we encounter an error like this (but not an empty string), we'll get a better, more relevant, error.

ISSUE OWNER SECTION

Assumptions

None

Limitations

None

Affected Components

change: load_accucor_msruns.py
change: accucor_data_loader.py

Requirements

[x] 1. Should be able to load mzXML files independently. See #906.
[x] 2. Every mzXML file must be associated with a (database) Sample and an MSRunSequence
~[ ] 3. polarity, mz_min, and mz_max are not required to be specified manually, but are parsed from the file. (Note, polarity and the mz scan range only exist to differentiate MSRunSample records when the mzXML files are not supplied.)~
[x] 3. Only 1 placeholder record will be allowed for any sample/sequence combo
- [x] 3.1. polarity, mz_min, and mz_max will not be specified in a file or via command line option. They will only be parsed from the mzXML file(s).
- [x] 3.2. Placeholder records will not have these values defined since multiple peak groups that link to them can be composed of data coming from multiple mzXML files that can have different such values.

The requirements from #753:

[x] 4. The accucor data loader will

[x] 4.1. Take the study doc instead of an LCMS Metadata file

~[ ] 4.1.1. Merge the Peak Annotation Details and Sequences sheet~ This is now handled using a comma-delimited "Sequence Name" column, populated using drop-downs

~[ ] 4.1.2. Re-use the LCMS metadata processing code with the new merged sheets~

[x] ~4.2. Use the defaults tab instead of command line options~ Use either the defaults tab OR command line options(/loader constructor arguments.) (Lance added a requirement to be able to load mzXML files without an infile.)

[x] 4.3. Be split up into:
[x] 4.3.1. SequencesLoader (taking the "Sequences" sheet)
[x] 4.3.2. MSRunsLoader (taking the "Peak Annotation Details" sheet and/or a list of mzXML files)
[x] 4.3.3. PeakAnnotationsLoader (taking the "Peak Annotation Files" and "Peak Annotation Details" sheet)

[x] 8.9. Peak Annotation Files Tab (I realized that this sheet may be (partially?) unnecessary. At the least, the file type is unnecessary, since the type could likely be derived, and the sample name prefix is unnecessary given that the sample names are directly associated with headers in the peak annotation details sheet. I just haven't decided whether the list of files can (or should) be derived from the Peak Annotation Details sheet. If the Peak Annotation Details sheet is optional, then we would need a Peak Annotation Files sheet. At the very least, the sheet can be completely and automatically populated.)

[x] 8.9.1. Add Columns

[x] 8.9.1.1. Peak Annotation File Name

~[ ] 8.9.1.2. Peak Annotation File Type~

~[ ] 8.9.1.3. Sample Name Prefix~

[x] 8.10. Peak Annotation Details Tab

[x] 8.10.1. Add Columns

[x] 8.10.1.1. Sample Name

[x] 8.10.1.2. Sample Data Header

[x] 8.10.1.3. mzXML File Name

[x] 8.10.1.4. Peak Annotation File Name

[x] 8.10.1.5. Polarity

[x] 8.10.1.6. Sequence Number

DESIGN

Interface Change description

Note that I no longer think a Peak Annotation Files Tab is necessary. Sample prefix is/should-be embedded in the sample name column of the Peak Annotation Details tab, accucor/isocorr format can-be(/is) detected automatically now, and the peak annotation file names themselves will be in the Peak Annotation Details tab anyway - no other associated data for those is necessary.
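As a rough sketch of what automatic format detection could look like, assuming purely for illustration that accucor workbooks are the ones carrying an "Original" sheet and isocorr workbooks the ones carrying an "absolute" sheet (the two sheet names that appear elsewhere in this issue). The function name is hypothetical and the real detection logic may differ:

```python
def guess_peak_annotation_format(sheet_names):
    """Guess a peak annotation file format from its sheet names.

    ASSUMPTION (illustration only): an "absolute" sheet implies isocorr
    and an "Original" sheet implies accucor. Returns None if neither
    sheet is present.
    """
    names = {str(name).strip().lower() for name in sheet_names}
    if "absolute" in names:
        return "isocorr"
    if "original" in names:
        return "accucor"
    return None


detected = guess_peak_annotation_format(["Original", "Corrected"])
```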

The current accucor loader will be broken up into the following separate loaders:

Sequences Loader

Already Done. See #824.

Peak Annotations Loader

The (currently named) accucor loader will take either the study doc containing a "Peak Annotation Details" sheet or a tsv file (currently referred to in the codebase as an "LCMS metadata file") and a peak annotations file. The accucor loader will no longer take a series of mzXML files (as it currently does). The accucor data loader will no longer load the MSRunSample model (as it currently does). It will only link created peak groups to existing MSRunSample records.

Users will record whether samples are blanks in the Peak Annotation Details sheet/file using an optional extra "Skip" column. Such samples will not need to be in the Samples sheet - only in the Peak Annotation Details sheet. The sample name column is required (even though blanks won't go into the database). The peak annotation file column will be required, as will the sample data header and sequence (same as is required for every other row). All of this will be auto-populated using the "Build a Submission" page.
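A minimal sketch of how rows might be separated on the "Skip" column, assuming rows are plain dicts keyed on column header and assuming (this is not specified above) that any non-empty value in the column marks the row as skipped. The function name is hypothetical:

```python
def split_skipped(rows, skip_col="Skip"):
    """Separate rows to load from rows flagged in the Skip column.

    ASSUMPTION: any non-empty Skip value flags the row (e.g. a blank
    sample). Rows without the column at all are always kept.
    """
    kept, skipped = [], []
    for row in rows:
        value = row.get(skip_col)
        if value is not None and str(value).strip() != "":
            skipped.append(row)
        else:
            kept.append(row)
    return kept, skipped


rows = [
    {"Sample Name": "M02_plasma", "Skip": ""},
    {"Sample Name": "blank_1", "Skip": "skip"},
    {"Sample Name": "M03_liver"},
]
kept, skipped = split_skipped(rows)
```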

There will be a Peak Annotation Files sheet (that is not loaded into the database). Although the format should be able to be automatically determined, there will be a file type column (accucor/isocorr/isoautocorr). An additional Skip column will be added (if for any reason, a file included in the study is not ready to be loaded). And there is no need for a sample prefix, ~though I could add a blank samples column (comma-delimited) as an alternative to including them in the Peak Annotation Details sheet~.

mzXML / MSRunSample Loader

A separate loader (tentatively named MSRunsLoader/msruns_loader.py/load_msruns.py), specifically for the MSRunSample (and ArchiveFile) models, will be created that takes mzXML files and creates MSRunSample records for them (if necessary). Sample and MSRunSequence records must already exist (see the sequences and sample table loaders). This loader will largely ignore the peak annotation file names. A "Peak Annotation Details" sheet (in study.xlsx) or separate tsv file will be optional, as the relationship between mzXML and Sample can often be automatically determined. The script will use either the sample header (whose name typically matches the peak annotation header) or the mzXML name itself to match to a database sample record. The sample name column will be assumed to contain any prefix assigned to the sample. The sheet will contain a Sequence Name column (as opposed to the Sequence Number referred to in requirement 8.10.1.6), which will be composed of the comma-delimited composite key that will be used to identify the associated sequence record. This follows the established design patterns for all other loaders. The features established in #948 and those described in #753 and #829 will be used to help the user fill in the Sequence Name column (e.g. a dropdown in excel, populated by the Sequence Name column in the sequences sheet as of #948, will be used to populate the column of the same name in the Peak Annotation Details sheet). In the case of a tsv, it will have to be manually populated.
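A sketch of how a comma-delimited Sequence Name value might be split into its composite key. The four parts (operator, LC protocol name, instrument, date) are the default fields named elsewhere in this issue, but their order here, the example values, and the names `SequenceKey` and `parse_sequence_name` are assumptions for illustration:

```python
from typing import NamedTuple


class SequenceKey(NamedTuple):
    # ASSUMPTION: field order is illustrative, not taken from the codebase.
    operator: str
    lc_protocol_name: str
    instrument: str
    date: str


def parse_sequence_name(sequence_name):
    """Split a comma-delimited Sequence Name into a SequenceKey."""
    parts = [part.strip() for part in sequence_name.split(",")]
    if len(parts) != len(SequenceKey._fields) or not all(parts):
        raise ValueError(
            f"Expected {len(SequenceKey._fields)} comma-delimited values, "
            f"got: {sequence_name!r}"
        )
    return SequenceKey(*parts)


key = parse_sequence_name("Jane Doe, polar-HILIC-25-min, QE2, 2022-09-09")
```

Rejecting values with the wrong number of parts (or empty parts) up front gives the user a row-level error instead of a failed database lookup later.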

Code Change Description

Sequences Loader

Already Done. See #824.

Peak Annotations Loader

PeakAnnotationsLoader

Will inherit from TableLoader
Will have a convert_df abstract method that takes a dataframe, or a dict of dataframes keyed on sheet name, and returns a single dataframe in a universal format.
- If a csv/tsv is the source, it will be a dataframe.
The Peak Annotation Files sheet will get its own loader class, but it will not have its load_data method called. Instead, a method will be added called get_data that will return a dict of dicts keyed on annotation file. The dict it returns will be supplied to this loader's constructor.
A get_data method will also be added to the msruns_loader. The dict it returns will be supplied to this loader's constructor.
The PeakAnnotationsLoader's load_data method will cycle through all peak annotation files from the dict from the Peak Annotation Files sheet and through all the samples from the dict from the Peak Annotation Details sheet to create peak group and peak data records.

The derived classes should be very small. They will pretty much do what the current accucor loader does, only using the overall loading design established by the other loaders.

Skeleton Example:

class PeakAnnotationsLoader(TableLoader):
    @property
    @abstractmethod
    def column_renames(self):
        pass

    @property
    @abstractmethod
    def single_sheet(self):
        """Sheet name of a sheet (if it exists) that is completely sufficient for loading"""
        pass

    @abstractmethod
    def convert_df(self, df):
        pass

    ... Usual loader class

class IsocorrLoader(PeakAnnotationsLoader):
    column_renames = {...}
    def convert_df(self, df):
        # An excel source arrives as a dict of dataframes keyed on sheet name
        outdf = df
        if isinstance(df, dict):
            outdf = df["absolute"]
        return outdf.rename(columns=self.column_renames)

mzXML / MSRunSample Loader

The loader class will inherit from TableLoader and define concrete class attributes for the abstract ones defined in TableLoader. The following are a few of the settings that will be configured in those attributes:

Required columns/values will be:
- Sample name
- Either Sample Data Header or mzXML File Name
- Sequence Name

The load script will inherit from LoadTableCommand and add options for the following inputs:

mzXML files
~Options: --researcher, --date, --lc-protocol-name, --instrument~ Rob, 4/19/2024: I realized that, in keeping with the established curator patterns, the options specified by the curator should be recorded in the study itself, so instead of command line options, we will use either the existing --defaults-file or --defaults-sheet to convey these default values.

Option requirements:

mzXML files
and/or either:
- --infile (an option added by LoadTableCommand, will be made to be conditionally required) containing/using
- a "Peak Annotation Details" sheet (the default value of the --defaults-sheet option) and
- optionally, a "Defaults" sheet (the default value of the --defaults-sheet option)
- ~All of: --researcher, --date, --lc-protocol-name, --instrument~ (see note: Rob, 4/19/2024, above) --defaults-file specifying defaults for the Operator, Instrument, LC Protocol Name, and Date.

If both --infile and the ~above options (any of --researcher, --date, --lc-protocol-name, --instrument)~ defaults are supplied, the ~option~ default values are treated as defaults, as is already supported by LoadTableCommand. This also means that those ~option~ defaults are supported using the Defaults sheet of the study.xlsx doc.
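The precedence described here can be sketched as follows, with rows and defaults as plain dicts keyed on column header. This is only an illustration of the behavior, not the LoadTableCommand implementation, and `apply_defaults` is a hypothetical name:

```python
def apply_defaults(row, defaults):
    """Fill empty or missing row values from defaults.

    Values present in the row (from the infile) always win. Defaults
    only fill values that are missing, None, or an empty string.
    """
    merged = dict(row)
    for column, default in defaults.items():
        if merged.get(column) in (None, ""):
            merged[column] = default
    return merged


defaults = {"Operator": "Jane Doe", "Instrument": "QE2"}
row = {"Sample Name": "M02_plasma", "Operator": "", "Instrument": "QE"}
merged = apply_defaults(row, defaults)
```

Here the empty Operator is filled from the defaults, while the Instrument value supplied in the row is left alone.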

The concrete implementation of the abstract load_data method will:

1. Traverse the supplied mzXML files and create ArchiveFile records. Extract data from the mzXML files and store it in a 3D dict like: {mzXML_name: {archive_file_record_id: {"record": record, "polarity": parsed_polarity, "mz_min": parsed_mz_min, "mz_max": parsed_mz_max}}}
2. Traverse the infile and create MSRunSample records, associating them with the mzXML ArchiveFile records and Sample and MSRunSequence records along the way (keeping track of which mzXMLs have been added to MSRunSample records)
3. Traverse leftover mzXML/ArchiveFile records unassociated with those processed in step 2, using:
- The name of the mzXML automatically mapped to a sample name
- The default --researcher, --date, --lc-protocol-name, --instrument supplied (if any were not supplied, error).
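The bookkeeping between the infile pass and the leftover pass can be sketched without any database access. This assumes rows are dicts keyed on column header and that a leftover mzXML maps to a sample name via its file name minus the extension; the real name-matching rules may be more involved, and both function names are hypothetical:

```python
from pathlib import Path


def guess_sample_name(mzxml_path):
    """ASSUMPTION: the sample name is the file name without its extension."""
    return Path(mzxml_path).stem


def split_mzxml_files(mzxml_paths, detail_rows):
    """Separate mzXML files listed in the infile from leftover files.

    Returns (explicit, leftover): explicit is a list of paths whose file
    name appears in the "mzXML File Name" column, and leftover maps each
    remaining path to a guessed sample name.
    """
    listed = {
        row["mzXML File Name"]
        for row in detail_rows
        if row.get("mzXML File Name")
    }
    explicit, leftover = [], {}
    for path in mzxml_paths:
        if Path(path).name in listed:
            explicit.append(path)
        else:
            leftover[path] = guess_sample_name(path)
    return explicit, leftover


detail_rows = [{"Sample Name": "s1", "mzXML File Name": "s1_pos.mzXML"}]
explicit, leftover = split_mzxml_files(
    ["runs/s1_pos.mzXML", "runs/s2.mzXML"], detail_rows
)
```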

Tests

A test for each requirement

Princeton-LSI-ResearchComputing / tracebase