Implement a way to select a representation of a compound peak group

FEATURE REQUEST

Inspiration

In the TraceBase planning meeting on 9/4/2024, we discussed the multiple representations issue. We decided that we do indeed need to resolve multiple representations of peak group compounds before loading, but just need to make it easier.

Description

Implement a way to "select" the peak annotations file for each multiply represented compound that you want to use for that compound. Loading then would skip that compound in the other files (thereby eliminating the need to edit the peak annotation files).

Michael made a mockup of one way in which he imagined the selection would be made:

But one idea we had in the meeting was to make the selection in a sheet (named Peak Group Conflicts) in the study doc. One thought I had would be to make a sheet with the following columns:

Conflicting Peak Group
Sequence Name
Select a Peak Annotation File (dropdown showing the files the compound is in, in the sequence)

What the loader actually needs to know is which representations to skip. But given that we know the sequence each peak annot file is derived from, we can just skip the listed compound when it occurs in any other annot file belonging to the same sequence.

Alternatives

The problem with the feature described above is that it only records the file containing the representation to load, not the files in which that compound must be skipped. An alternative is to have a row for every compound, sequence, and file (instead of a dropdown) and add an extra column (either "skip" or "keep"), but that's not as easy on the user. However, it would also allow us to provide metadata, like the median ion count associated with that compound in each file.

The suggested feature above also relies on the assumption that a peak annotation file only ever contains data from 1 sequence. If it has data from multiple sequences, you could end up skipping a compound from the wrong sequence (where there is not a multiple representation - or some other complex association).
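The single-sequence assumption could be checked before any skipping is done. The following is only a minimal sketch, assuming the file, sample, and sequence associations are available as simple tuples; the function name and the example file, sample, and sequence names are hypothetical and not part of the TraceBase codebase.

```python
from collections import defaultdict


def files_spanning_multiple_sequences(file_sample_sequences):
    """Return {file name: sorted sequence names} for files with data from >1 sequence.

    file_sample_sequences: iterable of (file name, sample name, sequence name)
    tuples (a hypothetical input shape chosen for illustration).
    """
    sequences_by_file = defaultdict(set)
    for file_name, _sample_name, sequence_name in file_sample_sequences:
        sequences_by_file[file_name].add(sequence_name)
    return {
        file_name: sorted(sequences)
        for file_name, sequences in sequences_by_file.items()
        if len(sequences) > 1
    }


# Hypothetical example data: pos.xlsx mixes samples from two sequences.
records = [
    ("neg.xlsx", "s1", "seq A"),
    ("neg.xlsx", "s2", "seq A"),
    ("pos.xlsx", "s1", "seq A"),
    ("pos.xlsx", "s3", "seq B"),
]
offending_files = files_spanning_multiple_sequences(records)
```

Any file reported here is one where skipping a compound by file and sequence alone could drop the wrong data.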

Dependencies

This issue cannot be fully completed until the following are completed:

Issue #1095 - In order to generate the sheet where the selections must be made, multiple representations must be detected in the autofill step.
PR #1196 - To ensure that the researcher did everything completely and correctly, we need the multiple representations error summary implemented here.
Issue #1097 - In order to know which files are from the same sequence, the sequence info must be supplied

Comment

So the process would go like this:

User uploads peak annot files on the submission start page
The interface detects multiple representations and if there are any, adds a sheet where selections must be made
The user selects the desired file for each compound
When the peak annotation files are loaded, it can read the added sheet to create a dict like: selected_representations[compound][sequence_name] = selected_peak_annot_file
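As a rough illustration of the last step, the dict could be built from the sheet's rows as follows. This is a sketch only: it assumes the rows have already been read into plain dicts keyed on the proposed column headers, and the function name, file names, and sequence names are hypothetical placeholders.

```python
from collections import defaultdict


def build_selected_representations(rows):
    """Build selected_representations[compound][sequence_name] = selected file."""
    selected = defaultdict(dict)
    for row in rows:
        compound = row["Conflicting Peak Group"]
        sequence_name = row["Sequence Name"]
        selected[compound][sequence_name] = row["Select a Peak Annotation File"]
    # Return a plain dict so that missing compounds raise KeyError
    # instead of silently creating empty entries.
    return dict(selected)


# Hypothetical rows, as they might be read from the added sheet.
rows = [
    {
        "Conflicting Peak Group": "glucose",
        "Sequence Name": "seq A",
        "Select a Peak Annotation File": "neg.xlsx",
    },
    {
        "Conflicting Peak Group": "glucose",
        "Sequence Name": "seq B",
        "Select a Peak Annotation File": "pos.xlsx",
    },
]
selected_representations = build_selected_representations(rows)
```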

Just note that this does not account for the possibility that different peak annot files could contain complementary sets of samples. Since we're not recording the files where this compound must be skipped, there's the theoretical possibility that the compound could be skipped in a file where it shouldn't be skipped (because that file simply contains different samples).

ISSUE OWNER SECTION

Assumptions

If a peak group uses a different synonym of the same compound (differing by more than just case), it will be assumed to be qualitatively different from peak groups that link to the same compound(s) under another synonym, i.e. it will not be treated as a multiple representation.

Limitations

This will only allow selection of data from 1 file for each compound. I.e. it will not merge data in any way.

Affected Components

change: DataRepo/loaders/peak_annotations_loader.py
change: DataRepo/loaders/peak_annotation_files_loader.py
change: DataRepo/management/commands/load_peak_annotations.py
change: DataRepo/management/commands/load_peak_annotation_files.py
change: DataRepo/views/upload/submission.py
add: DataRepo/loaders/peak_group_conflicts.py

Requirements

[ ] 1. Users are able to select a peak annotation file name for each multiply represented peak group compound synonym from all peak annotation files that contain that peak group compound synonym for the same samples/sequence.
[ ] 2. A peak group will only be loaded from the peak annotation file the user selected. I.e. that compound for the same sample/sequence in other files will be skipped.
[ ] 3. Peak Groups whose names only differ by case will be treated as multiply represented
[ ] 4. If a peak group name includes synonyms from multiple compounds and at least 1 synonym differs (by more than just case) from that of another peak group name, the peak group will be loaded with all synonyms included and will not be considered to be a multiply represented peak group compound.
[ ] 5. The order of compound synonyms in a peak group name will not make them different if they have the same set of compound synonyms.
[ ] 6. A peak group with a subset of compound synonyms of another peak group constitutes a different peak group (i.e. not multiply represented).
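Requirements 3 through 6 amount to comparing peak group names through a normalized key. The sketch below assumes that the compound synonyms in a peak group name are separated by "/" (an assumption, since the delimiter is not stated here); the function names are hypothetical.

```python
def peak_group_key(name, delimiter="/"):
    """Return a case- and order-insensitive key for a peak group name.

    The delimiter is an assumption made for illustration.
    """
    return tuple(sorted(part.strip().lower() for part in name.split(delimiter)))


def is_multiply_represented(name_a, name_b):
    """True if two peak group names have the same set of synonyms, ignoring case and order."""
    return peak_group_key(name_a) == peak_group_key(name_b)


# Requirement 3: names differing only by case conflict.
case_only = is_multiply_represented("Glucose", "glucose")
# Requirement 5: synonym order does not matter.
reordered = is_multiply_represented("citrate/isocitrate", "Isocitrate/Citrate")
# Requirement 6: a subset of synonyms is a different peak group.
subset = is_multiply_represented("citrate/isocitrate", "citrate")
# Requirement 4: a synonym differing by more than case is a different peak group.
other_synonym = is_multiply_represented("glucose", "D-glucose")
```

Comparing keys rather than raw names keeps the conflict rule in one place, so it would only need to change there if the treatment of synonyms is revisited.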

DESIGN

Interface Change description

The loaders and commands will have new options for file, sheet, and/or dataframe for the multiply represented peak group compounds.

The excel study doc from the submission interface's start page will contain a "Peak Group Conflicts" sheet if multiple representations exist. Only conflicting peak group compounds will be included.

Code Change Description

A new loader PeakGroupConflicts will be created, whose load_data method just returns data (it performs no load). The data will be a dict of selected file names keyed on sequence name and on peak group name (lower-cased, with its compound synonyms sorted). It will define a sheet named Peak Group Conflicts that contains the following columns, in the same manner as all of the other loaders:

Conflicting Peak Group
Sequence Name
Select a Peak Annotation File

A selected file will be required on every row. The Conflicting Peak Group and Sequence Name combo will be required to be unique.
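The required-value and uniqueness rules could be enforced while the dict is built. This is a sketch under stated assumptions, not the loader itself: rows are plain dicts rather than a dataframe, synonyms are assumed to be "/"-delimited, errors are collected in a list rather than raised, and `load_conflicts` is a hypothetical name.

```python
def load_conflicts(rows, delimiter="/"):
    """Return (selections, errors), where selections[sequence][peak group key] = file.

    The peak group key is the name lower-cased with its synonyms sorted.
    """
    selections = {}
    errors = []
    # Row 1 of the sheet is assumed to be the header, so data starts at row 2.
    for rownum, row in enumerate(rows, start=2):
        name = (row.get("Conflicting Peak Group") or "").strip()
        sequence = (row.get("Sequence Name") or "").strip()
        selected_file = (row.get("Select a Peak Annotation File") or "").strip()
        if not selected_file:
            errors.append(f"Row {rownum}: no peak annotation file selected")
            continue
        key = delimiter.join(sorted(p.strip().lower() for p in name.split(delimiter)))
        by_sequence = selections.setdefault(sequence, {})
        if key in by_sequence:
            errors.append(f"Row {rownum}: duplicate peak group and sequence")
            continue
        by_sequence[key] = selected_file
    return selections, errors


# Hypothetical rows: the second duplicates the first, the third has no selection.
rows = [
    {"Conflicting Peak Group": "Isocitrate/Citrate", "Sequence Name": "seq A",
     "Select a Peak Annotation File": "neg.xlsx"},
    {"Conflicting Peak Group": "citrate/isocitrate", "Sequence Name": "seq A",
     "Select a Peak Annotation File": "pos.xlsx"},
    {"Conflicting Peak Group": "glucose", "Sequence Name": "seq A",
     "Select a Peak Annotation File": ""},
]
selections, errors = load_conflicts(rows)
```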

Both the PeakAnnotationsLoader and PeakAnnotationFilesLoader (and their respective load command scripts) will take a file and sheet name, and the loaders will take a dataframe as well.

The PeakAnnotationsLoader will call PeakGroupConflicts.load_data() and save the returned dict. For each peak group, it will test whether that sequence and peak group name exist in the dict and, if they do, check whether the selected file name matches the file being loaded. If it does not match and the sample is in both files, the peak group is skipped.
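The skip decision described above might look something like the following. This is a hedged sketch: `should_skip` and `samples_by_file` are hypothetical names, the peak group key is assumed to be already normalized (lower-cased, synonyms sorted), and how the loader would actually learn which samples each file contains is not specified here.

```python
def should_skip(selections, sequence_name, peak_group_key, current_file,
                sample, samples_by_file):
    """Decide whether a peak group in current_file should be skipped.

    selections: dict of selected file names keyed on sequence name, then on
        normalized peak group key.
    samples_by_file: hypothetical mapping of file name to the set of sample
        names that file contains.
    """
    selected_file = selections.get(sequence_name, {}).get(peak_group_key)
    if selected_file is None:
        return False  # No conflict was recorded for this peak group.
    if selected_file == current_file:
        return False  # This is the representation the user selected.
    # Only skip when the selected file also covers this sample.
    return sample in samples_by_file.get(selected_file, set())


# Hypothetical example data.
selections = {"seq A": {"glucose": "neg.xlsx"}}
samples_by_file = {"neg.xlsx": {"s1", "s2"}, "pos.xlsx": {"s1", "s3"}}

# s1 is in both files, so glucose in pos.xlsx is skipped for s1 ...
skip_s1 = should_skip(selections, "seq A", "glucose", "pos.xlsx", "s1", samples_by_file)
# ... but s3 is only in pos.xlsx, so its glucose peak group is kept.
skip_s3 = should_skip(selections, "seq A", "glucose", "pos.xlsx", "s3", samples_by_file)
```

The sample check is what keeps a compound from being skipped in a file that simply contains different samples.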

Tests

A test for each requirement

I think the dropdown in excel could work. From the user's perspective, it's easier to select the best representation than to mark the opposite.

I'm trying to work up an example excel sheet to see what you think.

This is the mockup of a new sheet, conditionally added to the study doc (if there are multiple representations to choose from). Michael looked at it and liked it, so I will move forward with implementing this.

MASTER STUDY DOC FOR TRACEBASE_exp011b3_240816-edit-mock.xlsx

The description I supplied via slack:

OK @Michael Neinast - I just created an example of how I could represent the selections that need to be made for the conflicting compounds ("multiple peak group representations"). I created drop-down examples for the first 30 rows.

Researchers would be required to pick a file in the dropdowns for every row.

Note, this is a rough mock-up. Some rows are missing sequence names (but the code could easily include them). The sequence name elements are not in the same order as they are represented in other sheets (which is due to the way I quickly created this sheet).

A few notes:

The "Conflicting Compound" is the peak group name, so it won't necessarily correspond directly to a compound in the compounds sheet. Perhaps I should name it "Peak Group Name" instead, to avoid confusion?

The sequence name is necessary to be able to identify the files in which the compound must be skipped.

Compounds repeat because they are multiply represented in multiple sequences' peak annotation files.

The name of the added sheet is "Compound Measurement Conflicts". I only ever expect to include it IF there are multiple representations.

Alternatively, I could name it "Measured Compounds" (or "Peak Groups") and include every measured compound and only have the dropdowns for the rows where there is a conflict.

@mneinast - could you take a look at the assumptions, limitations, and requirements sections and let me know if there are any glaring omissions or problems? I know it's rather technical. I tried to break it up as granularly as possible, but I just want to be sure there aren't any significant problems.

Note also that I'm aware that the inclusion of MSRunSequence in the consideration of what constitutes multiple representations is in question, but until that's decided, I'm intentionally proceeding with the existing requirements. So if we eliminate that consideration, this will be changed.

I have a PR out now for a part of this issue.

To make sure I understand, is this essentially selecting a preferred representation out of a set of matched synonyms, and that two different synonyms for the same primary compound would still be loaded?

I'm writing out examples based on how I interpreted the Assumptions and Requirements above:

PrimaryCompoundName: "Glucose" Synonyms: "Glucose", "glucose", "D-glucose"

PeakGroup from file A: "glucose" PeakGroup from file B: "glucose" -> multiple representations...must select one

PeakGroup from file A: "glucose" PeakGroup from file B: "Glucose" -> multiple representations...must select one

PeakGroup from file A: "glucose" PeakGroup from file B: "D-glucose" -> not multiple representations...both are loaded

I think that my expectation would be for us to identify the final example as multiple representations. Later, we could decide to split the tracebase primary compound "Glucose" into a second primary compound "D-glucose" if we thought that was necessary.

@mneinast - Yes, you understand this correctly.

Currently, we don't have enough stored compound information to both:

equate glucose and d-glucose
differentiate l-glucose and "r-glucose" (i.e. stereo-isomers)

We do have enough information to equate any peak group linking to the same compound record(s), but we can't do both that and stereo-isomers. It's one or the other. (I know you know this already, but I'm just establishing this to set me up for the below...)

I know you stated you were thinking we should drop support for differentiating stereo-isomers. As a software engineer, my thinking is that until that (as a separate issue) is implemented and the final decision is made, extant designs should stick with the established requirements.

For example, for all I know, we could decide to add a stereo-isomer field to the compound synonym table that would allow us to equate glucose and d-glucose and differentiate glucose and l-glucose.

I could put off this issue until the stereo-isomer support issue is decided, however I don't think changing this would be much more work than the changes that would otherwise need to be made to drop support for stereo-isomer differentiation. Besides, my understanding is that loading the current outstanding studies is higher priority. And, there likely exist multiple representations (with different synonyms) elsewhere in the legacy data.

Princeton-LSI-ResearchComputing / tracebase