Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

Implement a way to select a representation of a compound peak group #1199

Open hepcat72 opened 2 weeks ago

hepcat72 commented 2 weeks ago

FEATURE REQUEST

Inspiration

In the TraceBase planning meeting on 9/4/2024, we discussed the multiple representations issue. We decided that we do indeed need to resolve multiple representations of peak group compounds before loading, but just need to make it easier.

Description

Implement a way to "select", for each multiply represented compound, the peak annotations file whose representation you want to use for that compound. Loading would then skip that compound in the other files (thereby eliminating the need to edit the peak annotation files).

Michael made a mockup of one way in which he imagined the selection would be made:

[Image: multrepmock]

But one idea we had in the meeting was to make the selection in a sheet (named Peak Group Conflicts) in the study doc. One thought I had would be to make a sheet with the following columns:

What the loader actually needs to know is which representations to skip. But since we know the sequence each peak annot file is derived from, the loader can simply skip the listed compound whenever it occurs in a different annot file belonging to the same sequence.

Alternatives

The problem with the feature described above is that it only records the file that contains the representation to load, not the files in which the compound must be skipped. An alternative is to have a row for every compound, sequence, and file (instead of a dropdown) and add an extra column (either "skip" or "keep"), but that's not as easy on the user. However, it would also allow us to provide metadata, like the median ion count associated with that compound in each file.

The suggested feature above also relies on the assumption that a peak annotation file only ever contains data from 1 sequence. If it has data from multiple sequences, you could end up skipping a compound from the wrong sequence (where there is not a multiple representation - or some other complex association).

Dependencies

This issue cannot be fully completed until the following are completed:

Comment

So the process would go like this:

  1. User uploads peak annot files on the submission start page
  2. The interface detects multiple representations and if there are any, adds a sheet where selections must be made
  3. The user selects the desired file for each compound
  4. When the peak annotation files are loaded, it can read the added sheet to create a dict like: selected_representations[compound][sequence_name] = selected_peak_annot_file
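Steps 2 and 4 can be sketched roughly as follows. This is only an illustration under assumed inputs: the function names, the tuple layouts, and the example file and sequence names are hypothetical and are not taken from the TraceBase code.

```python
from collections import defaultdict


def find_conflicts(records):
    """Find compounds that occur in more than one peak annot file per sequence.

    records: iterable of (peak_annot_file, sequence_name, compound) tuples
    (a hypothetical, simplified view of the uploaded files).
    Returns {(compound, sequence_name): sorted list of files}.
    """
    files_by_key = defaultdict(set)
    for peak_annot_file, sequence_name, compound in records:
        files_by_key[(compound, sequence_name)].add(peak_annot_file)
    # Only combinations seen in 2 or more files need a selection
    return {
        key: sorted(files)
        for key, files in files_by_key.items()
        if len(files) > 1
    }


def build_selected_representations(selection_rows):
    """Build selected_representations[compound][sequence_name] = file.

    selection_rows: iterable of (compound, sequence_name, selected_file)
    tuples, as they might be read from the added sheet.
    """
    selected = defaultdict(dict)
    for compound, sequence_name, selected_file in selection_rows:
        selected[compound][sequence_name] = selected_file
    return dict(selected)


# Made-up example data
records = [
    ("fileA.xlsx", "seq1", "glucose"),
    ("fileB.xlsx", "seq1", "glucose"),
    ("fileB.xlsx", "seq1", "lactate"),
]
conflicts = find_conflicts(records)
selected_representations = build_selected_representations(
    [("glucose", "seq1", "fileA.xlsx")]
)
```

Here only glucose in seq1 would need a row in the added sheet, because lactate occurs in a single file.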

Just note that this does not account for the possibility that different peak annot files could contain complementary sets of samples. Since we're not recording the files in which the compound must be skipped, there's the theoretical possibility that the compound could be skipped in a file where it shouldn't be skipped (because that file simply contains different samples).


ISSUE OWNER SECTION

Assumptions

  1. If a peak group uses a different synonym of the same compound (compared case-insensitively), it will be assumed to be qualitatively different from peak groups that link to the same compound under another synonym.

Limitations

  1. This will only allow selection of data from 1 file for each compound. I.e. it will not merge data in any way.

Affected Components

Requirements

DESIGN

Interface Change description

The loaders and commands will have new options for file, sheet, and/or dataframe for the multiply represented peak group compounds.

The excel study doc from the submission interface's start page will contain a "Peak Group Conflicts" sheet if multiple representations exist. Only conflicting peak group compounds will be included.

Code Change Description

A new loader PeakGroupConflicts will be created, whose load_data method just returns data (it performs no load). The returned data will be a dict of selected file names, keyed first on sequence name and then on the sorted, lower-cased peak group name. The loader will define a sheet named Peak Group Conflicts that contains the following columns in the same manner as all of the other loaders:
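A minimal sketch of the returned structure, not the actual loader. It assumes, purely for illustration, that a peak group name may list several compounds separated by "/" and that "sorted" refers to the order of those parts; the function names and row layout are hypothetical.

```python
from collections import defaultdict


def normalize_peak_group_name(name, delimiter="/"):
    """Lower-case the name and sort its delimited parts.

    The "/" delimiter is an assumption made for this sketch.
    """
    parts = [part.strip().lower() for part in name.split(delimiter)]
    return delimiter.join(sorted(parts))


def conflicts_to_dict(rows):
    """Return selected file names keyed on sequence name, then peak group.

    rows: iterable of (peak_group_name, sequence_name, selected_file)
    tuples (a hypothetical stand-in for the sheet's rows).
    """
    selected = defaultdict(dict)
    for peak_group_name, sequence_name, selected_file in rows:
        key = normalize_peak_group_name(peak_group_name)
        selected[sequence_name][key] = selected_file
    return dict(selected)


# Made-up example row
selected_files = conflicts_to_dict(
    [("Isocitrate/Citrate", "seq1", "fileA.xlsx")]
)
```

Normalizing the key this way means "Isocitrate/Citrate" and "citrate/isocitrate" would look up the same selection.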

A selected file will be required on every row. The Conflicting Peak Group and Sequence Name combo will be required to be unique.
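The two constraints could be checked along these lines. This is a hedged sketch only: the function name, the tuple layout, and the error wording are hypothetical, and the real loaders presumably enforce this through their own required-value and uniqueness mechanisms.

```python
from collections import Counter


def validate_conflict_rows(rows):
    """Check the two sheet constraints and return a list of error strings.

    rows: list of (peak_group, sequence_name, selected_file) tuples
    (hypothetical layout). An empty list means the rows are valid.
    """
    errors = []

    # Constraint 1: a selected file is required on every row
    for row_num, (_, _, selected_file) in enumerate(rows, start=1):
        if selected_file is None or not str(selected_file).strip():
            errors.append(f"row {row_num}: a selected file is required")

    # Constraint 2: the (peak group, sequence name) combo must be unique
    counts = Counter((peak_group, seq) for peak_group, seq, _ in rows)
    for (peak_group, seq), count in counts.items():
        if count > 1:
            errors.append(
                f"({peak_group!r}, {seq!r}) occurs on {count} rows; "
                "it must be unique"
            )
    return errors


# Made-up example rows: one missing selection and one duplicated combo
example_errors = validate_conflict_rows(
    [
        ("glucose", "seq1", "fileA.xlsx"),
        ("glucose", "seq1", "fileB.xlsx"),
        ("lactate", "seq1", ""),
    ]
)
```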

Both the PeakAnnotationsLoader and PeakAnnotationFilesLoader (and their respective load command scripts) will take a file and sheet name, and the loaders will take a dataframe as well.

The PeakAnnotationsLoader will call PeakGroupConflicts.load_data() and save the returned dict. For each peak group, it will test whether that sequence and peak group name exist in the dict and, if they do, check whether the selected file name matches the file being loaded. If the file name does not match and the sample is in both files, the peak group is skipped.
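The skip decision could look roughly like this. It is a sketch under stated assumptions: `should_skip`, `samples_by_file`, and the dict layout are hypothetical names, and "the sample is in both files" is interpreted as "the sample also occurs in the selected file".

```python
def should_skip(selected, sequence_name, peak_group, current_file,
                sample, samples_by_file):
    """Decide whether a peak group in current_file should be skipped.

    selected: {sequence_name: {peak_group: selected_file}}
    samples_by_file: {file_name: set of sample names} (hypothetical
    helper data, not part of the described design).
    """
    selected_file = selected.get(sequence_name, {}).get(peak_group)
    if selected_file is None:
        # No conflict recorded for this sequence and peak group
        return False
    if selected_file == current_file:
        # This is the chosen representation
        return False
    # Skip only if the selected file also covers this sample
    return sample in samples_by_file.get(selected_file, set())


# Made-up example data
selected = {"seq1": {"glucose": "fileA.xlsx"}}
samples_by_file = {
    "fileA.xlsx": {"s1", "s2"},
    "fileB.xlsx": {"s2", "s3"},
}
skip_s2 = should_skip(
    selected, "seq1", "glucose", "fileB.xlsx", "s2", samples_by_file
)
skip_s3 = should_skip(
    selected, "seq1", "glucose", "fileB.xlsx", "s3", samples_by_file
)
```

With this data, glucose for sample s2 in fileB.xlsx is skipped because fileA.xlsx was selected and also contains s2, while s3 is still loaded from fileB.xlsx because the selected file has no data for it.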

Tests

A test for each requirement

mneinast commented 2 weeks ago

I think the dropdown in excel could work. From the user's perspective, it's easier to select the best representation than to mark the opposite.

hepcat72 commented 2 weeks ago

> I think the dropdown in excel could work. From the user's perspective, it's easier to select the best representation than to mark the opposite.

I'm trying to work up an example excel sheet to see what you think.

hepcat72 commented 1 week ago

This is the mockup of a new sheet, conditionally added to the study doc (if there are multiple representations to choose from). Michael looked at it and liked it, so I will move forward with implementing this.

MASTER STUDY DOC FOR TRACEBASE_exp011b3_240816-edit-mock.xlsx

The description I supplied via slack:

OK @Michael Neinast - I just created an example of how I could represent the selections that need to be made for the conflicting compounds ("multiple peak group representations"). I created drop-down examples for the first 30 rows.

Researchers would be required to pick a file in the dropdowns for every row.

Note, this is a rough mock-up. Some rows are missing sequence names (but the code could easily include them). The sequence name elements are not in the same order as they are represented in other sheets (which is due to the way I quickly created this sheet).

A few notes:

  • The "Conflicting Compound" is the peak group name, so it won't necessarily correspond directly to a compound in the compounds sheet. Perhaps I should name it "Peak Group Name" instead, to avoid confusion?
  • The sequence name is necessary to be able to identify the files in which the compound must be skipped.
  • Compounds repeat because they are multiply represented in multiple sequences' peak annotation files.

The name of the added sheet is "Compound Measurement Conflicts". I only ever expect to include it IF there are multiple representations.

Alternatively, I could name it "Measured Compounds" (or "Peak Groups") and include every measured compound and only have the dropdowns for the rows where there is a conflict.

hepcat72 commented 6 days ago

@mneinast - could you take a look at the assumptions, limitations, and requirements sections and let me know if there are any glaring omissions or problems? I know it's rather technical. I tried to break it up as granularly as possible, but I just want to be sure there aren't any significant problems.

Note also that I'm aware that the inclusion of MSRunSequence in the consideration of what constitutes multiple representations is in question, but until that's decided, I'm intentionally proceeding with the existing requirements. So if we eliminate that consideration, this will be changed.

I have a PR out now for a part of this issue.

mneinast commented 6 days ago

To make sure I understand: is this essentially selecting a preferred representation out of a set of matched synonyms, such that two different synonyms for the same primary compound would both still be loaded?

I'm writing out examples based on how I interpreted the Assumptions and Requirements above:

PrimaryCompoundName: "Glucose"
Synonyms: "Glucose", "glucose", "D-glucose"

PeakGroup from file A: "glucose"
PeakGroup from file B: "glucose"
-> multiple representations...must select one

PeakGroup from file A: "glucose"
PeakGroup from file B: "Glucose"
-> multiple representations...must select one

PeakGroup from file A: "glucose"
PeakGroup from file B: "D-glucose"
-> not multiple representations...both are loaded

I think that my expectation would be for us to identify the final example as multiple representations. Later, we could decide to split the tracebase primary compound "Glucose" into a second primary compound "D-glucose" if we thought that was necessary.
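The difference between the two interpretations can be shown with a small sketch. The synonym table and function below are hypothetical and only mirror the Glucose example above; they are not TraceBase code.

```python
# Hypothetical synonym table: lower-cased synonym -> primary compound name
SYNONYM_TO_COMPOUND = {
    "glucose": "Glucose",
    "d-glucose": "Glucose",
}


def is_multiple_representation(name_a, name_b, by="synonym"):
    """Compare two peak group names from different files.

    by="synonym": names conflict only if the synonyms match
    case-insensitively (the behavior described in the Assumptions).
    by="compound": names conflict if they link to the same primary
    compound (the expectation stated for the final example).
    """
    a, b = name_a.lower(), name_b.lower()
    if by == "synonym":
        return a == b
    if by == "compound":
        compound_a = SYNONYM_TO_COMPOUND.get(a)
        compound_b = SYNONYM_TO_COMPOUND.get(b)
        return compound_a is not None and compound_a == compound_b
    raise ValueError(f"unknown comparison mode: {by!r}")


same_case = is_multiple_representation("glucose", "glucose")
diff_case = is_multiple_representation("glucose", "Glucose")
diff_synonym = is_multiple_representation("glucose", "D-glucose")
diff_synonym_by_compound = is_multiple_representation(
    "glucose", "D-glucose", by="compound"
)
```

Under the synonym-based comparison, the first two pairs are flagged and the third is not. Under the compound-based comparison, the third pair is flagged as well.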

hepcat72 commented 6 days ago

@mneinast - Yes, you understand this correctly.

Currently, we don't have enough stored compound information to both:

  1. equate glucose and d-glucose
  2. differentiate l-glucose and d-glucose (i.e. stereo-isomers)

We do have enough information to equate any peak group linking to the same compound record(s), but we can't do both that and stereo-isomers. It's one or the other. (I know you know this already, but I'm just establishing this to set me up for the below...)

I know you stated you were thinking we should drop support for differentiating stereo-isomers. As a software engineer, my thinking is that until that (as a separate issue) is implemented and the final decision is made, extant designs should stick with the established requirements.

For example, for all I know, we could decide to add a stereo-isomer field to the compound synonym table that would allow us to equate glucose and d-glucose and differentiate glucose and l-glucose.

I could put off this issue until the stereo-isomer support issue is decided. However, I don't think changing this later would be much more work than the changes that would otherwise need to be made to drop support for stereo-isomer differentiation. Besides, my understanding is that loading the current outstanding studies is higher priority. And there likely exist multiple representations (with different synonyms) elsewhere in the legacy data.