Closed hepcat72 closed 2 months ago
I think the dropdown in excel could work. From the user's perspective, it's easier to select the best representation than to mark the opposite.
I'm trying to work up an example excel sheet to see what you think.
This is the mockup of a new sheet, conditionally added to the study doc (if there are multiple representations to choose from). Michael looked at it and liked it, so I will move forward with implementing this.
MASTER STUDY DOC FOR TRACEBASE_exp011b3_240816-edit-mock.xlsx
The description I supplied via slack:
OK @Michael Neinast - I just created an example of how I could represent the selections that need to be made for the conflicting compounds ("multiple peak group representations"). I created drop-down examples for the first 30 rows.
Researchers would be required to pick a file in the dropdowns for every row.
Note, this is a rough mock-up. Some rows are missing sequence names (but the code could easily include them). The sequence name elements are not in the same order as they are represented in other sheets (which is due to the way I quickly created this sheet).
A few notes:
- The "Conflicting Compound" is the peak group name, so it won't necessarily correspond directly to a compound in the compounds sheet. Perhaps I should name it "Peak Group Name" instead, to avoid confusion?
- The sequence name is necessary to be able to identify the files in which the compound must be skipped.
- Compounds repeat because they are multiply represented in multiple sequences' peak annotation files.
The name of the added sheet is "Compound Measurement Conflicts". I only ever expect to include it IF there are multiple representations.
Alternatively, I could name it "Measured Compounds" (or "Peak Groups") and include every measured compound and only have the dropdowns for the rows where there is a conflict.
@mneinast - could you take a look at the assumptions, limitations, and requirements sections and let me know if there are any glaring omissions or problems? I know it's rather technical. I tried to break it up as granularly as possible, but I just want to be sure there aren't any significant problems.
Note also that I'm aware that the inclusion of MSRunSequence in the consideration of what constitutes multiple representations is in question, but until that's decided, I'm intentionally proceeding with the existing requirements. So if we eliminate that consideration, this will be changed.
I have a PR out now for a part of this issue.
To make sure I understand: is this essentially selecting a preferred representation out of a set of matched synonyms, and would two different synonyms for the same primary compound still both be loaded?
I'm writing out examples based on how I interpreted the Assumptions and Requirements above:
PrimaryCompoundName: "Glucose" Synonyms: "Glucose", "glucose", "D-glucose"
PeakGroup from file A: "glucose" PeakGroup from file B: "glucose" -> multiple representations...must select one
PeakGroup from file A: "glucose" PeakGroup from file B: "Glucose" -> multiple representations...must select one
PeakGroup from file A: "glucose" PeakGroup from file B: "D-glucose" -> not multiple representations...both are loaded
I think that my expectation would be for us to identify the final example as multiple representations. Later, we could decide to split the tracebase primary compound "Glucose" into a second primary compound "D-glucose" if we thought that was necessary.
@mneinast - Yes, you understand this correctly.
Currently, we don't have enough stored compound information to both:
- equate glucose and d-glucose
- differentiate l-glucose and "r-glucose" (i.e. stereo-isomers)

We do have enough information to equate any peak group linking to the same compound record(s), but we can't do both that and differentiate stereo-isomers. It's one or the other. (I know you know this already, but I'm just establishing this to set me up for the below...)
I know you stated you were thinking we should drop support for differentiating stereo-isomers. As a software engineer, my thinking is that until that (as a separate issue) is implemented and the final decision is made, extant designs should stick with the established requirements.
For example, for all I know, we could decide to add a stereo-isomer field to the compound synonym table that would allow us to equate glucose and d-glucose and differentiate glucose and l-glucose.
I could put off this issue until the stereo-isomer support issue is decided. However, I don't think changing this later would be much more work than the changes that would otherwise need to be made to drop support for stereo-isomer differentiation. Besides, my understanding is that loading the current outstanding studies is higher priority. And there likely exist multiple representations (with different synonyms) elsewhere in the legacy data.
FEATURE REQUEST
Inspiration
In the TraceBase planning meeting on 9/4/2024, we discussed the multiple representations issue. We decided that we do indeed need to resolve multiple representations of peak group compounds before loading, but just need to make it easier.
Description
Implement a way to "select" the peak annotations file for each multiply represented compound that you want to use for that compound. Loading then would skip that compound in the other files (thereby eliminating the need to edit the peak annotation files).
Michael made a mockup of one way in which he imagined the selection would be made:
But one idea we had in the meeting was to make the selection in a sheet (named "Peak Group Conflicts") in the study doc. One thought I had would be to make a sheet with the following columns:

What the loader needs to know is which ones to skip. But given we know the sequence each peak annot file is derived from, we can just skip the listed compound when it occurs in any other annot file belonging to the same sequence.
Alternatives
The problem with the feature described above is that it only records the file that contains the representation to load. An alternative is to have a row for every compound, sequence, and file (instead of a dropdown) and add an extra column (either "skip" or "keep"), but that's not as easy on the user. However, it would also allow us to provide metadata, like the median ion count associated with that compound in each file.
The suggested feature above also relies on the assumption that a peak annotation file only ever contains data from 1 sequence. If it has data from multiple sequences, you could end up skipping a compound from the wrong sequence (where there is not a multiple representation - or some other complex association).
Dependencies
This issue cannot be fully completed until the following are completed:
Comment
So the process would go like this:
Just note that this does not account for the possibility that different peak annot files could contain complementary sets of samples. Since we're not recording the files where this compound must be skipped, there's the theoretical possibility that the compound could be skipped in a file where it shouldn't be skipped (because that file simply contains different samples).
ISSUE OWNER SECTION
Assumptions
Limitations
Affected Components
DataRepo/loaders/peak_annotations_loader.py
DataRepo/loaders/peak_annotation_files_loader.py
DataRepo/management/commands/load_peak_annotations.py
DataRepo/management/commands/load_peak_annotation_files.py
DataRepo/views/upload/submission.py
DataRepo/loaders/peak_group_conflicts.py
Requirements
1. Users are able to select a peak annotation file name for each multiply represented peak group compound synonym from all peak annotation files that contain that peak group compound synonym for the same samples/sequence.
2. A peak group will only be loaded from the peak annotation file the user selected. I.e. that compound for the same sample/sequence in other files will be skipped.
3. Peak Groups whose names only differ by case will be treated as multiply represented.
4. If a peak group name includes synonyms from multiple compounds and at least 1 synonym differs (by more than just case) from that of another peak group name, the peak group will be loaded with all synonyms included and will not be considered to be a multiply represented peak group compound.
5. The order of compound synonyms in a peak group name will not make them different if they have the same set of compound synonyms.
6. A peak group with a subset of compound synonyms of another peak group constitutes a different peak group (i.e. not multiply represented).

DESIGN
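Requirements 3 through 6 amount to comparing peak group names as case-insensitive, order-insensitive sets of synonyms. A minimal sketch of that comparison follows. The function names are hypothetical, and the "/" delimiter between synonyms in a peak group name is an assumption, not something stated in this issue.

```python
def peak_group_key(name, delimiter="/"):
    """Return a case- and order-insensitive key for a peak group name.

    ASSUMPTION: synonyms in a multi-compound peak group name are separated
    by "/". Swap in the real delimiter if it differs.
    """
    synonyms = [part.strip().lower() for part in name.split(delimiter)]
    return delimiter.join(sorted(synonyms))


def is_multiply_represented(name_a, name_b, delimiter="/"):
    """True when two peak group names count as the same representation."""
    return peak_group_key(name_a, delimiter) == peak_group_key(name_b, delimiter)


# Requirement 3: names differing only by case are multiply represented.
print(is_multiply_represented("glucose", "Glucose"))
# Requirement 5: synonym order does not matter.
print(is_multiply_represented("citrate/isocitrate", "Isocitrate/Citrate"))
# Requirement 6: a subset of synonyms is a different peak group.
print(is_multiply_represented("citrate", "citrate/isocitrate"))
# Requirement 4: different synonyms (beyond case) are not multiply represented.
print(is_multiply_represented("glucose", "D-glucose"))
```

Comparing on the sorted, lower-cased name matches the dict key described in the Code Change Description, so the same helper could serve both detection and lookup.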
Interface Change description
The loaders and commands will have new options for file, sheet, and/or dataframe for the multiply represented peak group compounds.
The excel study doc from the submission interface's start page will contain a "Peak Group Conflicts" sheet if multiple representations exist. Only conflicting peak group compounds will be included.
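As a rough illustration of how only the conflicting rows could be collected for the sheet, the sketch below groups peak groups by sequence and normalized name and keeps those found in more than one file. The input shape (tuples of peak group name, sequence name, file name), the "File Choices" key holding the dropdown options, and the "/" synonym delimiter are all assumptions. For brevity it ignores whether the files actually share samples.

```python
from collections import defaultdict


def find_conflicts(records):
    """Return sheet rows for peak groups found in more than one file.

    `records` is an iterable of (peak_group_name, sequence_name, file_name)
    tuples. This input shape is hypothetical.
    """
    files_by_key = defaultdict(set)
    for pg_name, sequence, file_name in records:
        # Sorted, lower-cased name; the "/" delimiter is an assumption.
        key = "/".join(sorted(s.strip().lower() for s in pg_name.split("/")))
        files_by_key[(key, sequence)].add(file_name)

    rows = []
    for (key, sequence), files in sorted(files_by_key.items()):
        if len(files) > 1:  # only conflicting peak groups get a row
            rows.append({
                "Conflicting Peak Group": key,
                "Sequence Name": sequence,
                "File Choices": sorted(files),  # hypothetical dropdown options
            })
    return rows


records = [
    ("glucose", "seq1", "a.xlsx"),
    ("Glucose", "seq1", "b.xlsx"),
    ("lactate", "seq1", "a.xlsx"),
    ("glucose", "seq2", "c.xlsx"),
]
conflict_rows = find_conflicts(records)
print(conflict_rows)
```

If `find_conflicts` returns an empty list, the sheet would simply be left out of the study doc.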
Code Change Description
A new loader PeakGroupConflicts will be created, whose load_data method just returns data (it performs no load). The data will be a dict of file names keyed on sequence name and sorted and lower-cased peak group name. It will define a sheet named "Peak Group Conflicts" that contains the following columns in the same manner as all of the other loaders:
- Conflicting Peak Group
- Sequence Name
- Select a Peak Annotation File

A selected file will be required on every row. The Conflicting Peak Group and Sequence Name combo will be required to be unique.

Both the PeakAnnotationsLoader and PeakAnnotationFilesLoader (and their respective load command scripts) will take a file and sheet name, and the loaders will take a dataframe as well.

The PeakAnnotationsLoader will call PeakGroupConflicts.load_data() and save the returned dict and then test if that sequence and peak group name exists, and if it does, it checks that its file name matches. If it does not match and the sample is in both files, the peak group is skipped.

Tests
A test for each requirement
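The behavior in the Code Change Description, which the per-requirement tests would exercise, could be sketched roughly as follows. This is not the actual loader code. `load_selections`, `should_skip`, the `samples_by_file` lookup, and the "/" synonym delimiter are hypothetical stand-ins. Only the column names come from the design above.

```python
def _key(name):
    # Sorted, lower-cased peak group name; the "/" delimiter is an assumption.
    return "/".join(sorted(s.strip().lower() for s in name.split("/")))


def load_selections(rows):
    """Build {(sequence name, peak group key): selected file} from sheet rows.

    Performs no database load; it only validates and returns the data.
    """
    selections = {}
    for row in rows:
        selected = row.get("Select a Peak Annotation File")
        if not selected:
            # A selected file is required on every row.
            raise ValueError(f"No file selected for row: {row}")
        key = (row["Sequence Name"], _key(row["Conflicting Peak Group"]))
        if key in selections:
            # The peak group / sequence combo must be unique.
            raise ValueError(f"Duplicate peak group/sequence row: {key}")
        selections[key] = selected
    return selections


def should_skip(selections, sequence, pg_name, current_file, sample, samples_by_file):
    """Decide whether a peak group in `current_file` should be skipped.

    `samples_by_file` maps file name -> set of sample names (hypothetical).
    """
    selected = selections.get((sequence, _key(pg_name)))
    if selected is None or selected == current_file:
        return False  # no conflict recorded, or this is the chosen file
    # Skip only when the selected file also covers this sample.
    return sample in samples_by_file.get(selected, set())


rows = [{
    "Conflicting Peak Group": "Glucose",
    "Sequence Name": "seq1",
    "Select a Peak Annotation File": "a.xlsx",
}]
selections = load_selections(rows)
samples_by_file = {"a.xlsx": {"s1", "s2"}, "b.xlsx": {"s1", "s3"}}
print(should_skip(selections, "seq1", "glucose", "b.xlsx", "s1", samples_by_file))
print(should_skip(selections, "seq1", "glucose", "b.xlsx", "s3", samples_by_file))
```

Checking sample membership in the selected file is what keeps a peak group from being dropped from a file that simply covers different samples.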