Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

Create a column in the `Peak Group Conflicts` sheet to indicate pre-existing `PeakGroup`s in the DB #1232

Open hepcat72 opened 2 months ago

hepcat72 commented 2 months ago
I'm a little confused. Isn't that what we're doing / what I'm proposing? We construct the `Peak Group Conflicts` sheet including the existing DB records, for the researcher to select the PeakGroup from the file that optimally represents the compound (whether it's already in the DB or not). When they perform validation, they will get the `ReplacingPeakGroupRepresentation` warning describing the deletion of an existing `PeakGroup` and its replacement with the one from the selected file. **addendum**: By not going back and changing their selection, based on that warning, they are choosing to proceed with the deletion.

Perhaps your concern is about the timing - i.e., you're suggesting that at the time of the selection of the file from which a peak group is derived, it should be noted in some way that a representation of the PeakGroup already exists in the database?

I like the idea of providing that context during the selection. We have access to that data. We could add a column that says that a PeakGroup already exists from one of the files the user is prompted to select.
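As a rough illustration of what such a column could look like, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `ConflictRow` type, the `add_existing_column` helper, the column headers, and the shape of the `loaded` set are hypothetical and are not taken from the tracebase code.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ConflictRow:
    """Hypothetical stand-in for one row of the conflicts sheet."""
    sample: str
    compound: str
    candidate_files: Tuple[str, ...]


def add_existing_column(
    rows: List[ConflictRow],
    loaded: Set[Tuple[str, str, str]],
) -> List[Dict[str, str]]:
    """Build sheet rows with an extra column naming already-loaded files.

    `loaded` is assumed to hold (sample, compound, file) tuples for
    PeakGroups that are already in the database.
    """
    out = []
    for row in rows:
        # Candidate files whose PeakGroup for this sample/compound pre-exists.
        existing = [
            f for f in row.candidate_files
            if (row.sample, row.compound, f) in loaded
        ]
        out.append({
            "Sample": row.sample,                            # hypothetical header
            "Peak Group": row.compound,                      # hypothetical header
            "Candidate Files": ";".join(row.candidate_files),
            "Already Loaded From": ";".join(existing),       # the new column
        })
    return out
```

An empty `Already Loaded From` cell would mean that no candidate has been loaded yet, so no deletion can result from the selection on that row.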

The one question I have is, why does that matter? I'm not saying it doesn't. I'm just saying, how could that information affect the user's selection and why might it?

_Originally posted by @hepcat72 in https://github.com/Princeton-LSI-ResearchComputing/tracebase/pull/1222#discussion_r1777584331_

hepcat72 commented 2 months ago

Hey @mneinast, I would like to get your take on this. The issue description hasn't been edited yet to be very clear, but this is based on the discussion Lance and I had in the comment linked in the issue description...

We're wondering about the ramifications of how we handle the following situation: Since we allow/encourage users to submit study data before the study data has been fully compiled, it is technically^ possible that data could be loaded before we know about any multiple representations. On a subsequent load, multiple representations are detected and the researcher is prompted to select the best representation using that conflicts sheet. Thus it is possible that a previously loaded PeakGroup would have to be deleted (if the user selects the peak group from the new file) so that the selected representation can be loaded.

This is currently dealt with in PR #1225 in the following manner:

When a multiple representation exists, and the "not selected" PeakGroup already exists in the database:

  1. The existing PeakGroup is deleted and its replacement is loaded
  2. During the validate step, a ReplacingPeakGroupRepresentation warning is presented to the user that informs them that the previously loaded PeakGroup will be removed. They can choose to edit their selection in the conflicts sheet, if they don't want to delete the existing peak group.
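The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: the `ReplacingPeakGroupRepresentation` name comes from this thread, but its fields, the `plan_replacements` helper, and the dictionary shapes are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (sample, compound); an assumed way to identify a peak group


@dataclass(frozen=True)
class ReplacingPeakGroupRepresentation:
    """Stand-in for the warning named in the thread; the fields are assumptions."""
    sample: str
    compound: str
    existing_file: str
    selected_file: str


def plan_replacements(
    selections: Dict[Key, str],
    loaded: Dict[Key, str],
) -> Tuple[List[Tuple[str, str, str]], List[ReplacingPeakGroupRepresentation]]:
    """Return (peak groups to delete, warnings to show during validation).

    `selections` maps each conflicting peak group to the file the researcher
    chose; `loaded` maps peak groups already in the database to the file
    they were loaded from.
    """
    to_delete = []
    warnings = []
    for key, selected in selections.items():
        existing = loaded.get(key)
        # Only a loaded peak group from a *non-selected* file gets replaced.
        if existing is not None and existing != selected:
            to_delete.append((*key, existing))
            warnings.append(
                ReplacingPeakGroupRepresentation(key[0], key[1], existing, selected)
            )
    return to_delete, warnings
```

In this sketch, validation would surface `warnings` without touching the database, and the deletions would be applied only on a real load, which gives the researcher a chance to change the selection first.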

Lance has expressed some concern over this overall mechanism, and I agree that this should be thought through a bit more, so I would like to explore the following:

  1. What are the potential downsides of deleting existing peak groups, i.e. how can this "bite" users?
  2. How much time can pass between MSRuns running on the same samples (i.e. how long can the samples be stored between runs)? (This speaks to the likelihood of this scenario arising.^)
  3. If a researcher were to perform any analyses on existing data, and then load more data that replaces a peak group, how could that affect the previously performed analysis? (And how likely is this to happen?)
  4. What are any other thoughts you may have on deleting existing PeakGroups in this fashion?

Lance proposed an alternate mechanism for alerting the user to the chance of deleting existing peak groups, and I kind of like the idea: indicate in the conflicts sheet, in some way, that a peak group already exists in the database. We had different ideas on how to do that, but I'd be interested in hearing what idea you might come up with without hearing the strategies we discussed.

^ How likely is it that a peak annotation file could contain data on the same samples that were previously loaded? A user could of course sit on existing data (from the same MSRun) and load a complementary positive/negative scan at any time.