Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

chore: cleanup duplicate peakgroup records from tracebase-dev #974

Open lparsons opened 4 months ago

lparsons commented 4 months ago

FEATURE REQUEST

Inspiration

After upgrading tracebase-dev to the latest version in main, we have quite a few "duplicate" peak group records that use "fake" mzXML files. These were created by the migration in #949.

Description

We should assess what the best, most appropriate action is for each duplicate record in tracebase-dev and determine how best to update the database (and potentially the underlying datasets in tracebase-rabinowitz-data). These records can be identified by the "fake" mzXML file records, e.g. Archive File Record - SampleSample object (3765)_Sequence46_Michael Neinast_Dupe1_PeakGroupsToAddress-3-Ureidopropionic acid,creatine,cytidine,thymidine.mzXML (Sample3765_Sequence46_Dupe1)
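As a rough way to pick these records out, the filename pattern from the example above can be matched in plain Python. Note the pattern here is inferred from that single example filename, so treat the naming convention as an assumption:

```python
import re

# Naming convention inferred from the one example filename above; the real
# convention may differ, so verify against the actual Archive File records.
FAKE_MZXML_RE = re.compile(
    r"Sample object \((?P<sample_id>\d+)\)"
    r"_Sequence(?P<sequence_id>\d+)"
    r".*_Dupe(?P<dupe_num>\d+)_"
)

name = (
    "Sample object (3765)_Sequence46_Michael Neinast_Dupe1_"
    "PeakGroupsToAddress-3-Ureidopropionic acid,creatine,cytidine,thymidine.mzXML"
)
m = FAKE_MZXML_RE.search(name)
if m:
    # -> 3765 46 1
    print(m.group("sample_id"), m.group("sequence_id"), m.group("dupe_num"))
```

A filter like this could then drive a queryset over the archive file records, but that lookup would depend on the actual model fields.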

Alternatives

None

Dependencies

Comment

None


ISSUE OWNER SECTION

Assumptions

  1. List of assumptions that the code will not explicitly address/check
  2. E.g. We will assume input is correct (explaining why there is no validation)

Limitations

  1. A list of things this work will specifically not do
  2. E.g. This feature will only handle the most frequent use case X

Affected Components

Requirements

DESIGN

Interface Change description

None provided

Code Change Description

None provided

Tests

hepcat72 commented 1 month ago

We should assess what the best, most appropriate action is for each duplicate record in tracebase-dev and determine how best to update the database (and potentially the underlying datasets in tracebase-rabinowitz-data).

My strategy would be to do the update in the shell. Create a queryset and manually go through them one by one. Off the top of my head, this is how I would frame the process:

hepcat72 commented 1 month ago

Michael denoted which ones must be removed/retained in this issue comment in the rabinowitz repo.

hepcat72 commented 1 month ago

To copy that here...

2023-07-18

2022-09-09

Delete the peak groups with these names, and that link to these peak annotation files:

{
    "3-Ureidopropionic acid": [
        "exp048a_BetaalaHist_pos_highmz_cor_ion_counts.csv",
        "exp048a_Carn_pos_highmz_corrected.xlsx",
    ],
    "carnosine": ["exp048a_Carn_pos_highmz_corrected.xlsx"],
    "creatine": ["exp048a_BetaalaHist_negative_cor_ion_counts.csv"],
    "cytidine": [
        "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
        "exp048a_Carn_neg_corrected.xlsx",
    ],
    "thymidine": [
        "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
        "exp048a_Carn_neg_corrected.xlsx",
    ],
    "arginine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
    "lysine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
}
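Given that mapping, deciding whether a particular peak group is slated for deletion is a simple lookup. A minimal helper (the function name `should_delete` is illustrative, not from the codebase):

```python
def should_delete(peak_group_name, peak_annotation_filename, todelete):
    """Return True if this (peak group, annotation file) pair is marked for deletion."""
    return peak_annotation_filename in todelete.get(peak_group_name, [])

# Subset of the deletion map from the comment above
todelete = {
    "carnosine": ["exp048a_Carn_pos_highmz_corrected.xlsx"],
    "creatine": ["exp048a_BetaalaHist_negative_cor_ion_counts.csv"],
}

print(should_delete("carnosine", "exp048a_Carn_pos_highmz_corrected.xlsx", todelete))  # True
print(should_delete("creatine", "exp048a_Carn_pos_highmz_corrected.xlsx", todelete))   # False
```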
hepcat72 commented 1 month ago

Here's what I think we should do. This is a very rough sketch. It should be put inside an atomic transaction. I'll write it up more formally and create a PR. This just prints the commands we should execute.

In [11]: from DataRepo.models import PeakGroup, MSRunSample

In [11]: pgs = PeakGroup.objects.all()
    ...: ddict = {}
    ...: for pg in pgs:
    ...:     pgk = f"{pg.name} {pg.msrun_sample.sample} {pg.msrun_sample.msrun_sequence}"
    ...:     if pgk not in ddict.keys():
    ...:         ddict[pgk] = []
    ...:     ddict[pgk].append(pg)

In [11]: todelete = {
    ...:     "3-Ureidopropionic acid": [
    ...:         "exp048a_BetaalaHist_pos_highmz_cor_ion_counts.csv",
    ...:         "exp048a_Carn_pos_highmz_corrected.xlsx",
    ...:     ],
    ...:     "carnosine": ["exp048a_Carn_pos_highmz_corrected.xlsx"],
    ...:     "creatine": ["exp048a_BetaalaHist_negative_cor_ion_counts.csv"],
    ...:     "cytidine": [
    ...:         "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
    ...:         "exp048a_Carn_neg_corrected.xlsx",
    ...:     ],
    ...:     "thymidine": [
    ...:         "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
    ...:         "exp048a_Carn_neg_corrected.xlsx",
    ...:     ],
    ...:     "arginine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
    ...:     "lysine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
    ...: }

In [14]: msrs_to_move = {}
    ...: for dpgl in [pgl for pgl in ddict.values() if len(pgl) > 1]:
    ...:     print(f"{dpgl[0].msrun_sample.msrun_sequence}\n{dpgl[0].msrun_sample.sample.name}\n\t{dpgl[0].name}")
    ...:     tomove = None
    ...:     totake = None
    ...:     for dpg in dpgl:
    ...:         print(f"\t\tMSRunSample {dpg.msrun_sample.id} with {dpg.msrun_sample.ms_data_file.filename} has {dpg.msrun_sample.peak_groups.count()} peak groups")
    ...:         if dpg.name in todelete.keys() and dpg.peak_annotation_file.filename in todelete[dpg.name]:
    ...:             print(f"\t\t\tDELETE: PeakGroup {dpg.id} from '{dpg.peak_annotation_file.filename}'")
    ...:             tomove = dpg.msrun_sample.id
    ...:             print(f"dpg.delete()  # Deleting duplicate compound PeakGroup {dpg.id}")
    ...:         else:
    ...:             totake = dpg.msrun_sample.id
    ...:             print(f"\t\t\tTOKEEP: PeakGroup {dpg.id} from '{dpg.peak_annotation_file.filename}'")
    ...:     msrs_to_move[tomove] = totake
    ...: for move, take in msrs_to_move.items():
    ...:     moveobj = MSRunSample.objects.get(id=move)
    ...:     takeobj = MSRunSample.objects.get(id=take)
    ...:     for pg in moveobj.peak_groups.all():
    ...:         print(f"pg.msrun_sample_id = {takeobj.id}  # Moving PeakGroup {pg.id} ({pg.name}) to Placeholder MSRunSample {takeobj.id}")
    ...:         print("pg.save()")
    ...:     print(f"moveobj.delete()  # Deleting MSRunSample {moveobj.id}")
    ...:     print(f"takeobj.ms_data_file = None  # Deleting fake mzXML '{takeobj.ms_data_file.filename}'")
    ...:     print("takeobj.save()")
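The move-then-delete bookkeeping in the sketch above can be sanity-checked outside the ORM with plain dicts. This is purely illustrative; the ids and the `merge_msrun_samples` helper are stand-ins, not the Django models:

```python
def merge_msrun_samples(msrs_to_move, pg_owner):
    """Reassign each peak group from a to-be-deleted MSRunSample to the kept one.

    msrs_to_move: {msrs_id_to_delete: msrs_id_to_keep}
    pg_owner: {peak_group_id: msrs_id} ownership map, updated in place.
    Returns the set of MSRunSample ids that are now empty and can be deleted.
    """
    deleted = set()
    for move, take in msrs_to_move.items():
        for pg_id, owner in list(pg_owner.items()):
            if owner == move:
                pg_owner[pg_id] = take  # move the peak group to the placeholder
        deleted.add(move)  # the emptied MSRunSample can now be removed
    return deleted

# Hypothetical ids: peak groups 101 and 102 live on MSRunSample 7 (to delete),
# 103 already lives on MSRunSample 8 (to keep).
pg_owner = {101: 7, 102: 7, 103: 8}
deleted = merge_msrun_samples({7: 8}, pg_owner)
print(pg_owner)  # {101: 8, 102: 8, 103: 8}
print(deleted)   # {7}
```

In the real shell session the reassignment would be `pg.msrun_sample = takeobj; pg.save()` followed by `moveobj.delete()`, all inside a `transaction.atomic()` block as noted above.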
hepcat72 commented 49 minutes ago

This issue is technically done, but there is still the issue of peak groups for the same samples in different sequences (when these representations were identified, being in different sequences was OK/accounted for). It appears that all of them are blank samples? (I thought we always skipped blanks. I'm looking at the details in a shell now.):

hepcat72 commented 37 minutes ago

Looks like they're all in the cold exposure study. Here are some details...

3-hydroxybutyrate   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
C18:1   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
C18:2   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
citrate/isocitrate  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
creatine    col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
glutamate   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
glutamine   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
homocarnosine   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
isoleucine  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
lactate col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
leucine col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
malate  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
methionine  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
phenylalanine   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
proline col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
pyruvate    col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
serine  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
succinate   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
threonine   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
tryptophan  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
valine  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']

And here's the shell code to do it:

from DataRepo.models import PeakGroup
from collections import defaultdict
pgs = PeakGroup.objects.all()
ddict = defaultdict(lambda: defaultdict(dict))
for pg in pgs:
    ddict[pg.name][pg.msrun_sample.sample.name][pg.peak_annotation_file.filename] = pg.msrun_sample.sample.animal.studies
for pgn in ddict.keys():
    for sample in [s for s in ddict[pgn].keys() if len(ddict[pgn][s].keys()) > 1]:
        print(f"{pgn}\t{sample}\t{list(ddict[pgn][sample].keys())}\t{[','.join([s.name for s in r.all()]) for r in list(ddict[pgn][sample].values())]}")
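The same detection logic can be exercised on plain data, no database needed. A sketch where tuples stand in for PeakGroup rows:

```python
from collections import defaultdict

# (peak_group_name, sample_name, peak_annotation_filename) stand-ins for PeakGroup rows
rows = [
    ("creatine", "col005d_blank2", "col005d_dia and iwat_full scan_corrected.xlsx"),
    ("creatine", "col005d_blank2", "col005d_plasma negative_corrected.xlsx"),
    ("lactate", "col005d_blank2", "col005d_plasma negative_corrected.xlsx"),
]

ddict = defaultdict(lambda: defaultdict(set))
for name, sample, annot_file in rows:
    ddict[name][sample].add(annot_file)

# Flag (name, sample) pairs represented in more than one annotation file
dupes = [
    (name, sample)
    for name, samples in ddict.items()
    for sample, files in samples.items()
    if len(files) > 1
]
print(dupes)  # [('creatine', 'col005d_blank2')]
```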
hepcat72 commented 29 minutes ago

OK. It's all one sample (col005d_blank2), and that one isn't in the YAML. There's a note on it in the changes doc:

Note, "col005d_blank2" was explicitly NOT skipped because it's not actually a blank. Michael said that there are a chunk of samples whose names are offset. E.g. "col005d_blank2" is actually "col005d_D_01". Correspondingly, "col005d_D_12" is a blank.

I find it a tad concerning that it is in multiple sequences and that it alone has multiple representations, since it is a real sample and not a blank. Going to dig a bit further.

hepcat72 commented 19 minutes ago

@mneinast - OK, I suspect that col005d_blank2 may actually be sample "col005d_D_01" in only one of these 2 files:

And that one of those samples needs to be deleted. Can you confirm this? And if I'm right, we need to delete every peak group associated with the actual blank (not just the ones with multiple representations).

@lparsons - don't know if you'd like to take a look at this (I don't think you need to - I think I've got it - I just thought you might want to).