Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

chore: cleanup duplicate peakgroup records from tracebase-dev #974

Open lparsons opened 4 months ago

lparsons commented 4 months ago

FEATURE REQUEST

Inspiration

After upgrading tracebase-dev to the latest version in main, we have quite a few "duplicate" peak group records that use "fake" mzXML files. These were created by the migration in #949.

Description

We should assess what the best, most appropriate action is for each duplicate record in tracebase-dev and determine how best to update the database (and potentially the underlying datasets in tracebase-rabinowitz-data). These records can be identified by the "fake" mzXML file records, e.g. Archive File Record - SampleSample object (3765)_Sequence46_Michael Neinast_Dupe1_PeakGroupsToAddress-3-Ureidopropionic acid,creatine,cytidine,thymidine.mzXML (Sample3765_Sequence46_Dupe1)
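As a rough way to pick these records out, the filename pattern from the example above can be matched in plain Python. Note the pattern here is inferred from that single example filename, so treat the naming convention as an assumption:

```python
import re

# Naming convention inferred from the one example filename above; the real
# convention may differ, so verify against the actual Archive File records.
FAKE_MZXML_RE = re.compile(
    r"Sample object \((?P<sample_id>\d+)\)"
    r"_Sequence(?P<sequence_id>\d+)"
    r".*_Dupe(?P<dupe_num>\d+)_"
)

name = (
    "Sample object (3765)_Sequence46_Michael Neinast_Dupe1_"
    "PeakGroupsToAddress-3-Ureidopropionic acid,creatine,cytidine,thymidine.mzXML"
)
m = FAKE_MZXML_RE.search(name)
if m:
    # -> 3765 46 1
    print(m.group("sample_id"), m.group("sequence_id"), m.group("dupe_num"))
```

A filter like this could then drive a queryset over the archive file records, but that lookup would depend on the actual model fields.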

Alternatives

None

Dependencies

Comment

None


ISSUE OWNER SECTION

Assumptions

  1. List of assumptions that the code will not explicitly address/check
  2. E.g. We will assume input is correct (explaining why there is no validation)

Limitations

  1. A list of things this work will specifically not do
  2. E.g. This feature will only handle the most frequent use case X

Affected Components

Requirements

DESIGN

Interface Change description

None provided

Code Change Description

None provided

Tests

hepcat72 commented 1 month ago

We should assess what the best, most appropriate action is for each duplicate record in tracebase-dev and determine how best to update the database (and potentially the underlying datasets in tracebase-rabinowitz-data).

My strategy would be to do the update in the shell. Create a queryset and manually go through them one by one. Off the top of my head, this is how I would frame the process:

hepcat72 commented 1 month ago

Michael denoted which ones must be removed/retained in this issue comment in the rabinowitz repo.

hepcat72 commented 1 month ago

To copy that here...

2023-07-18

2022-09-09

Delete the peak groups with these names, and that link to these peak annotation files:

{
    "3-Ureidopropionic acid": [
        "exp048a_BetaalaHist_pos_highmz_cor_ion_counts.csv",
        "exp048a_Carn_pos_highmz_corrected.xlsx",
    ],
    "carnosine": ["exp048a_Carn_pos_highmz_corrected.xlsx"],
    "creatine": ["exp048a_BetaalaHist_negative_cor_ion_counts.csv"],
    "cytidine": [
        "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
        "exp048a_Carn_neg_corrected.xlsx",
    ],
    "thymidine": [
        "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
        "exp048a_Carn_neg_corrected.xlsx",
    ],
    "arginine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
    "lysine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
}
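Given that mapping, deciding whether a particular peak group is slated for deletion is a simple lookup. A minimal helper (the function name `should_delete` is illustrative, not from the codebase):

```python
def should_delete(peak_group_name, peak_annotation_filename, todelete):
    """Return True if this (peak group, annotation file) pair is marked for deletion."""
    return peak_annotation_filename in todelete.get(peak_group_name, [])

# Subset of the deletion map from the comment above
todelete = {
    "carnosine": ["exp048a_Carn_pos_highmz_corrected.xlsx"],
    "creatine": ["exp048a_BetaalaHist_negative_cor_ion_counts.csv"],
}

print(should_delete("carnosine", "exp048a_Carn_pos_highmz_corrected.xlsx", todelete))  # True
print(should_delete("creatine", "exp048a_Carn_pos_highmz_corrected.xlsx", todelete))   # False
```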
hepcat72 commented 1 month ago

Here's what I think we should do. This is a very rough sketch. It should be put inside an atomic transaction. I'll write it up more formally and create a PR. This just prints the commands we should execute.

In [11]: from DataRepo.models import PeakGroup, MSRunSample

In [11]: pgs = PeakGroup.objects.all()
    ...: ddict = {}
    ...: for pg in pgs:
    ...:     pgk = f"{pg.name} {pg.msrun_sample.sample} {pg.msrun_sample.msrun_sequence}"
    ...:     if pgk not in ddict.keys():
    ...:         ddict[pgk] = []
    ...:     ddict[pgk].append(pg)

In [11]: todelete = {
    ...:     "3-Ureidopropionic acid": [
    ...:         "exp048a_BetaalaHist_pos_highmz_cor_ion_counts.csv",
    ...:         "exp048a_Carn_pos_highmz_corrected.xlsx",
    ...:     ],
    ...:     "carnosine": ["exp048a_Carn_pos_highmz_corrected.xlsx"],
    ...:     "creatine": ["exp048a_BetaalaHist_negative_cor_ion_counts.csv"],
    ...:     "cytidine": [
    ...:         "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
    ...:         "exp048a_Carn_neg_corrected.xlsx",
    ...:     ],
    ...:     "thymidine": [
    ...:         "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
    ...:         "exp048a_Carn_neg_corrected.xlsx",
    ...:     ],
    ...:     "arginine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
    ...:     "lysine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
    ...: }

In [14]: msrs_to_move = {}
    ...: for dpgl in [pgl for pgl in ddict.values() if len(pgl) > 1]:
    ...:     print(f"{dpgl[0].msrun_sample.msrun_sequence}\n{dpgl[0].msrun_sample.sample.name}\n\t{dpgl[0].name}")
    ...:     tomove = None
    ...:     totake = None
    ...:     for dpg in dpgl:
    ...:         print(f"\t\tMSRunSample {dpg.msrun_sample.id} with {dpg.msrun_sample.ms_data_file.filename} has {dpg.msrun_sample.peak_groups.count()} peak groups")
    ...:         if dpg.name in todelete.keys() and dpg.peak_annotation_file.filename in todelete[dpg.name]:
    ...:             print(f"\t\t\tDELETE: PeakGroup {dpg.id} from '{dpg.peak_annotation_file.filename}'")
    ...:             tomove = dpg.msrun_sample.id
    ...:             print(f"dpg.delete()  # Deleting duplicate compound PeakGroup {dpg.id}")
    ...:         else:
    ...:             totake = dpg.msrun_sample.id
    ...:             print(f"\t\t\tTOKEEP: PeakGroup {dpg.id} from '{dpg.peak_annotation_file.filename}'")
    ...:     msrs_to_move[tomove] = totake
    ...: for move, take in msrs_to_move.items():
    ...:     moveobj = MSRunSample.objects.get(id=move)
    ...:     takeobj = MSRunSample.objects.get(id=take)
    ...:     for pg in moveobj.peak_groups.all():
    ...:         print(f"pg.msrun_sample_id = {takeobj.id}  # Moving PeakGroup {pg.id} ({pg.name}) to Placeholder MSRunSample {takeobj.id}")
    ...:         print("pg.save()")
    ...:     print(f"moveobj.delete()  # Deleting MSRunSample {moveobj.id}")
    ...:     print(f"takeobj.ms_data_file = None  # Deleting fake mzXML '{takeobj.ms_data_file.filename}'")
    ...:     print("takeobj.save()")
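The move-then-delete bookkeeping in the sketch above can be sanity-checked outside the ORM with plain dicts. This is purely illustrative; the ids and the `merge_msrun_samples` helper are stand-ins, not the Django models:

```python
def merge_msrun_samples(msrs_to_move, pg_owner):
    """Reassign each peak group from a to-be-deleted MSRunSample to the kept one.

    msrs_to_move: {msrs_id_to_delete: msrs_id_to_keep}
    pg_owner: {peak_group_id: msrs_id} ownership map, updated in place.
    Returns the set of MSRunSample ids that are now empty and can be deleted.
    """
    deleted = set()
    for move, take in msrs_to_move.items():
        for pg_id, owner in list(pg_owner.items()):
            if owner == move:
                pg_owner[pg_id] = take  # move the peak group to the placeholder
        deleted.add(move)  # the emptied MSRunSample can now be removed
    return deleted

# Hypothetical ids: peak groups 101 and 102 live on MSRunSample 7 (to delete),
# 103 already lives on MSRunSample 8 (to keep).
pg_owner = {101: 7, 102: 7, 103: 8}
deleted = merge_msrun_samples({7: 8}, pg_owner)
print(pg_owner)  # {101: 8, 102: 8, 103: 8}
print(deleted)   # {7}
```

In the real shell session the reassignment would be `pg.msrun_sample = takeobj; pg.save()` followed by `moveobj.delete()`, all inside a `transaction.atomic()` block as noted above.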
hepcat72 commented 49 minutes ago

This issue is technically done, but there is still the issue of peak groups for the same samples in different sequences (when these representations were identified, being in different sequences was OK/accounted for). It appears that all of them are blank samples? (I thought we always skipped blanks. I'm looking at the details in a shell now.):

hepcat72 commented 37 minutes ago

Looks like they're all in the cold exposure study. Here are some details...

3-hydroxybutyrate   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
C18:1   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
C18:2   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
citrate/isocitrate  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
creatine    col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
glutamate   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
glutamine   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
homocarnosine   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
isoleucine  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
lactate col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
leucine col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
malate  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
methionine  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
phenylalanine   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
proline col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
pyruvate    col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
serine  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
succinate   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
threonine   col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
tryptophan  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
valine  col005d_blank2  ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']

And here's the shell code to do it:

from DataRepo.models import PeakGroup
from collections import defaultdict
pgs = PeakGroup.objects.all()
ddict = defaultdict(lambda: defaultdict(dict))
for pg in pgs:
    ddict[pg.name][pg.msrun_sample.sample.name][pg.peak_annotation_file.filename] = pg.msrun_sample.sample.animal.studies
for pgn in ddict.keys():
    for sample in [s for s in ddict[pgn].keys() if len(ddict[pgn][s].keys()) > 1]:
        print(f"{pgn}\t{sample}\t{list(ddict[pgn][sample].keys())}\t{[','.join([s.name for s in r.all()]) for r in list(ddict[pgn][sample].values())]}")
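The same detection logic can be exercised on plain data, no database needed. A sketch where tuples stand in for PeakGroup rows:

```python
from collections import defaultdict

# (peak_group_name, sample_name, peak_annotation_filename) stand-ins for PeakGroup rows
rows = [
    ("creatine", "col005d_blank2", "col005d_dia and iwat_full scan_corrected.xlsx"),
    ("creatine", "col005d_blank2", "col005d_plasma negative_corrected.xlsx"),
    ("lactate", "col005d_blank2", "col005d_plasma negative_corrected.xlsx"),
]

ddict = defaultdict(lambda: defaultdict(set))
for name, sample, annot_file in rows:
    ddict[name][sample].add(annot_file)

# Flag (name, sample) pairs represented in more than one annotation file
dupes = [
    (name, sample)
    for name, samples in ddict.items()
    for sample, files in samples.items()
    if len(files) > 1
]
print(dupes)  # [('creatine', 'col005d_blank2')]
```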
hepcat72 commented 29 minutes ago

OK. It's all one sample (col005d_blank2), and that one isn't in the YAML. There's a note on it in the changes doc:

Note, "col005d_blank2" was explicitly NOT skipped because it's not actually a blank. Michael said that there are a chunk of samples whose names are offset. E.g. "col005d_blank2" is actually "col005d_D_01". Correspondingly, "col005d_D_12" is a blank.

I find it a tad concerning that it is in multiple sequences and that it alone has multiple representations, since it is a real sample and not a blank. Going to dig a bit further.

hepcat72 commented 19 minutes ago

@mneinast - OK, I suspect that col005d_blank2 may actually be sample "col005d_D_01" in only one of these 2 files:

And that one of those samples needs to be deleted. Can you confirm this? And if I'm right, we need to delete every peak group associated with the actual blank (not just the ones with multiple representations).

@lparsons - don't know if you'd like to take a look at this (I don't think you need to - I think I've got it - I just thought you might want to).