lparsons opened 4 months ago
We should assess what the best, most appropriate action is for each duplicate record in tracebase-dev and determine how best to update the database (and potentially the underlying datasets in tracebase-rabinowitz-data).
My strategy would be to do the update in the shell: create a queryset and manually go through the records one by one. Off the top of my head, this is how I would frame the process:
Michael denoted which ones must be removed/retained in this issue comment in the rabinowitz repo.
To copy that here...
2023-07-18
2022-09-09
Delete the peak groups with these names, and that link to these peak annotation files:
{
"3-Ureidopropionic acid": [
"exp048a_BetaalaHist_pos_highmz_cor_ion_counts.csv",
"exp048a_Carn_pos_highmz_corrected.xlsx",
],
"carnosine": ["exp048a_Carn_pos_highmz_corrected.xlsx"],
"creatine": ["exp048a_BetaalaHist_negative_cor_ion_counts.csv"],
"cytidine": [
"exp048a_BetaalaHist_negative_cor_ion_counts.csv",
"exp048a_Carn_neg_corrected.xlsx",
],
"thymidine": [
"exp048a_BetaalaHist_negative_cor_ion_counts.csv",
"exp048a_Carn_neg_corrected.xlsx",
],
"arginine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
"lysine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
}
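As a sanity check before touching the database, the deletion map above can be inverted to see, per annotation file, which peak group names are slated for removal. This is a plain-Python sketch (no Django involved); the dict literal simply mirrors the map above:

```python
from collections import defaultdict

# Peak group name -> annotation files from which that peak group should be deleted
to_delete = {
    "3-Ureidopropionic acid": [
        "exp048a_BetaalaHist_pos_highmz_cor_ion_counts.csv",
        "exp048a_Carn_pos_highmz_corrected.xlsx",
    ],
    "carnosine": ["exp048a_Carn_pos_highmz_corrected.xlsx"],
    "creatine": ["exp048a_BetaalaHist_negative_cor_ion_counts.csv"],
    "cytidine": [
        "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
        "exp048a_Carn_neg_corrected.xlsx",
    ],
    "thymidine": [
        "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
        "exp048a_Carn_neg_corrected.xlsx",
    ],
    "arginine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
    "lysine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
}

# Invert the map: annotation file -> peak group names to delete from it
by_file = defaultdict(list)
for name, files in to_delete.items():
    for fn in files:
        by_file[fn].append(name)

for fn, names in sorted(by_file.items()):
    print(f"{fn}: {names}")
```

This makes it easy to eyeball that each affected file loses only the expected peak groups (5 distinct files, 10 deletions total).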
Here's what I think we should do. This is a very rough sketch; it should be run inside an atomic transaction. I'll write it up more formally and create a PR. For now, this just prints the commands we should execute.
In [11]: from DataRepo.models import PeakGroup, MSRunSample
In [11]: pgs = PeakGroup.objects.all()
    ...: ddict = {}
    ...: for pg in pgs:
    ...:     pgk = f"{pg.name} {pg.msrun_sample.sample} {pg.msrun_sample.msrun_sequence}"
    ...:     if pgk not in ddict:
    ...:         ddict[pgk] = []
    ...:     ddict[pgk].append(pg)
In [11]: todelete = {
...: "3-Ureidopropionic acid": [
...: "exp048a_BetaalaHist_pos_highmz_cor_ion_counts.csv",
...: "exp048a_Carn_pos_highmz_corrected.xlsx",
...: ],
...: "carnosine": ["exp048a_Carn_pos_highmz_corrected.xlsx"],
...: "creatine": ["exp048a_BetaalaHist_negative_cor_ion_counts.csv"],
...: "cytidine": [
...: "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
...: "exp048a_Carn_neg_corrected.xlsx",
...: ],
...: "thymidine": [
...: "exp048a_BetaalaHist_negative_cor_ion_counts.csv",
...: "exp048a_Carn_neg_corrected.xlsx",
...: ],
...: "arginine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
...: "lysine": ["exp027f4_free_plasma and tissues_negative_corrected.xlsx"],
...: }
In [14]: msrs_to_move = {}
    ...: for dpgl in [pgl for pgl in ddict.values() if len(pgl) > 1]:
    ...:     print(f"{dpgl[0].msrun_sample.msrun_sequence}\n{dpgl[0].msrun_sample.sample.name}\n\t{dpgl[0].name}")
    ...:     tomove = None
    ...:     totake = None
    ...:     for dpg in dpgl:
    ...:         print(f"\t\tMSRunSample {dpg.msrun_sample.id} with {dpg.msrun_sample.ms_data_file.filename} has {dpg.msrun_sample.peak_groups.count()} peak groups")
    ...:         if dpg.name in todelete and dpg.peak_annotation_file.filename in todelete[dpg.name]:
    ...:             print(f"\t\t\tDELETE: PeakGroup {dpg.id} from '{dpg.peak_annotation_file.filename}'")
    ...:             tomove = dpg.msrun_sample.id
    ...:             print(f"dpg.delete()  # Deleting duplicate compound PeakGroup {dpg.id}")
    ...:         else:
    ...:             totake = dpg.msrun_sample.id
    ...:             print(f"\t\t\tTOKEEP: PeakGroup {dpg.id} from '{dpg.peak_annotation_file.filename}'")
    ...:     msrs_to_move[tomove] = totake
    ...: for move, take in msrs_to_move.items():
    ...:     moveobj = MSRunSample.objects.get(id=move)
    ...:     takeobj = MSRunSample.objects.get(id=take)
    ...:     for pg in moveobj.peak_groups.all():
    ...:         print(f"pg.msrun_sample_id = takeobj.id  # Moving PeakGroup {pg.id} ({pg.name}) to Placeholder MSRunSample {takeobj.id}")
    ...:         print("pg.save()")
    ...:     print(f"moveobj.delete()  # Deleting MSRunSample {moveobj.id}")
    ...:     print(f"takeobj.ms_data_file = None  # Deleting fake mzXML '{takeobj.ms_data_file.filename}'")
    ...:     print("takeobj.save()")
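The move/merge step can be exercised on plain data before running it against the database. This sketch (hypothetical ids, no Django involved) mirrors the logic: re-parent every peak group from the duplicate MSRunSample onto the kept placeholder, then drop the now-empty duplicate.

```python
# Hypothetical in-memory stand-ins for MSRunSample records (ids are made up):
# each maps an MSRunSample id to the list of PeakGroup ids it holds.
msrun_samples = {
    46: [101, 102],  # duplicate record whose peak groups must move
    12: [103],       # placeholder record to keep
}
msrs_to_move = {46: 12}  # duplicate id -> placeholder id

for move_id, take_id in msrs_to_move.items():
    # Re-parent every peak group from the duplicate onto the placeholder...
    msrun_samples[take_id].extend(msrun_samples[move_id])
    # ...then delete the now-empty duplicate MSRunSample.
    del msrun_samples[move_id]

print(msrun_samples)  # -> {12: [103, 101, 102]}
```

The real version does the same thing with `pg.msrun_sample_id` reassignment plus `pg.save()`, followed by `moveobj.delete()`, all inside one atomic transaction.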
This issue is technically done, but there is still the matter of peak groups for the same samples in different sequences (when these representations were identified, being in different sequences was OK/accounted for). It appears that all of them are blank samples? (I thought we always skipped blanks; I'm looking at the details in a shell now.)
Looks like they're all in the cold exposure study. Here are some details...
3-hydroxybutyrate col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
C18:1 col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
C18:2 col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
citrate/isocitrate col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
creatine col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
glutamate col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
glutamine col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
homocarnosine col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
isoleucine col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
lactate col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
leucine col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
malate col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
methionine col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
phenylalanine col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
proline col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
pyruvate col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
serine col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
succinate col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
threonine col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
tryptophan col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
valine col005d_blank2 ['col005d_dia and iwat_full scan_corrected.xlsx', 'col005d_plasma negative_corrected.xlsx'] ['Cold Exposure', 'Cold Exposure']
And here's the shell code to do it:
from DataRepo.models import PeakGroup
from collections import defaultdict

pgs = PeakGroup.objects.all()
ddict = defaultdict(lambda: defaultdict(dict))
for pg in pgs:
    ddict[pg.name][pg.msrun_sample.sample.name][pg.peak_annotation_file.filename] = pg.msrun_sample.sample.animal.studies
for pgn in ddict.keys():
    for sample in [s for s in ddict[pgn].keys() if len(ddict[pgn][s].keys()) > 1]:
        print(f"{pgn}\t{sample}\t{list(ddict[pgn][sample].keys())}\t{[','.join([s.name for s in r.all()]) for r in list(ddict[pgn][sample].values())]}")
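The same grouping idea, checked on toy data (the names and files below are made up for illustration): nested dicts keyed by peak group name, then sample, then annotation file; a duplicate representation is any (name, sample) pair seen in more than one file.

```python
from collections import defaultdict

# Toy (peak group name, sample, annotation file) triples;
# only the first pair appears in two files.
rows = [
    ("creatine", "blank2", "fileA.xlsx"),
    ("creatine", "blank2", "fileB.xlsx"),
    ("valine", "blank2", "fileA.xlsx"),
]

ddict = defaultdict(lambda: defaultdict(dict))
for name, sample, fn in rows:
    ddict[name][sample][fn] = True

# A duplicate is a (name, sample) pair recorded in more than one file
dupes = [
    (name, sample, sorted(files))
    for name, samples in ddict.items()
    for sample, files in samples.items()
    if len(files) > 1
]
print(dupes)  # -> [('creatine', 'blank2', ['fileA.xlsx', 'fileB.xlsx'])]
```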
OK. It's all one sample (col005d_blank2), and that one isn't in the yaml. There's a note on it in the changes doc:
Note, "col005d_blank2" was explicitly NOT skipped because it's not actually a blank. Michael said that there are a chunk of samples whose names are offset. E.g. "col005d_blank2" is actually "col005d_D_01". Correspondingly, "col005d_D_12" is a blank.
I find it a tad concerning that this sample is in multiple sequences and is the only one with multiple representations, given that it is a real sample and not a blank. Going to dig a bit further.
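If the name offset turns out to be systematic, a small rename map makes the intent explicit. The two names below come straight from the note in the changes doc; everything else (the map, the helper functions) is a hypothetical sketch, not how TraceBase actually stores samples:

```python
# Known corrections from the changes doc; extend as more offsets are confirmed.
rename = {
    "col005d_blank2": "col005d_D_01",  # not a blank: a real sample with an offset name
}
blanks = {"col005d_D_12"}  # actually a blank, despite its name

def true_name(sample_name: str) -> str:
    """Return the corrected sample name, if the recorded one is offset."""
    return rename.get(sample_name, sample_name)

def is_blank(sample_name: str) -> bool:
    """A sample is a blank if its *corrected* identity is in the blank set."""
    return true_name(sample_name) in blanks

print(true_name("col005d_blank2"), is_blank("col005d_blank2"))  # -> col005d_D_01 False
```

Resolving names through a map like this before any skip-blanks logic would avoid silently loading a real blank (col005d_D_12) while skipping a real sample (col005d_blank2).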
@mneinast - OK, I suspect that maybe col005d_blank2 is only actually sample "col005d_D_01" in one of these 2 files:
col005d_dia and iwat_full scan_corrected.xlsx
col005d_plasma negative_corrected.xlsx
And that one of those samples needs to be deleted. Can you confirm this? And if I'm right, we need to delete every peak group associated with the actual blank (not just the ones with multiple representations).
@lparsons - don't know if you'd like to take a look at this (I don't think you need to - I think I've got it - I just thought you might want to).
FEATURE REQUEST
Inspiration
After upgrading tracebase-dev to the latest version in main, we have quite a few "duplicate" peak group records that use "fake" mzXML files. These were created by the migration in #949.
Description
We should assess what the best, most appropriate action is for each duplicate record in tracebase-dev and determine how best to update the database (and potentially underlying datasets in tracebase-rabinowitz-data). These records can be identified by the "fake" mzXML file records, e.g. Archive File Record - SampleSample object (3765)_Sequence46_Michael Neinast_Dupe1_PeakGroupsToAddress-3-Ureidopropionic acid,creatine,cytidine,thymidine.mzXML (Sample3765_Sequence46_Dupe1)
Alternatives
None
Dependencies
Comment
None
ISSUE OWNER SECTION
Assumptions
Limitations
Affected Components
Requirements
DESIGN
Interface Change Description
None provided
Code Change Description
None provided
Tests