Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

task: generate a list of peak groups for the same compound and same sample #1201

Closed lparsons closed 1 month ago

lparsons commented 1 month ago
> Yeah, no. I'm not talking about different time points. I'm talking about the same samples run a second time. We have that in the DB currently and there is logic to prefer the run/sequence that was later.

It would be useful to generate a list of these and take a close look at the examples. It would also be useful to list out the various pieces of logic that prefer a later sequence.

_Originally posted by @lparsons in https://github.com/Princeton-LSI-ResearchComputing/tracebase/pull/1196#discussion_r1747631333_

hepcat72 commented 1 month ago

So here are the PeakGroups that have multiple representations on dev.

First though, these are the multiple compound representations per sample AND sequence. There are 180. (There are 201 when you exclude sequence from the unique constraint - scroll below for that.) Note, I never "committed" the run of the code from #1167 on dev. I only ever ran it in dry-run mode. Note also that I list the results here by compound, sample, and sequence and filter for only those whose peakgroups number greater than 1:

In [22]: for dct in PeakGroup.objects.values("name", "msrun_sample__sample__name", "msrun_sample__msrun_sequence__id").annotate(pgs_per_sample_seq=Count("name")).filter(pgs_per_sample_seq__gt=1):
    ...:     print(f"{dct['name']}\t{dct['msrun_sample__sample__name']}\tMSRunSequence {dct['msrun_sample__msrun_sequence__id']}\t{dct['pgs_per_sample_seq']}")
    ...: print(f"Number of peak groups with multiple representations (in a sample / sequence): {PeakGroup.objects.values('name', 'msrun_sample__sample__name', 'msrun_sample__msrun_sequence__id').annotate(pgs_per_sample_seq=Count('name')).filter(pgs_per_sample_seq__gt=1).count()}")
3-Ureidopropionic acid  exp048a_01_1080 MSRunSequence 46    2
creatine    exp048a_05_0240 MSRunSequence 46    2
thymidine   exp048a_03_0240 MSRunSequence 46    2
thymidine   exp048a_02_0000 MSRunSequence 46    2
thymidine   exp048a_03_0000 MSRunSequence 46    2
cytidine    exp048a_04_0240 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_02_0060 MSRunSequence 46    2
cytidine    exp048a_05_0240 MSRunSequence 46    2
cytidine    exp048a_01_0060 MSRunSequence 46    2
cytidine    exp048a_04_0060 MSRunSequence 46    2
cytidine    exp048a_03_0000 MSRunSequence 46    2
creatine    exp048a_03_0240 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_01_1440 MSRunSequence 46    2
cytidine    exp048a_02_0020 MSRunSequence 46    2
cytidine    exp048a_03_0060 MSRunSequence 46    2
creatine    exp048a_04_0240 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_04_0000 MSRunSequence 46    2
creatine    exp048a_05_0020 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_02_0000 MSRunSequence 46    2
cytidine    exp048a_02_0240 MSRunSequence 46    2
creatine    exp048a_04_0060 MSRunSequence 46    2
creatine    exp048a_01_0000 MSRunSequence 46    2
cytidine    exp048a_05_0060 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_02_0240 MSRunSequence 46    2
thymidine   exp048a_05_0240 MSRunSequence 46    2
cytidine    exp048a_02_0060 MSRunSequence 46    2
cytidine    exp048a_01_1440 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_05_0060 MSRunSequence 46    2
creatine    exp048a_02_0060 MSRunSequence 46    2
thymidine   exp048a_01_1440 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_04_0020 MSRunSequence 46    2
thymidine   exp048a_01_0060 MSRunSequence 46    2
cytidine    exp048a_03_0020 MSRunSequence 46    2
thymidine   exp048a_03_0060 MSRunSequence 46    2
creatine    exp048a_04_0020 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_04_0240 MSRunSequence 46    2
thymidine   exp048a_04_0000 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_05_0000 MSRunSequence 46    2
creatine    exp048a_04_0000 MSRunSequence 46    2
thymidine   exp048a_01_0000 MSRunSequence 46    2
creatine    exp048a_01_1440 MSRunSequence 46    2
thymidine   exp048a_02_0240 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_01_0240 MSRunSequence 46    2
creatine    exp048a_05_0060 MSRunSequence 46    2
cytidine    exp048a_04_0000 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_01_0000 MSRunSequence 46    2
thymidine   exp048a_05_0060 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_03_0240 MSRunSequence 46    2
thymidine   exp048a_01_0240 MSRunSequence 46    2
thymidine   exp048a_04_0060 MSRunSequence 46    2
thymidine   exp048a_03_0020 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_04_0060 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_01_0020 MSRunSequence 46    2
creatine    exp048a_01_0240 MSRunSequence 46    2
creatine    exp048a_03_0020 MSRunSequence 46    2
thymidine   exp048a_05_0000 MSRunSequence 46    2
creatine    exp048a_01_1080 MSRunSequence 46    2
thymidine   exp048a_04_0240 MSRunSequence 46    2
creatine    exp048a_02_0000 MSRunSequence 46    2
cytidine    exp048a_01_0240 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_03_0020 MSRunSequence 46    2
thymidine   exp048a_01_0020 MSRunSequence 46    2
creatine    exp048a_03_0000 MSRunSequence 46    2
cytidine    exp048a_03_0240 MSRunSequence 46    2
thymidine   exp048a_01_1080 MSRunSequence 46    2
cytidine    exp048a_04_0020 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_02_0020 MSRunSequence 46    2
thymidine   exp048a_02_0020 MSRunSequence 46    2
cytidine    exp048a_01_1080 MSRunSequence 46    2
cytidine    exp048a_02_0000 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_01_0060 MSRunSequence 46    2
creatine    exp048a_05_0000 MSRunSequence 46    2
creatine    exp048a_01_0060 MSRunSequence 46    2
creatine    exp048a_01_0020 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_03_0060 MSRunSequence 46    2
creatine    exp048a_02_0020 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_03_0000 MSRunSequence 46    2
creatine    exp048a_02_0240 MSRunSequence 46    2
cytidine    exp048a_01_0000 MSRunSequence 46    2
cytidine    exp048a_01_0020 MSRunSequence 46    2
creatine    exp048a_03_0060 MSRunSequence 46    2
cytidine    exp048a_05_0020 MSRunSequence 46    2
cytidine    exp048a_05_0000 MSRunSequence 46    2
thymidine   exp048a_04_0020 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_05_0240 MSRunSequence 46    2
thymidine   exp048a_05_0020 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_05_0020 MSRunSequence 46    2
thymidine   exp048a_02_0060 MSRunSequence 46    2
arginine    exp027f4_free_M02_brain MSRunSequence 51    2
lysine  exp027f4_free_M03_lung  MSRunSequence 51    2
lysine  exp027f4_free_M03_ceccon    MSRunSequence 51    2
lysine  exp027f4_free_M03_gWAT  MSRunSequence 51    2
lysine  exp027f4_free_M03_pancreas  MSRunSequence 51    2
lysine  exp027f4_free_M02_pancreas  MSRunSequence 51    2
arginine    exp027f4_free_M02_liver MSRunSequence 51    2
arginine    exp027f4_free_M03_iWAT  MSRunSequence 51    2
cytidine    exp048a_06_0000 MSRunSequence 46    2
carnosine   exp048a_06_0020 MSRunSequence 46    2
cytidine    exp048a_06_0060 MSRunSequence 46    2
lysine  exp027f4_free_M02_kidney    MSRunSequence 51    2
arginine    exp027f4_free_M03_pancreas  MSRunSequence 51    2
carnosine   exp048a_07_0060 MSRunSequence 46    2
lysine  exp027f4_free_M03_eye   MSRunSequence 51    2
lysine  exp027f4_free_M03_brain MSRunSequence 51    2
lysine  exp027f4_free_M02_BAT   MSRunSequence 51    2
cytidine    exp048a_07_0000 MSRunSequence 46    2
arginine    exp027f4_free_M03_skin  MSRunSequence 51    2
carnosine   exp048a_06_0000 MSRunSequence 46    2
thymidine   exp048a_06_0240 MSRunSequence 46    2
lysine  exp027f4_free_M03_jejunum   MSRunSequence 51    2
arginine    exp027f4_free_M03_BAT   MSRunSequence 51    2
lysine  exp027f4_free_M03_kidney    MSRunSequence 51    2
thymidine   exp048a_06_0000 MSRunSequence 46    2
lysine  exp027f4_free_M02_quad  MSRunSequence 51    2
arginine    exp027f4_free_M02_spleen    MSRunSequence 51    2
lysine  exp027f4_free_M03_quad  MSRunSequence 51    2
arginine    exp027f4_free_M02_kidney    MSRunSequence 51    2
3-Ureidopropionic acid  exp048a_06_0000 MSRunSequence 46    2
lysine  exp027f4_free_M03_colon MSRunSequence 51    2
carnosine   exp048a_07_0240 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_07_0060 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_06_0020 MSRunSequence 46    2
lysine  exp027f4_free_M03_stom  MSRunSequence 51    2
lysine  exp027f4_free_M02_brain MSRunSequence 51    2
arginine    exp027f4_free_M03_brain MSRunSequence 51    2
3-Ureidopropionic acid  exp048a_07_0020 MSRunSequence 46    2
arginine    exp027f4_free_M03_gWAT  MSRunSequence 51    2
arginine    exp027f4_free_M03_plasma    MSRunSequence 51    2
lysine  exp027f4_free_M02_spleen    MSRunSequence 51    2
arginine    exp027f4_free_M03_testis    MSRunSequence 51    2
arginine    exp027f4_free_M03_jejunum   MSRunSequence 51    2
thymidine   exp048a_07_0000 MSRunSequence 46    2
cytidine    exp048a_07_0020 MSRunSequence 46    2
lysine  exp027f4_free_M03_iWAT  MSRunSequence 51    2
lysine  exp027f4_free_M03_dia   MSRunSequence 51    2
thymidine   exp048a_07_0240 MSRunSequence 46    2
lysine  exp027f4_free_M03_testis    MSRunSequence 51    2
arginine    exp027f4_free_M03_dia   MSRunSequence 51    2
arginine    exp027f4_free_M03_lung  MSRunSequence 51    2
arginine    exp027f4_free_M03_spleen    MSRunSequence 51    2
arginine    exp027f4_free_M03_quad  MSRunSequence 51    2
lysine  exp027f4_free_M02_liver MSRunSequence 51    2
arginine    exp027f4_free_M03_ceccon    MSRunSequence 51    2
arginine    exp027f4_free_M02_pancreas  MSRunSequence 51    2
arginine    exp027f4_free_M03_heart MSRunSequence 51    2
lysine  exp027f4_free_M02_heart MSRunSequence 51    2
3-Ureidopropionic acid  exp048a_07_0000 MSRunSequence 46    2
arginine    exp027f4_free_M02_heart MSRunSequence 51    2
arginine    exp027f4_free_M03_liver MSRunSequence 51    2
thymidine   exp048a_06_0020 MSRunSequence 46    2
arginine    exp027f4_free_M03_colon MSRunSequence 51    2
3-Ureidopropionic acid  exp048a_06_0060 MSRunSequence 46    2
lysine  exp027f4_free_M03_heart MSRunSequence 51    2
carnosine   exp048a_06_0060 MSRunSequence 46    2
arginine    exp027f4_free_M02_quad  MSRunSequence 51    2
3-Ureidopropionic acid  exp048a_07_0240 MSRunSequence 46    2
cytidine    exp048a_07_0240 MSRunSequence 46    2
thymidine   exp048a_07_0060 MSRunSequence 46    2
arginine    exp027f4_free_M02_plasma_20220909142304 MSRunSequence 51    2
cytidine    exp048a_07_0060 MSRunSequence 46    2
cytidine    exp048a_06_0240 MSRunSequence 46    2
arginine    exp027f4_free_M02_colon MSRunSequence 51    2
lysine  exp027f4_free_M03_spleen    MSRunSequence 51    2
lysine  exp027f4_free_M03_BAT   MSRunSequence 51    2
lysine  exp027f4_free_M03_skin  MSRunSequence 51    2
arginine    exp027f4_free_M02_BAT   MSRunSequence 51    2
thymidine   exp048a_07_0020 MSRunSequence 46    2
carnosine   exp048a_07_0000 MSRunSequence 46    2
lysine  exp027f4_free_M02_plasma_20220909142304 MSRunSequence 51    2
arginine    exp027f4_free_M03_stom  MSRunSequence 51    2
cytidine    exp048a_06_0020 MSRunSequence 46    2
3-Ureidopropionic acid  exp048a_06_0240 MSRunSequence 46    2
carnosine   exp048a_07_0020 MSRunSequence 46    2
thymidine   exp048a_06_0060 MSRunSequence 46    2
lysine  exp027f4_free_M02_colon MSRunSequence 51    2
arginine    exp027f4_free_M03_eye   MSRunSequence 51    2
lysine  exp027f4_free_M03_plasma    MSRunSequence 51    2
lysine  exp027f4_free_M03_liver MSRunSequence 51    2
arginine    exp027f4_free_M03_kidney    MSRunSequence 51    2
carnosine   exp048a_06_0240 MSRunSequence 46    2
Number of peak groups with multiple representations (in a sample / sequence): 180

But if you consider ONLY peak group compound and SAMPLE (i.e. do not include the sequence in the unique constraint), we have 201 instances of multiple representations (listing by compound and sample):

In [18]: for dct in PeakGroup.objects.values("name", "msrun_sample__sample__name").annotate(pgs_per_sample=Count("name")).filter(pgs_per_sample__gt=1):
    ...:     print(f"{dct['name']}\t{dct['msrun_sample__sample__name']}\t{dct['pgs_per_sample']}")
    ...: print(f"Number of peak groups with multiple representations (in a sample regardless of sequence): {PeakGroup.objects.values('name', 'msrun_sample__sample__name').annotate(pgs_per_sample=Count('name')).filter(pgs_per_sample__gt=1).count()}")
3-hydroxybutyrate   col005d_blank2  2
3-Ureidopropionic acid  exp048a_01_0000 2
3-Ureidopropionic acid  exp048a_01_0020 2
3-Ureidopropionic acid  exp048a_01_0060 2
3-Ureidopropionic acid  exp048a_01_0240 2
3-Ureidopropionic acid  exp048a_01_1080 2
3-Ureidopropionic acid  exp048a_01_1440 2
3-Ureidopropionic acid  exp048a_02_0000 2
3-Ureidopropionic acid  exp048a_02_0020 2
3-Ureidopropionic acid  exp048a_02_0060 2
3-Ureidopropionic acid  exp048a_02_0240 2
3-Ureidopropionic acid  exp048a_03_0000 2
3-Ureidopropionic acid  exp048a_03_0020 2
3-Ureidopropionic acid  exp048a_03_0060 2
3-Ureidopropionic acid  exp048a_03_0240 2
3-Ureidopropionic acid  exp048a_04_0000 2
3-Ureidopropionic acid  exp048a_04_0020 2
3-Ureidopropionic acid  exp048a_04_0060 2
3-Ureidopropionic acid  exp048a_04_0240 2
3-Ureidopropionic acid  exp048a_05_0000 2
3-Ureidopropionic acid  exp048a_05_0020 2
3-Ureidopropionic acid  exp048a_05_0060 2
3-Ureidopropionic acid  exp048a_05_0240 2
3-Ureidopropionic acid  exp048a_06_0000 2
3-Ureidopropionic acid  exp048a_06_0020 2
3-Ureidopropionic acid  exp048a_06_0060 2
3-Ureidopropionic acid  exp048a_06_0240 2
3-Ureidopropionic acid  exp048a_07_0000 2
3-Ureidopropionic acid  exp048a_07_0020 2
3-Ureidopropionic acid  exp048a_07_0060 2
3-Ureidopropionic acid  exp048a_07_0240 2
arginine    exp027f4_free_M02_BAT   2
arginine    exp027f4_free_M02_brain 2
arginine    exp027f4_free_M02_colon 2
arginine    exp027f4_free_M02_heart 2
arginine    exp027f4_free_M02_kidney    2
arginine    exp027f4_free_M02_liver 2
arginine    exp027f4_free_M02_pancreas  2
arginine    exp027f4_free_M02_plasma_20220909142304 2
arginine    exp027f4_free_M02_quad  2
arginine    exp027f4_free_M02_spleen    2
arginine    exp027f4_free_M03_BAT   2
arginine    exp027f4_free_M03_brain 2
arginine    exp027f4_free_M03_ceccon    2
arginine    exp027f4_free_M03_colon 2
arginine    exp027f4_free_M03_dia   2
arginine    exp027f4_free_M03_eye   2
arginine    exp027f4_free_M03_gWAT  2
arginine    exp027f4_free_M03_heart 2
arginine    exp027f4_free_M03_iWAT  2
arginine    exp027f4_free_M03_jejunum   2
arginine    exp027f4_free_M03_kidney    2
arginine    exp027f4_free_M03_liver 2
arginine    exp027f4_free_M03_lung  2
arginine    exp027f4_free_M03_pancreas  2
arginine    exp027f4_free_M03_plasma    2
arginine    exp027f4_free_M03_quad  2
arginine    exp027f4_free_M03_skin  2
arginine    exp027f4_free_M03_spleen    2
arginine    exp027f4_free_M03_stom  2
arginine    exp027f4_free_M03_testis    2
C18:1   col005d_blank2  2
C18:2   col005d_blank2  2
carnosine   exp048a_06_0000 2
carnosine   exp048a_06_0020 2
carnosine   exp048a_06_0060 2
carnosine   exp048a_06_0240 2
carnosine   exp048a_07_0000 2
carnosine   exp048a_07_0020 2
carnosine   exp048a_07_0060 2
carnosine   exp048a_07_0240 2
citrate/isocitrate  col005d_blank2  2
creatine    col005d_blank2  2
creatine    exp048a_01_0000 2
creatine    exp048a_01_0020 2
creatine    exp048a_01_0060 2
creatine    exp048a_01_0240 2
creatine    exp048a_01_1080 2
creatine    exp048a_01_1440 2
creatine    exp048a_02_0000 2
creatine    exp048a_02_0020 2
creatine    exp048a_02_0060 2
creatine    exp048a_02_0240 2
creatine    exp048a_03_0000 2
creatine    exp048a_03_0020 2
creatine    exp048a_03_0060 2
creatine    exp048a_03_0240 2
creatine    exp048a_04_0000 2
creatine    exp048a_04_0020 2
creatine    exp048a_04_0060 2
creatine    exp048a_04_0240 2
creatine    exp048a_05_0000 2
creatine    exp048a_05_0020 2
creatine    exp048a_05_0060 2
creatine    exp048a_05_0240 2
cytidine    exp048a_01_0000 2
cytidine    exp048a_01_0020 2
cytidine    exp048a_01_0060 2
cytidine    exp048a_01_0240 2
cytidine    exp048a_01_1080 2
cytidine    exp048a_01_1440 2
cytidine    exp048a_02_0000 2
cytidine    exp048a_02_0020 2
cytidine    exp048a_02_0060 2
cytidine    exp048a_02_0240 2
cytidine    exp048a_03_0000 2
cytidine    exp048a_03_0020 2
cytidine    exp048a_03_0060 2
cytidine    exp048a_03_0240 2
cytidine    exp048a_04_0000 2
cytidine    exp048a_04_0020 2
cytidine    exp048a_04_0060 2
cytidine    exp048a_04_0240 2
cytidine    exp048a_05_0000 2
cytidine    exp048a_05_0020 2
cytidine    exp048a_05_0060 2
cytidine    exp048a_05_0240 2
cytidine    exp048a_06_0000 2
cytidine    exp048a_06_0020 2
cytidine    exp048a_06_0060 2
cytidine    exp048a_06_0240 2
cytidine    exp048a_07_0000 2
cytidine    exp048a_07_0020 2
cytidine    exp048a_07_0060 2
cytidine    exp048a_07_0240 2
glutamate   col005d_blank2  2
glutamine   col005d_blank2  2
homocarnosine   col005d_blank2  2
isoleucine  col005d_blank2  2
lactate col005d_blank2  2
leucine col005d_blank2  2
lysine  exp027f4_free_M02_BAT   2
lysine  exp027f4_free_M02_brain 2
lysine  exp027f4_free_M02_colon 2
lysine  exp027f4_free_M02_heart 2
lysine  exp027f4_free_M02_kidney    2
lysine  exp027f4_free_M02_liver 2
lysine  exp027f4_free_M02_pancreas  2
lysine  exp027f4_free_M02_plasma_20220909142304 2
lysine  exp027f4_free_M02_quad  2
lysine  exp027f4_free_M02_spleen    2
lysine  exp027f4_free_M03_BAT   2
lysine  exp027f4_free_M03_brain 2
lysine  exp027f4_free_M03_ceccon    2
lysine  exp027f4_free_M03_colon 2
lysine  exp027f4_free_M03_dia   2
lysine  exp027f4_free_M03_eye   2
lysine  exp027f4_free_M03_gWAT  2
lysine  exp027f4_free_M03_heart 2
lysine  exp027f4_free_M03_iWAT  2
lysine  exp027f4_free_M03_jejunum   2
lysine  exp027f4_free_M03_kidney    2
lysine  exp027f4_free_M03_liver 2
lysine  exp027f4_free_M03_lung  2
lysine  exp027f4_free_M03_pancreas  2
lysine  exp027f4_free_M03_plasma    2
lysine  exp027f4_free_M03_quad  2
lysine  exp027f4_free_M03_skin  2
lysine  exp027f4_free_M03_spleen    2
lysine  exp027f4_free_M03_stom  2
lysine  exp027f4_free_M03_testis    2
malate  col005d_blank2  2
methionine  col005d_blank2  2
phenylalanine   col005d_blank2  2
proline col005d_blank2  2
pyruvate    col005d_blank2  2
serine  col005d_blank2  2
succinate   col005d_blank2  2
threonine   col005d_blank2  2
thymidine   exp048a_01_0000 2
thymidine   exp048a_01_0020 2
thymidine   exp048a_01_0060 2
thymidine   exp048a_01_0240 2
thymidine   exp048a_01_1080 2
thymidine   exp048a_01_1440 2
thymidine   exp048a_02_0000 2
thymidine   exp048a_02_0020 2
thymidine   exp048a_02_0060 2
thymidine   exp048a_02_0240 2
thymidine   exp048a_03_0000 2
thymidine   exp048a_03_0020 2
thymidine   exp048a_03_0060 2
thymidine   exp048a_03_0240 2
thymidine   exp048a_04_0000 2
thymidine   exp048a_04_0020 2
thymidine   exp048a_04_0060 2
thymidine   exp048a_04_0240 2
thymidine   exp048a_05_0000 2
thymidine   exp048a_05_0020 2
thymidine   exp048a_05_0060 2
thymidine   exp048a_05_0240 2
thymidine   exp048a_06_0000 2
thymidine   exp048a_06_0020 2
thymidine   exp048a_06_0060 2
thymidine   exp048a_06_0240 2
thymidine   exp048a_07_0000 2
thymidine   exp048a_07_0020 2
thymidine   exp048a_07_0060 2
thymidine   exp048a_07_0240 2
tryptophan  col005d_blank2  2
valine  col005d_blank2  2
Number of peak groups with multiple representations (in a sample regardless of sequence): 201
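
The 21 extra rows in the second list are all for sample col005d_blank2, which does not appear in the per-sequence list. That is what you would expect when a sample's two peak groups for a compound come from different sequences: each (compound, sample, sequence) key then has a count of 1, and the pair only surfaces once the sequence is dropped from the grouping key. A minimal, Django-free sketch of the two groupings follows. The rows are illustrative only, and the sequence id 99 is made up, not taken from dev:

```python
from collections import Counter

# Hypothetical rows: (peak group name, sample name, sequence id).
rows = [
    ("creatine", "exp048a_01_0000", 46),
    ("creatine", "exp048a_01_0000", 46),  # same sample, same sequence
    ("lactate", "col005d_blank2", 46),
    ("lactate", "col005d_blank2", 99),  # same sample, different sequence (99 is made up)
    ("serine", "exp048a_01_0000", 46),  # single representation
]


def multiple_representations(rows, include_sequence):
    """Return the grouping keys that occur more than once."""
    keys = (row if include_sequence else row[:2] for row in rows)
    return sorted(key for key, n in Counter(keys).items() if n > 1)


per_sample_seq = multiple_representations(rows, include_sequence=True)
per_sample = multiple_representations(rows, include_sequence=False)
```

Here `per_sample_seq` contains only the creatine key, while `per_sample` also picks up the lactate pair that was split across two sequences.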

hepcat72 commented 1 month ago

As far as the logic goes, it is only used for retrieving the "last" peak group for a tracer from either any given sample (in DataRepo.models.sample.py) or from the "last" serum sample (in DataRepo.models.fcirc.py):

Note the order_by("msrun_sample__msrun_sequence__date") followed by last() in both functions. That is, we sort by date, but two peak groups (from the same sample but different sequences) can share the latest date and both land at the end of the ordered results. When that happens, the relative order of those two peak groups is arbitrary, so which one last() returns is arbitrary too.

DataRepo.models.sample.py:

    def last_tracer_peak_groups(self):
        """
        Retrieves the last Peak Group for each tracer compound
        """

        # Get every tracer's compound
        if self.animal.tracers.count() == 0:
            warnings.warn(f"Animal [{self.animal}] has no tracers.")
            return PeakGroup.objects.none()

        # Get the last peakgroup for each tracer
        last_peakgroup_ids = []
        for tracer in self.animal.tracers.all():
            tracer_peak_group = (
                PeakGroup.objects.filter(msrun_sample__sample__id__exact=self.id)
                .filter(compounds__id__exact=tracer.compound.id)
                .order_by("msrun_sample__msrun_sequence__date")
                .last()
            )
            if tracer_peak_group:
                last_peakgroup_ids.append(tracer_peak_group.id)
            else:
                warnings.warn(
                    f"Sample {self} has no peak group for tracer compound: [{tracer.compound}]."
                )
                return PeakGroup.objects.none()

        return PeakGroup.objects.filter(id__in=last_peakgroup_ids)

DataRepo.models.fcirc.py:

    def peak_groups(self):
        """
        Retrieve all PeakGroups for this serum sample and tracer, regardless of msrun_sequence date.

        Currently unused - see docstring in self.is_last_serum_peak_group
        """
        from DataRepo.models.peak_group import PeakGroup

        peakgroups = (
            PeakGroup.objects.filter(msrun_sample__sample__exact=self.serum_sample)
            .filter(compounds__exact=self.tracer.compound)
            .order_by("msrun_sample__msrun_sequence__date")
        )

        if peakgroups.count() == 0:
            warnings.warn(
                f"Serum sample {self.serum_sample} has no peak group for tracer {self.tracer}."
            )

        return peakgroups.all()

Note that the "last serum sample" code is a separate function.
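
To illustrate the tie described above, here is a sketch in plain Python of what happens when two peak groups share a sequence date, and how a secondary sort key makes the choice deterministic. Using the primary key as the tiebreaker is only an assumption for illustration, not something decided in this thread. In the ORM it would presumably correspond to adding "id" as a second argument to order_by. The `PG` class and the dates are hypothetical stand-ins:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PG:  # hypothetical stand-in for a PeakGroup row
    id: int
    name: str
    sequence_date: date


pgs = [
    PG(7, "lysine", date(2022, 9, 9)),
    PG(3, "lysine", date(2022, 9, 9)),  # same date as id 7: a tie
    PG(1, "lysine", date(2022, 1, 1)),
]


def last_by_date(pgs):
    # Ties keep their input order here (Python's sort is stable),
    # so the result depends on how the rows happened to arrive.
    return sorted(pgs, key=lambda pg: pg.sequence_date)[-1]


def last_by_date_then_id(pgs):
    # The id breaks ties, so the result no longer depends on input order.
    return sorted(pgs, key=lambda pg: (pg.sequence_date, pg.id))[-1]
```

With the date-only key, reversing the input changes which peak group comes back. With the compound key, the same peak group is returned either way.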

hepcat72 commented 1 month ago

This satisfies this task. I will close this and link the multrep discussion page to this issue.

lparsons commented 1 month ago

I was looking for some additional info on each multiple representation, so I reworked the query. It's not nearly as concise a solution as yours, @hepcat72, but it does output the peak annotation files and the study names. Also, for some reason I get 202 records, not 201 like you did. I'm not sure what is different.

The script is at tracebase.princeton.edu:/var/www/tracebase/tracebase-multiple-representations.py.

#!/usr/bin/env python
# coding: utf-8

from DataRepo.models import Compound, PeakGroup, Sample
import pandas as pd

# One row per (peak group, linked compound) pair
pgdf = pd.DataFrame(list(PeakGroup.objects.values('id', 'msrun_sample__sample', 'compounds')))

# Count peak groups per sample/compound pair and keep the pairs with more than one
sample_compound_count = pgdf.groupby(['msrun_sample__sample', 'compounds']).count()
multiple_representations = sample_compound_count.loc[(sample_compound_count['id'] > 1)]

print("sample\tcompound\tpeak_annotation_files\tstudies")
for row in multiple_representations.reset_index().itertuples():
    sample = Sample.objects.get(pk=row.msrun_sample__sample)
    compound = Compound.objects.get(pk=row.compounds)
    studies = list(sample.animal.studies.all().values_list("name", flat=True))
    peak_groups = PeakGroup.objects.filter(msrun_sample__sample=sample, compounds=compound)
    print(f"{sample}\t{compound}\t{list(peak_groups.values_list('peak_annotation_file__filename', flat=True))}\t{studies}")

The results are in a Google Sheet: https://docs.google.com/spreadsheets/d/1HsLdBP1AU4OqTphWhUEtRn6lGTzp-gkBAe2Ui5rHncY/edit?usp=sharing

hepcat72 commented 1 month ago

I suspect it's because you used PeakGroup.compounds instead of PeakGroup.name, resulting in separate rows for citrate and isocitrate. Those get listed as 1 indistinguishable peak group.

(I get 1 row for the pair: citrate/isocitrate col005d_blank2 2.)
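
A small sketch of why grouping on compounds can yield one more row than grouping on name. It assumes (as suspected above, not verified against the schema here) that a peak group named citrate/isocitrate links to two compound records, so a query that includes compounds emits one row per linked compound. The ids and rows below are made up:

```python
from collections import Counter

# Hypothetical peak groups: (id, name, sample, linked compounds)
peak_groups = [
    (1, "citrate/isocitrate", "col005d_blank2", ["citrate", "isocitrate"]),
    (2, "citrate/isocitrate", "col005d_blank2", ["citrate", "isocitrate"]),
    (3, "lactate", "col005d_blank2", ["lactate"]),
    (4, "lactate", "col005d_blank2", ["lactate"]),
]

# Grouping on the peak group name: one key per (name, sample)
by_name = Counter((name, sample) for _, name, sample, _ in peak_groups)

# Grouping on linked compounds: each peak group contributes one key per compound
by_compound = Counter(
    (compound, sample)
    for _, _, sample, compounds in peak_groups
    for compound in compounds
)

name_rows = [key for key, n in by_name.items() if n > 1]
compound_rows = [key for key, n in by_compound.items() if n > 1]
```

In this toy data, `name_rows` has two entries and `compound_rows` has three, because the citrate/isocitrate pair is reported once under each compound.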

lparsons commented 1 month ago

> I suspect it's because you used PeakGroup.compounds instead of PeakGroup.name, resulting in separate rows for citrate and isocitrate. Those get listed as 1 indistinguishable peak group.
>
> (I get 1 row for the pair: citrate/isocitrate col005d_blank2 2.)

Yeah, that's likely it. Since we use the researcher-provided text for the peak group name, I didn't want to use that and risk missing a duplicate that used a synonym. But you're right, this logic isn't quite correct and creates two rows. It really should be a single row.

hepcat72 commented 1 month ago

> I didn't want to use that and risk missing a duplicate that used a synonym

As far as I know, we allow multiple peak groups linked to the same compound if a synonym is used (because the synonym could represent a qualitatively different compound, e.g. a stereoisomer). Hence, I have inferred that those are not multiple representations. For example, you could theoretically have a peak group for L-Threonine and one for D-Threonine from the same sample. I'd inferred that this was part of the point of allowing synonyms in the peak group name: both would intentionally link to the same compound. If that's what we want, then the method you used would risk deleting a valid peak group.

That said, I realize that we never made an explicit decision about this. I simply inferred that the reasoning for allowing different synonyms (and making the unique constraint hinge on the name) logically led to allowing different peak groups of the same compound.

Michael has indicated, however, that he thought we perhaps should not support differentiation of stereoisomers, so I have been assuming that we would eventually decide to go that route. In that context, your method is what we want...

But I think that if that's the case, then we should re-think the peak group name and the unique constraint (as well as the clean method).
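
To make the two policies concrete, here is a sketch of how the same data is judged under a name-based key versus a compound-based key. The peak group names and the helper `duplicates` are hypothetical, and the sketch does not reflect the actual unique constraint or clean method:

```python
from collections import defaultdict

# Hypothetical peak groups in one sample: (peak group name, linked compound)
peak_groups = [
    ("L-threonine", "threonine"),
    ("threonine", "threonine"),  # synonym linked to the same compound record
    ("lysine", "lysine"),
    ("lysine", "lysine"),  # same name twice
]


def duplicates(peak_groups, key):
    """Group by 'name' or 'compound'; return keys with more than one peak group."""
    index = 0 if key == "name" else 1
    groups = defaultdict(list)
    for pg in peak_groups:
        groups[pg[index]].append(pg[0])
    return {k: names for k, names in groups.items() if len(names) > 1}


by_name = duplicates(peak_groups, "name")
by_compound = duplicates(peak_groups, "compound")
```

Keyed on name, only lysine is flagged and the two threonine synonyms are treated as distinct, valid peak groups. Keyed on compound, the threonine pair is flagged as well, which is the behavior we would want only if we decide not to differentiate stereoisomers.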