Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

Add PeakGroupSet table class #73

Closed lparsons closed 3 years ago

lparsons commented 3 years ago

FEATURE DESCRIPTION

Feature Inspiration

We need a way to track the provenance of imported PeakGroup data, both for error checking and later lookup. Also, there is potentially some value in the grouping of PeakGroup data provided by a single AccuCor file (data analysis session). For example, if a researcher performs a second analysis of the same MSRun data, adding new compounds to the data, we currently have no way to tell that some PeakGroups were added separately from the original PeakGroup import.

Feature Description

Add a PeakGroupSet class that tracks the filename and import time of a given AccuCor file, and associate the PeakGroups with that PeakGroupSet record.

Alternatives Considered

We had discussed tracking simply the filename, though tracking the time of import seems simple enough to add.

We also discussed tracking the researcher that uploaded the data, but since we have not incorporated the user model yet and are not providing uploads via the web interface, this will be implemented later if needed.

Comment

[schema diagram image; text version below]

# Track the source of PeakGroup data
PeakGroupSet
-
id PK int
filename string
imported_timestamp datetime

PeakGroup
-
id PK int
name string     # Compound or isomer group name
formula string  # Molecular formula
compound_id int FK >-< Compound.id
peak_group_set_id int FK >- PeakGroupSet.id
ms_run_id int FK >- MSRun.id
# Additional computed values
# total_abundance
# enrichment_fraction
# enrichment_abundance
# normalized_labeling
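The schema above could be sketched as Django models roughly like this. This is an illustrative fragment, not the actual `DataRepo/models.py`; `max_length` values and `auto_now_add` for the import timestamp are assumptions.

```python
# Sketch of the proposed models, assuming Django's ORM. Field names follow
# the schema diagram above; auto_now_add (set once at record creation) is
# an assumed default for imported_timestamp.
from django.db import models


class PeakGroupSet(models.Model):
    """Tracks the provenance of an imported AccuCor file."""
    filename = models.CharField(max_length=256)
    imported_timestamp = models.DateTimeField(auto_now_add=True)


class PeakGroup(models.Model):
    name = models.CharField(max_length=256)     # compound or isomer group name
    formula = models.CharField(max_length=256)  # molecular formula
    compound = models.ForeignKey("Compound", on_delete=models.CASCADE)
    peak_group_set = models.ForeignKey(PeakGroupSet, on_delete=models.CASCADE)
    ms_run = models.ForeignKey("MSRun", on_delete=models.CASCADE)
```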

ISSUE OWNER SECTION

Assumptions

Requirements

Limitations

Affected/Changed Components

DESIGN

GUI Change description

Input APIs and usage remain unchanged. Views and queries would presumably adjust to use this data.

Code Change Description (Pseudocode optional)

use os.path.basename
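For the filename field, the idea is to store only the base name of the AccuCor file, stripping whatever directory path the loader happened to be given. A minimal sketch (the path is hypothetical):

```python
# Keep only the base filename of the AccuCor file, not the full path.
import os

accucor_path = "/data/uploads/accucor_results.xlsx"  # hypothetical path
filename = os.path.basename(accucor_path)
print(filename)  # accucor_results.xlsx
```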

Tests

A test should be planned for each requirement (above), where possible.

jcmatese commented 3 years ago

Note: this will likely require a database flush for these migrations, if PeakGroupSet is required.

jcmatese commented 3 years ago

@lparsons @hepcat72 @fkang-pu Regarding filename, are there any constraints to this table? ( i.e. is filename unique ?). Asking because it is easy to add a single/cached PeakGroupSet for a single run of load_accucor_msruns.py , but if you run it multiple distinct times with different peak sets but potentially the same filenames, things might be a little tricky to interpret afterwards. Just asking for a first-pass specification, because load_accucor_msruns.py wasn't really coded with the "add new peak groups to previously loaded/inferred msruns" mindset (#65).

jcmatese commented 3 years ago

I guess I will just insert 1, no matter the name, and hope for the best

lparsons commented 3 years ago

I don't see a big concern right now about constraints. I think ultimately, we'd want to store that file somewhere for provenance.

As for load_accucor_msruns.py, I would probably make the assumption that the filename was unique and do a get_or_create using the filename as a lookup. At least, to me, that would work for a while. Please let me know if you have an alternate suggestion though, I think you've probably thought about this more deeply than I at this point.
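The `get_or_create`-by-filename approach described above could look roughly like this. This is an ORM sketch, not code from the actual `load_accucor_msruns.py`:

```python
# Sketch of the lookup-by-filename approach discussed above; an
# illustrative fragment, not the actual loader code.
import os


def get_peak_group_set(accucor_path):
    peak_group_set, created = PeakGroupSet.objects.get_or_create(
        filename=os.path.basename(accucor_path)
    )
    # created is False when a set with this filename was already loaded --
    # exactly the ambiguity raised in the comments that follow.
    return peak_group_set
```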

jcmatese commented 3 years ago

It is just that with the imported_timestamp presumably defaulting to "now" in the model (e.g. this load_accucor_msruns.py execution), we should probably never "get"/retrieve a previous one, because that was a different time and potentially a different file (albeit with a shared name).

lparsons commented 3 years ago

Good point @jcmatese. Would it make you more comfortable to add the unique constraint on the filename? That seems reasonable to me.

jcmatese commented 3 years ago

Nah, that is ok, as long as everyone is ok with the interpretation (always report as "this file at this date time")

fkang-pu commented 3 years ago

It's a good idea to add a unique constraint on the filename.

jcmatese commented 3 years ago

OK, I can add it if you all want.
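The agreed-upon constraint could be expressed as `unique=True` on the field, which Django enforces at the database level. A sketch (again, field details are assumptions, not the actual model):

```python
# Sketch: enforce one PeakGroupSet per filename at the database level.
from django.db import models


class PeakGroupSet(models.Model):
    filename = models.CharField(max_length=256, unique=True)
    imported_timestamp = models.DateTimeField(auto_now_add=True)
```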

jcmatese commented 3 years ago

@lparsons @fkang-pu @hepcat72 Regarding foreign-key/attribute naming conventions, do you have a strong preference going forward on how to name columns/methods? For example, the existing PeakGroup.ms_run references the MSRun class (i.e. with an added underscore), https://github.com/Princeton-LSI-ResearchComputing/tracebase/blob/e9be74d48cc5b66eb5d112193fe177f7f18070f9/DataRepo/models.py#L216-L222 — so the presumptive addition in this issue would be either PeakGroup.peakgroupset or PeakGroup.peak_group_set for the foreign key from PeakGroup to PeakGroupSet.

jcmatese commented 3 years ago

May want to add a note on the coding/naming styles going forward, if a consensus decision is made.

jcmatese commented 3 years ago

Generally, camelcase to underscore seems to be the more popular representation.

lparsons commented 3 years ago

Good question John. I would prefer we be consistent with how other fields and relationships are named. I believe that would be using underscores (_), so probably PeakGroup.peak_group_set. I'm fine with updating those notes, but I'd prefer they live in the repository in git and not on nplcadmin docs (e.g. CONTRIBUTING.md or README.md).

fkang-pu commented 3 years ago

Using underscore is fine with me - easy to read. So far, I can retrieve the value(s) for foreign keys as long as I use the field names defined for that model class.

jcmatese commented 3 years ago

Just noting that to add this non-nullable foreign key to PeakGroup, I will probably need to revert to an earlier version of the database/model, and make migrations from there...

$ python manage.py migrate DataRepo 0003
Operations to perform:
  Target specific migration: 0003_auto_20210402_1105, from DataRepo
Running migrations:
  Rendering model states... DONE
  Unapplying DataRepo.0011_peakdata_help_test... OK
  Unapplying DataRepo.0010_merge_20210504_1613... OK
  Unapplying DataRepo.0009_auto_20210429_1433... OK
  Unapplying DataRepo.0008_changed_unique_constraints_coding... OK
  Unapplying DataRepo.0007_peakdata_validator... OK
  Unapplying DataRepo.0006_msrun_researcher... OK
  Unapplying DataRepo.0006_auto_20210503_1855... OK
  Unapplying DataRepo.0005_auto_20210421_1054... OK
  Unapplying DataRepo.0004_PeakGroup_Data... OK

jcmatese commented 3 years ago

Had to go back even farther.

$ python manage.py makemigrations --name PeakGroupSet DataRepo
Migrations for 'DataRepo':
  DataRepo/migrations/0002_PeakGroupSet.py
    - Create model Animal
    - Create model MSRun
    - Create model PeakGroupSet
    - Create model Protocol
    - Create model Study
    - Create model Tissue
    - Alter field hmdb_id on compound
    - Create model Sample
    - Create model PeakGroup
    - Create model PeakData
    - Add field protocol to msrun
    - Add field sample to msrun
    - Add field studies to animal
    - Add field tracer_compound to animal
    - Create constraint unique_peakgroup on model peakgroup
    - Create constraint unique_peakdata on model peakdata
    - Create constraint unique_msrun on model msrun

Adding any non-nullable foreign key can be an issue, migration-wise. I'm not certain of the best approach, and this may well come up again with Animal.treatment.

lparsons commented 3 years ago

Any reason not to regenerate the migrations completely at this point? Going forward, if we have to do this with real data, there needs to be a method to set the field (either a default or by lookup).

jcmatese commented 3 years ago

We could regenerate completely, I suppose. The issue with these non-nullable foreign key additions is that you might first have to create the model (PeakGroupSet, migration no. 1), then insert the initial default value (say filename=NA, possibly also in migration no. 1), then add the default to the dependent model (PeakGroup with that id=1 default, migration no. 2). At least, I think that is how you might have to do it, and that would seem to require a custom-edited migration file (because of the insert and the stepwise additions). I am just reverting back (above), because it doesn't really matter now, but if the database ever had values we wanted to keep, we would have to figure out the appropriate protocol to pull this off.
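The stepwise approach described here (create the model, insert a placeholder row, then add the non-nullable FK with that row as the default) would typically be done with a custom data migration using `RunPython`. A sketch, with the migration dependency and placeholder filename as assumptions:

```python
# Sketch of a custom data migration for the scenario described above:
# insert a placeholder PeakGroupSet so a subsequent migration can add the
# non-nullable FK on PeakGroup with a default. Names are illustrative.
from django.db import migrations


def create_placeholder_set(apps, schema_editor):
    # Use the historical model, not a direct import from models.py.
    PeakGroupSet = apps.get_model("DataRepo", "PeakGroupSet")
    PeakGroupSet.objects.get_or_create(filename="NA")


class Migration(migrations.Migration):
    dependencies = [("DataRepo", "0001_initial")]  # hypothetical dependency
    operations = [
        migrations.RunPython(create_placeholder_set, migrations.RunPython.noop),
    ]
```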

lparsons commented 3 years ago

Good point, I do think they would require some custom migrations in order to be applied cleanly on top of existing records.

jcmatese commented 3 years ago

I ended up in kind of a mess. Had to drop the database and re-init, here. Wasn't what I intended, but that is what happened...

hepcat72 commented 3 years ago

Very good issue description and issue owner documentation! Clear, concise, and comprehensive.