Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License
4 stars 1 forks source link

Infusate and Tracer model creation and validation #444

Closed jcmatese closed 2 years ago

jcmatese commented 2 years ago

FEATURE DESCRIPTION

Feature Inspiration

Once an infusate string has been parsed, there is some additional validation to be performed. A Tracer that is not fully labeled (e.g. all copies of a labeled element are labeled) must contain positional information for each labeled element. Also, labeled elements must exist in the formula of the related compound. The data must ultimately lead to valid model objects created (or retrieved) from the database. This in intended to be used during loading of new datasets.

Feature Description

Input is data structures from #407.

Alternatives

A brief description of any alternative features that could accomplish the same ultimate goal, either for consideration or considered and rejected.

Comment

Add any other context or screenshots about the feature request here.


ISSUE OWNER SECTION

Assumptions

Requirements

Limitations

Affected Components

DESIGN

GUI Change description

Nothing. Loading code will be written separately and will be the primary interface to use this code.

Code Change Description

General strategy:

Examples:

class TracerLabelManager(model.Manager):
    def create_tracer_label(self, isotope_data: IsotopeData):
        tracer_label = self.create(
            element=isotope_data.element, 
            count=isotope_data.labeled_count,
            positions=isotope_data.labeled_positions,
            mass_number=isotope_data.mass_number)
        tracer_label.full_clean()  # Not sure this belongs here, perhaps should be in loading code
        return tracer_label

class TracerLabel(MaintainedModel, ElementLabel):
    ...
    objects = TracerLabelManager()

    # Add model validation
    def clean(self):
        # Ensure positions match count
        if positions and len(positions) != count:
            raise ValidationError(_(f"Length of labeled positions list ({len(positions)}) must match labeled element count ({count})")
class TracerManager(models.Manager):
    def create_tracer(self, tracer_data: TracerData):
        isotopes = []
        for isotope_data in tracer_data.isotopes:
            isotope = TracerLabel.objects.create_tracer_label(isotope_data)
            isotopes.append(isotope)
        tracer = self.create(
            compound=TracerData.compound_name,
            isotopes=isotopes)
        tracer.full_clean()  # Not sure this belongs here either
        return tracer

class Tracer(MaintainedModel, ElementLabel):
    ...
    objects = TracerManager()

    def clean(self):
        # Ensure isotope elements exist in compound formula
        ...
        # Ensure isotope count doesn't exceed number of elements in formula
        ...
        # Ensure postions exist if count < number of elements in formula
class IfusateManager(models.Manager):
    def create_infusate(self, infusate_data: InfusateData):
        # create tracers
        # create infusate
        # associate tracers with specific conectrations
        infusate.full_clean()
        return infusate

class Infusate(MaintainedModel):
    ...
    objects = InfusateManager()

    def clean(self):
        # Ensure that any existing infusate with same `tracer_group_name` contains the same list of tracers
        ...

Tests

lparsons commented 2 years ago

@jcmatese @fkang-pu @hepcat72 I've updated the Design portion of this issue. Please review the proposed design and either add a 🚀 if you approve or comment if you have questions, suggestions, etc.

jcmatese commented 2 years ago

I think you can't validate much of the Trace/Infusate data without the compound already being in the database , because tests are mostly dependent on the pre-existing/documented compound.formula. Will have to look into the Manager patterns...

lparsons commented 2 years ago

I think you can't validate much of the Trace/Infusate data without the compound already being in the database , because tests are mostly dependent on the pre-existing/documented compound.formula.

Yes, good point. I'll add that to assumptions.

Will have to look into the Manager patterns...

The link to some docs is in the issue: add a method on a custom manager.

hepcat72 commented 2 years ago

I wish there was a review interface for this sort of thing similar to PRs. Until that exists, I will review the design section by pasting the design section below as a quoted block and using strike-through and bold for change suggestions...

ISSUE OWNER SECTION

Assumptions

  • The concentrations can be properly associated with each constituent tracer in an infusate. This might require some changing of the data structure when loading is written in #427.

Q. What's the reason for this assumption? Doesn't the model inherently already make this association in the InfusateTracer table? If that enforces this ability, I would say that we don't need this assumption... Or perhaps I'm missing the context of this assumption.

  • Compound records already exist in the database.

NB. Requirement 1a essentially checks that the compound exists in the database as a requirement, thus Compound records already exist in the database, unless I'm missing something, is not assumed.

Requirements

  • [ ] 1. Create and validate model objects or retrieve existing objects from database and validate the new data matches
    • [ ] a. validate that TracerData["compound_name"] is either a Compound.name or CompoundSynonym, and store the mapping to the database Compound, or None?

Q. What does "or None" mean? Does that mean set Tracer.compound to None? If so, note that Tracer.compound.null is False, so it cannot be None... but I suspect I'm just misunderstanding the wording.

  • [ ] b. If the optional infusate.short_name is already in the database, ensure the mapped tracer compounds equivalent

NB. Infusate.clean() calls Infusate.validate_tracer_group_names which does this, so I don't think 1b is necessary. You could put this as an assumption (or leave it as a requirement [mentioning it should happen automatically] and add a test for it). Note also that the field name is tracer_group_name, not short_name.

  • [ ] 2. Validation code in the Tracer and Infusate models that ensure the labeling and positional information conforms to our rules
    • [ ] a. Counts of labeled elements must not exceed the number of that element in the related compound's formula. For example, the formula for glucose is C6H12O6:
    • Valid: glucose[1,2-13C2]
    • Invalid glucose[13C7] since there are only 6 Carbon atoms in a molecule of `glucose
    • [ ] b. Labeled elements must exist in the related compound's formula For example, the formula for glucose is C6H12O6:
    • Valid: glucose[13C6]
    • Invalid glucose[15N1] since Nitrogen does not exist in glucose
    • [ ] c. Labeled atom positions are required, unless the compound is fully labeled
      • Valid: glucose[13C6] since it's fully labeled
      • Valid: glucose-[1,2-13C2] since both labeled carbons have positions specified
      • Invalid: glucose-[13C5] only five of the six total carbons are labeled, but there is no positional information specified
      • Invalid: glucose-[1,3-13C3] only two positions specified, but three labeled carbons specified

B. While these requirements may be implied, I suggest adding them explicitly:

- [ ] d. The number of positions must equal the count - [ ] e. Position values must be less than or equal to the number of positions in the compound (e.g. The number of total atoms in the compound minus the number of hydrogen atoms)

NB. You might also consider (though this is me being a perfectionist):

- [ ] f. The mass number must be an isotope of the element, e.g. valid: 13C, invalid: 12C - [ ] g. The mass number must be greater than or equal to the atomic number

Limitations

  • This issue does not cover data loading, only model creation and validation from parsed infusate records.

Affected Components

  • change: tracer_label.py
  • change: tracer.py
  • change: infusate.py

DESIGN

GUI Change description

Nothing. Loading code will be written separately and will be the primary interface to use this code.

Code Change Description

General strategy:

  • Add a method to a custom manager for the TracerLabel, Tracer, and Infusate classes that takes the data structures output from the parser to create the objects.
  • Run full_clean() before returning the created objects (not sure this should be done here, but for now it seems to make sense).
  • Add class validation code to the clean methods of the objects.

Examples:

class TracerLabelManager(model.Manager):
  def create_tracer_label(self, isotope_data: IsotopeData):
      tracer_label = self.create(
          element=isotope_data.element, 
          count=isotope_data.labeled_count,
          positions=isotope_data.labeled_positions,
          mass_number=isotope_data.mass_number)
      tracer_label.full_clean()  # Not sure this belongs here, perhaps should be in loading code
      return tracer_label

class TracerLabel(MaintainedModel, ElementLabel):
  ...
  objects = TracerLabelManager()

  # Add model validation
  def clean(self):
      # Ensure positions match count
      if positions and len(positions) != count:
          raise ValidationError(_(f"Length of labeled positions list ({len(positions)}) must match labeled element count ({count})")
class TracerManager(models.Manager):
  def create_tracer(self, tracer_data: TracerData):
      isotopes = []
      for isotope_data in tracer_data.isotopes:
          isotope = TracerLabel.objects.create_tracer_label(isotope_data)
          isotopes.append(isotope)
      tracer = self.create(
          compound=TracerData.compound_name,
          isotopes=isotopes)
      tracer.full_clean()  # Not sure this belongs here either
      return tracer

class Tracer(MaintainedModel, ElementLabel):
  ...
  objects = TracerManager()

  def clean(self):
      # Ensure isotope elements exist in compound formula
      ...
      # Ensure isotope count doesn't exceed number of elements in formula
      ...
      # Ensure postions exist if count < number of elements in formula
class IfusateManager(models.Manager):
  def create_infusate(self, infusate_data: InfusateData):
      # create tracers
      # create infusate
      # associate tracers with specific conectrations
      infusate.full_clean()
      return infusate

class Infusate(MaintainedModel):
  ...
  objects = InfusateManager()

  def clean(self):
      # Ensure that any existing infusate with same `tracer_group_name` contains the same list of tracers
      ...

Tests

  • [ ] 1. Test proper creation of infusate records from valid parsing tests

B. Add: Test exception raised when compound does not exist

NB. Ideally, each test should map 1:1 with the requirement items, which means that requirement 1 should be split into 2 items (validation and creation) and tests 6 & 7 below would be here (higher in the list), but as long as each requirement is tested, NBD...

  • [ ] 2. Test exception raised when positions don't match count

NB. Suggestion:

- [ ] 2. Test exception raised when positions don't match count - [ ] 2. Test exception raised when the size of the positions array does not match count

  • [ ] 3. Test exception raised when element does not exist in compound
  • [ ] 4. Test exception raised when count is greater than number of atoms in formula
  • [ ] 5. est exception raised when positions missing from partially labeled formula
  • [ ] 6. Test exception raised when creating two infusates with the same tracer_group_name, but containing a different list of tracers
  • [ ] 7. Test successful creating of two infusate with the same tracer_group_name, same list of tracers, but different concentrations.

B. Add:

- [ ] 8. Test that positions are not required when compound/element is fully labeled - [ ] 9. Test exception raised when position values are greater than the number of positions in the compound (e.g. The number of total atoms in the compound minus the number of hydrogen atoms)

NB. You might also consider (though this is me being a perfectionist):

- [ ] f. Test exception raised when mass number is not an isotope: 12C - [ ] g. Test that exception raised when mass number is less than atomic number, e.g. 5C is invalid

@lparsons - If these changes are OK with you, I'm fine with making the edits to the issue description myself. Just let me know.

jcmatese commented 2 years ago
  • [ ] a. validate that TracerData["compound_name"] is either a Compound.name or CompoundSynonym, and store the mapping to the database Compound, or None?

Q. What does "or None" mean? Does that mean set Tracer.compound to None? If so, note that Tracer.compound.null is False, so it cannot be None... but I suspect I'm just misunderstanding the wording.

I can speak to the intention of this when it was originally written. I had assume that the "validation" issue would just do that (validate), and not actual create any database records, and not necessarily throw an exception for every single problem it found, but would just flag the issues, and report back an augmented dict. In this mapping case, I just meant that could could flag it as "found, and here is the mapped compound instance", vs. "not found, i.e. None" . Again, I was focused on validation only, not necessarily what to do with the findings (load, report back throws exceptions, etc.). If you think False is an preferred flag to None, that is fine by me.

fkang-pu commented 2 years ago

I feel this should split into two issues (validation and creation, need to add infusate_name_validation.py for validation).

I picture the workflow as 3 steps :

  1. call infusate_name_parser.py to parse the input infusate string (from animal/sample Excel file)
  2. if no error in step 1, then call infusate_name_validation.py to validate InfusateData/TracerData/IsotopeData.
  3. If pass step 2, then call the model manager(s) to create db record(s).

The output of the validation (step 2 ) could be:

  1. perfect match to an existing infusate (sorted tracers names and concentrations, no need to create new db records)
  2. match tracer names but different concentrations, create a new infusate record in db;
  3. match tracer names but short name is used for another set of tracer(s), need to contact the user.
  4. missing one or more tracers, but can construct the tracer name based on input compound name or synonym. Need to contact the user to confirm suggested name for each tracer.
  5. ran into error(s) during validation and need to contact user to get more info.

So the validation needs to compare the parsed data with db records; the output could be pass/fail with a message.

hepcat72 commented 2 years ago
  • [ ] a. validate that TracerData["compound_name"] is either a Compound.name or CompoundSynonym, and store the mapping to the database Compound, or None?

Q. What does "or None" mean? Does that mean set Tracer.compound to None? If so, note that Tracer.compound.null is False, so it cannot be None... but I suspect I'm just misunderstanding the wording.

I can speak to the intention of this when it was originally written. I had assume that the "validation" issue would just do that (validate), and not actual create any database records, and not necessarily throw an exception for every single problem it found, but would just flag the issues, and report back. In this mapping case, I just meant that could could flag it as "found, and here is the mapped compound instance", vs. "not found, i.e. None" . Again, I was focused on validation only, not necessarily what to do with the findings (load, report back throws exceptions, etc.). If you think False is an preferred flag to None, that is fine by me.

I like the idea of collecting the issues and reporting the collection... though my understanding of the design here in this issue is that that would be done by loading code (which this issue states [in the limitations section] it does not do). Thus, the loading code would catch and handle the exceptions implemented in the code resulting from this issue. That does bring up a good point about this implementation however, and that is that the validation code (while called from an overloaded clean method and/or ModelManager) should be encapsulated in a separate Model method. I think that's implied, but given the design intentions of the loader, that should probably be an explicit requirement (that validation can be accomplished outside of calling full_clean, so that errors can be accumulated without having modified the database).

NB. Create a requirement that validation can be done separate from model object creation.

hepcat72 commented 2 years ago

I feel this should split into two issues (validation and creation, need to add infusate_name_validation.py for validation).

I picture the workflow as 3 steps :

  1. call infusate_name_parser.py to parse the input infusate string (from animal/sample Excel file)
  2. if no error in step 1, then call infusate_name_validation.py to validate InfusateData/TracerData/IsotopeData.
  3. If pass step 2, then call the model manager(s) to create db record(s).

The output of the validation (step 2 ) could be:

  1. perfect match to an existing infusate (sorted tracers names and concentrations, no need to create new db records)
  2. match tracer names but different concentrations, create a new infusate record in db;
  3. match tracer names but short name is used for another set of tracer(s), need to contact the user.
  4. missing one or more tracers, but can construct the tracer name based on input compound name or synonym. Need to contact the user to confirm suggested name for each tracer.
  5. ran into error(s) during validation and need to contact user to get more info.

So the validation needs to compare the parsed data with db records; the output could be pass/fail with a message.

Well, this issue is for creating the validation methods in the model classes themselves and having them be called via an overload of clean.