ga4gh / vrs-python

GA4GH Variation Representation Python Implementation
https://github.com/ga4gh/vrs
Apache License 2.0
50 stars 27 forks source link

`ga4gh_identify` adds `digest` properties in-place regardless of `in_place` argument #440

Open jsstevenson opened 1 month ago

jsstevenson commented 1 month ago

I have a basic allele. If I manually add the VA ID, the compacted JSON dump is 323 chars:

>>> allele = Allele(location={"end": 44908822, "start": 44908821, "sequenceReference": {"id": "NC_0000019.10", "type": "SequenceReference", "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"}, "type": "SequenceLocation"}, state={"sequence": "T", "type":  "LiteralSequenceExpression"}, type="Allele")
>>> allele.id = "ga4gh:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt"
>>> len(allele.model_dump_json(exclude_none=True))
323

If I use ga4gh_identify, though, even if I set in_place to "never", digest fields are added that makes the final JSON over 400 characters:

>>> allele = Allele(location={"end": 44908822, "start": 44908821, "sequenceReference": {"id": "NC_0000019.10", "type": "SequenceReference", "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"}, "type": "SequenceLocation"}, state={"sequence": "T", "type":  "LiteralSequenceExpression"}, type="Allele")
>>> allele.id = ga4gh_identify(allele, "never")
>>> len(allele.model_dump_json(exclude_none=True))
411
>>> print(allele.model_dump_json(exclude_none=True, indent=2))
{
  "id": "ga4gh:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt",
  "type": "Allele",
  "digest": "0AePZIWZUNsUlQTamyLrjm2HWUw2opLt",
  "location": {
    "type": "SequenceLocation",
    "digest": "wIlaGykfwHIpPY2Fcxtbx4TINbbODFVz",
    "sequenceReference": {
      "id": "NC_0000019.10",
      "type": "SequenceReference",
      "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"
    },
    "start": 44908821,
    "end": 44908822
  },
  "state": {
    "type": "LiteralSequenceExpression",
    "sequence": "T"
  }
}

This isn't the end of the world, but an increase in size of ~25% ends up being pretty hefty in extreme cases. is it intentional/necessary for ga4gh_identify to add these digest fields in-place, or can that be taken out? Or alternatively, maybe a ga4gh_compact method could strip unset/redundant fields?

korikuzma commented 1 month ago

Since digest is optional, I think it's a bug. I think in_place should apply to both id and digest fields. @ga4gh/vrs-python-maintainers thoughts?