ga4gh / refget

GA4GH Refget specifications docs
https://ga4gh.github.io/refget
14 stars 7 forks source link

Alignment: `inherent` property #84

Closed ahwagner closed 1 week ago

ahwagner commented 3 months ago

The VRS 2.0 specification uses a property (ga4ghDigest.keys) to indicate keys that are used in creating VRS computed identifiers. This has the same functionality as the SeqCol inherent attribute.

This has been implemented across 16 VRS classes. The ga4ghDigest object is also used for other digest-related keywords, including ga4ghDigest.prefix for VRS objects that are identifiable.

Here are some examples:

This is an opportunity to align these terms before either standard is finalized. Is it feasible to reuse the VRS ga4ghDigest structure for Sequence Collections in lieu of inherent?

sveinugu commented 3 months ago

Or, if I may be so bold: the opposite way around?

What I like about inherent is that is fits very well in conversation. "Should xyz be inherent?" just feels better to say than "should xyz be in keys?"

ahwagner commented 3 months ago

We read it as "should xyz be in GA4GH Digest keys?", and "what is the GA4GH Digest prefix?".

What is useful for our purposes is tying together all of the properties associated with the GA4GH digest into one object (see above examples). That said, if keys as a variable name is unpalatable, maybe there are other aligned solutions we could target, e.g. ga4ghDigest.inherent. Would that be preferable? Or perhaps ga4ghDigest.inherentProperties, ga4ghDigest.keyProperties, or something else?

sveinugu commented 3 months ago

Standardization is obviously what GA4GH is all about, so that is way more important than my personal preferences. So one way or another, I think your suggestion is a good one!

But just to add another argument for inherent (or a variant of that): The word indicates something essential to think about when one considers whether a property should be part of the list. Since a digest is typically used to identify an object, one should consider whether the property in question is an integral part of the identity or definition of said objects, or to put it in another way, is an inherent part of the object.

I am in any case only speaking for myself. Let's hear what others have to say!

ahwagner commented 2 months ago

@andrewyatz this is a key GKS / LSG alignment point that would be good to have you weigh in on; commenting here to bump this in your inbox.

andrewyatz commented 2 months ago

Sorry we did speak about this at the latest meeting for refget. At first I didn't quite get the concept being described here but now I can see it. Also I didn't twig it was vrs 2.0.

I'm personally on the side for more alignment here since the two emerging products are both conveying the same concept. I suspect my comment will be similar to what @sveinugu has said here that alignment is important. I think perhaps there is one difference though, which on the whole might not be important and it does also build on @sveinugu that the inherent property tries to flag the inherent attributes. I guess we're reusing this concept to reflect keys. One is talking about inherent identity the other about keys for a hash.

So overall I'll put my hat on the side of alignment in the hope we can align even though semantically they're not 100% the same. They will be confusing for others. Unless we duplicate. That sounds worse

ahwagner commented 1 month ago

I think that using inherent instead of keys as an attribute name is fine; the idea is the same, the keys are selected due to their use as inherent attributes for creating the digest. Reiterating my earlier comment, what I would really like to see here is the inherent attribute list nested under a ga4ghDigest (or even just ga4gh) keyword in JSON Schema.

So, I would propose that both specifications use something like: ga4gh.inherent to indicate inherent properties of a data class, and ga4gh.prefix for a ga4gh identifier, if applicable.

So in JSON Schema a class definition may look something like:

"$defs": {
  "MyDataClass": {
    "ga4gh":{
      "inherent":["inherentPropertyA", "inherentPropertyB"]
    },
    "type": "object",
    "properties": {
      "inherentPropertyA": { ... },
      "inherentPropertyB": { ... },
      "otherPropertyC": { ... }
    }
  }
}

Would this work for Seq Col?

ahwagner commented 1 month ago

@andrewyatz @nsheff @sveinugu @tcezard our next VRS 2.0 PRC meeting is on 10/31. One of our points of feedback was to document the ga4gh keys attribute (ga4gh/vrs#569); we would like to align with you before presenting this to the PRC. Would it be possible to get a decision on the above proposal before then so we may update our documentation accordingly?

nsheff commented 1 month ago

Really the only change we would make, then would be:

description: "A collection of biological sequences."
type: object
properties:
  lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    collated: true
    description: "Human-readable labels of each sequence (chromosome names)."
    items:
      type: string
  sequences:
    type: array
    collated: true
    items:
      type: string
      description: "Refget sequences v2 identifiers for sequences."
  accessions:
    type: array
    collated: true
    items:
      type: string
      description: "Unique external accessions for the sequences"
required:
  - names
  - lengths
ga4gh:
  inherent:
    - lengths
    - names
    - sequences

I can work with this, but I don't like how inherent is no longer parallel to required.

At the end of the day, we can make it work however, but what is the rationale for the added complexity? I guess to collect all the related terms?

Assuming we add transient and passthru modifiers as well (#86), I suppose it makes sense for these to go under the same ga4gh term as well, like so:

description: "A collection of biological sequences."
type: object
properties:
  lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    collated: true
    description: "Human-readable labels of each sequence (chromosome names)."
    items:
      type: string
  sequences:
    type: array
    collated: true
    items:
      type: string
      description: "Refget sequences v2 identifiers for sequences."
  accessions:
    type: array
    collated: true
    items:
      type: string
      description: "Unique external accessions for the sequences"
...
required:
  - names
  - lengths
ga4gh:
  inherent:
    - lengths
    - names
    - sequences
  transient:
    - sorted_name_length_pairs
    - sorted_sequences
  passthru:
    - alias
    - author

Edit: fix yaml typo to correctly specify ga4gh section is an object, not an array.

sveinugu commented 1 month ago

What about just namespacing the qualifier names, e.g. ga4gh_inherent? Then we can do that with all non-JSON Schema qualifiers, regardless of their position. We have e.g. a collated qualifier that resides under each property, and adding an extra 'ga4gh' there feels cumbersome.

ahwagner commented 1 month ago

The rationale for the ga4gh: as a containing term would be that this indicates a shared set of ga4gh extension conventions for representing, e.g., inherent properties, digest prefixes, transient properties. I imagine that these shared conventions are then registered with a GA4GH cross-product alignment body like TASC to help with future keyword alignment.

While I think this would also be achievable using a ga4gh_ prefix ahead of each term, I prefer the grouping under a shared domain, rather than by name prefix. This pattern is a little closer to how other keywords in JSON schema are used; function is determined by keyword data structures, as opposed to keyword substrings.

nsheff commented 1 month ago

I wouldn't do it with the local modifiers. But I'm fine with grouping them under ga4gh for the global modifiers, and in fact I prefer that to prefixing them all with ga4gh_.

sveinugu commented 1 month ago

Yeah, not sure I thought it was a good idea myself, just wanted to consider it. I think your suggestion looks good with the global qualifiers. It feels slightly off that local qualifiers are exempt, but that is not a good argument to not organise the global qualifiers in this way.

ahwagner commented 2 weeks ago

Alright. Just to be very clear here, there is a small difference between what I proposed in this comment and what followed. I am proposing that ga4gh is a JSON Object, with TASC-registered properties including inherent. The subsequent examples use ga4gh as a JSON Array, but I am not sure if that was an intentional difference or just a typo.

@nsheff can you comment? Was ga4gh as an Array intended, or useful in some way over an Object structure?

nsheff commented 2 weeks ago

Was ga4gh as an Array intended, or useful in some way over an Object structure?

No, that's just a typo. The way I wrote that is actually not correct yaml, and I'll edit that now :)

ahwagner commented 2 weeks ago

Okay. It sounds like there is sufficient consensus to move forward with the following syntax at the top-level schema definition:

ga4gh:
  inherent:
    - inherent_property_1
    - inherent_property_2

If there is any disagreement with this syntax specifically for the inherent property in SeqCol, please comment on this issue. I will otherwise leave this issue open to track implementation of this in SeqCol spec.

We will plan on implementing this for our GKS products and will reference this issue with TASC to implement the registry of ga4gh-specific JSON Schema keywords. I also think it would be worth considering registration of datatype-specific ga4gh keywords such as "collated" (SeqCol arrays) and "ordered" (GKS arrays); I will reference there for further discussion. Any thoughts on other terms (outside of ga4gh.inherent) should be discussed in the context of the TASC issue.

nsheff commented 1 week ago

Agreed.

nsheff commented 1 week ago

Ok this is approved and I've added it to the decision record.