Closed ahwagner closed 1 week ago
Or, if I may be so bold: the opposite way around?
What I like about inherent
is that is fits very well in conversation. "Should xyz be inherent?" just feels better to say than "should xyz be in keys?"
We read it as "should xyz be in GA4GH Digest keys?", and "what is the GA4GH Digest prefix?".
What is useful for our purposes is tying together all of the properties associated with the GA4GH digest into one object (see above examples). That said, if keys
as a variable name is unpalatable, maybe there are other aligned solutions we could target, e.g. ga4ghDigest.inherent
. Would that be preferable? Or perhaps ga4ghDigest.inherentProperties
, ga4ghDigest.keyProperties
, or something else?
Standardization is obviously what GA4GH is all about, so that is way more important than my personal preferences. So one way or another, I think your suggestion is a good one!
But just to add another argument for inherent
(or a variant of that): The word indicates something essential to think about when one considers whether a property should be part of the list. Since a digest is typically used to identify an object, one should consider whether the property in question is an integral part of the identity or definition of said objects, or to put it in another way, is an inherent
part of the object.
I am in any case only speaking for myself. Let's hear what others have to say!
@andrewyatz this is a key GKS / LSG alignment point that would be good to have you weigh in on; commenting here to bump this in your inbox.
Sorry we did speak about this at the latest meeting for refget. At first I didn't quite get the concept being described here but now I can see it. Also I didn't twig it was vrs 2.0.
I'm personally on the side for more alignment here since the two emerging products are both conveying the same concept. I suspect my comment will be similar to what @sveinugu has said here that alignment is important. I think perhaps there is one difference though, which on the whole might not be important and it does also build on @sveinugu that the inherent property tries to flag the inherent attributes. I guess we're reusing this concept to reflect keys. One is talking about inherent identity the other about keys for a hash.
So overall I'll put my hat on the side of alignment in the hope we can align even though semantically they're not 100% the same. They will be confusing for others. Unless we duplicate. That sounds worse
I think that using inherent
instead of keys
as an attribute name is fine; the idea is the same, the keys are selected due to their use as inherent attributes for creating the digest. Reiterating my earlier comment, what I would really like to see here is the inherent attribute list nested under a ga4ghDigest
(or even just ga4gh
) keyword in JSON Schema.
So, I would propose that both specifications use something like:
ga4gh.inherent
to indicate inherent properties of a data class, and ga4gh.prefix
for a ga4gh identifier, if applicable.
So in JSON Schema a class definition may look something like:
"$defs": {
"MyDataClass": {
"ga4gh":{
"inherent":["inherentPropertyA", "inherentPropertyB"]
},
"type": "object",
"properties": {
"inherentPropertyA": { ... },
"inherentPropertyB": { ... },
"otherPropertyC": { ... }
}
}
}
Would this work for Seq Col?
@andrewyatz @nsheff @sveinugu @tcezard our next VRS 2.0 PRC meeting is on 10/31. One of our points of feedback was to document the ga4gh keys attribute (ga4gh/vrs#569); we would like to align with you before presenting this to the PRC. Would it be possible to get a decision on the above proposal before then so we may update our documentation accordingly?
Really the only change we would make, then would be:
description: "A collection of biological sequences."
type: object
properties:
lengths:
type: array
collated: true
description: "Number of elements, such as nucleotides or amino acids, in each sequence."
items:
type: integer
names:
type: array
collated: true
description: "Human-readable labels of each sequence (chromosome names)."
items:
type: string
sequences:
type: array
collated: true
items:
type: string
description: "Refget sequences v2 identifiers for sequences."
accessions:
type: array
collated: true
items:
type: string
description: "Unique external accessions for the sequences"
required:
- names
- lengths
ga4gh:
inherent:
- lengths
- names
- sequences
I can work with this, but I don't like how inherent
is no longer parallel to required
.
At the end of the day, we can make it work however, but what is the rationale for the added complexity? I guess to collect all the related terms?
Assuming we add transient
and passthru
modifiers as well (#86), I suppose it makes sense for these to go under the same ga4gh
term as well, like so:
description: "A collection of biological sequences."
type: object
properties:
lengths:
type: array
collated: true
description: "Number of elements, such as nucleotides or amino acids, in each sequence."
items:
type: integer
names:
type: array
collated: true
description: "Human-readable labels of each sequence (chromosome names)."
items:
type: string
sequences:
type: array
collated: true
items:
type: string
description: "Refget sequences v2 identifiers for sequences."
accessions:
type: array
collated: true
items:
type: string
description: "Unique external accessions for the sequences"
...
required:
- names
- lengths
ga4gh:
inherent:
- lengths
- names
- sequences
transient:
- sorted_name_length_pairs
- sorted_sequences
passthru:
- alias
- author
Edit: fix yaml typo to correctly specify ga4gh section is an object, not an array.
What about just namespacing the qualifier names, e.g. ga4gh_inherent
? Then we can do that with all non-JSON Schema qualifiers, regardless of their position. We have e.g. a collated
qualifier that resides under each property, and adding an extra 'ga4gh' there feels cumbersome.
The rationale for the ga4gh:
as a containing term would be that this indicates a shared set of ga4gh extension conventions for representing, e.g., inherent properties, digest prefixes, transient properties. I imagine that these shared conventions are then registered with a GA4GH cross-product alignment body like TASC to help with future keyword alignment.
While I think this would also be achievable using a ga4gh_
prefix ahead of each term, I prefer the grouping under a shared domain, rather than by name prefix. This pattern is a little closer to how other keywords in JSON schema are used; function is determined by keyword data structures, as opposed to keyword substrings.
I wouldn't do it with the local modifiers. But I'm fine with grouping them under ga4gh
for the global modifiers, and in fact I prefer that to prefixing them all with ga4gh_
.
Yeah, not sure I thought it was a good idea myself, just wanted to consider it. I think your suggestion looks good with the global qualifiers. It feels slightly off that local qualifiers are exempt, but that is not a good argument to not organise the global qualifiers in this way.
Alright. Just to be very clear here, there is a small difference between what I proposed in this comment and what followed. I am proposing that ga4gh
is a JSON Object, with TASC-registered properties including inherent
. The subsequent examples use ga4gh
as a JSON Array, but I am not sure if that was an intentional difference or just a typo.
@nsheff can you comment? Was ga4gh
as an Array intended, or useful in some way over an Object structure?
Was ga4gh as an Array intended, or useful in some way over an Object structure?
No, that's just a typo. The way I wrote that is actually not correct yaml, and I'll edit that now :)
Okay. It sounds like there is sufficient consensus to move forward with the following syntax at the top-level schema definition:
ga4gh:
inherent:
- inherent_property_1
- inherent_property_2
If there is any disagreement with this syntax specifically for the inherent
property in SeqCol, please comment on this issue. I will otherwise leave this issue open to track implementation of this in SeqCol spec.
We will plan on implementing this for our GKS products and will reference this issue with TASC to implement the registry of ga4gh-specific JSON Schema keywords. I also think it would be worth considering registration of datatype-specific ga4gh keywords such as "collated" (SeqCol arrays) and "ordered" (GKS arrays); I will reference there for further discussion. Any thoughts on other terms (outside of ga4gh.inherent
) should be discussed in the context of the TASC issue.
Agreed.
Ok this is approved and I've added it to the decision record.
The VRS 2.0 specification uses a property (
ga4ghDigest.keys
) to indicate keys that are used in creating VRS computed identifiers. This has the same functionality as the SeqColinherent
attribute.This has been implemented across 16 VRS classes. The
ga4ghDigest
object is also used for other digest-related keywords, includingga4ghDigest.prefix
for VRS objects that are identifiable.Here are some examples:
This is an opportunity to align these terms before either standard is finalized. Is it feasible to reuse the VRS
ga4ghDigest
structure for Sequence Collections in lieu ofinherent
?