ga4gh / refget

GA4GH Refget specifications docs
https://ga4gh.github.io/refget

Add sorted_sequences as recommended non-inherent attribute #71

Closed nsheff closed 4 months ago

nsheff commented 7 months ago

Some feedback from the PRC was that we could think about another RECOMMENDED non-inherent attribute to live alongside sorted_name_length_pairs, that would be a digest for the sequences that does not respect order. So, something like: sorted_sequences.

This digest would allow you to easily assess order-invariant equivalence of sequences without having to use the comparison function, which would be useful for some use cases.

nsheff commented 7 months ago

Here's some proposed text to add to the spec:


3.3 The sorted_sequences attribute (RECOMMENDED)

The sorted_sequences attribute is a non-inherent attribute of a sequence collection, with a formal definition. We RECOMMEND all implementations provide this attribute. When digested, this attribute provides a digest representing an order-invariant set of unnamed sequences. It provides a way to compare two sequence collections to see if their sequence content is identical, just in a different order. Such a comparison can, of course, be made by the comparison function, so why do we recommend this attribute be included as well? Simply because, for some large-scale use cases, comparing the sequence content without considering order is something that needs to be done many times. In these cases, using the comparison function could be computationally prohibitive. This digest allows the comparison to be pre-computed and then made with a simple digest equality check.

Algorithm:

  1. Take the sequences attribute and canonicalize the JSON (using RFC-8785).
  2. Sort the resulting digests lexicographically.
  3. Add to the sequence collection object as the sorted_sequences attribute, non-inherent and non-collated.
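
For illustration only, here is a minimal Python sketch of how such a digest could be computed. It is not the reference implementation, and it rests on several assumptions: the sequences attribute is a plain array of ASCII digest strings; the array is sorted first and then serialized, so that the serialized form reflects the sorted order; `json.dumps` with compact separators stands in for a full RFC-8785 canonicalizer, which is only adequate for simple arrays of ASCII strings; and the final digest is the GA4GH sha512t24u digest (SHA-512, truncated to 24 bytes, base64url-encoded). The function names are hypothetical.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to the first 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def canonical_json(value) -> bytes:
    # Assumption: a stand-in for RFC-8785 canonicalization. Compact
    # separators give the same bytes for arrays of plain ASCII strings,
    # but this is NOT a general RFC-8785 implementation.
    return json.dumps(value, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


def sorted_sequences_digest(sequences: list) -> str:
    """Order-invariant digest over an array of sequence digests."""
    # sorted() orders str values by code point, which is a
    # lexicographic order for ASCII digest strings.
    return sha512t24u(canonical_json(sorted(sequences)))


# Two collections with the same sequences in a different order
# (placeholder digest strings, not real sequence digests).
a = ["SQ.bbb", "SQ.aaa", "SQ.ccc"]
b = ["SQ.ccc", "SQ.bbb", "SQ.aaa"]
same_content = sorted_sequences_digest(a) == sorted_sequences_digest(b)
```

With this sketch, `same_content` is true for `a` and `b` because both arrays sort to the same list before digesting, so the check reduces to a single string comparison per pair of collections.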

nsheff commented 4 months ago

What was the decision on this? Add to the spec?

nsheff commented 4 months ago

Our decision on this was to make this an OPTIONAL attribute and, for now, to include it in the spec.

In the future if the number of proposed ancillary attributes grows, it could move to a separate document together with other ideas for ancillary attributes.

nsheff commented 4 months ago

ADR added, added to spec.