ga4gh / refget

GA4GH Refget specifications docs
https://ga4gh.github.io/refget
14 stars 7 forks source link

Revise decision record: sorted_sequences #77

Open ahwagner opened 3 months ago

ahwagner commented 3 months ago

https://github.com/ga4gh/seqcol-spec/blob/4722c713a9a5692b255f8bdf9ab27c18ca297042/docs/decision_record.md?plain=1#L19-L21

The above text conveys some opinions about the relative unimportance and utility of this feature.

From my perspective, using this feature is important, as it allows us to design federated applications with a reduced overhead on implementations: if a collection of sequences from which a dataset was generated can be universally identified by a GA4GH digest, based purely on sequence content (and not names, lengths, or order of sequences), then a one-time digest is sufficient for answering the question "do these resources represent variants in the same coordinate space?". I think it is much more likely that two implementations across a network will have the same underlying sequence content, but describe sequences in different orders or with different names (and therefore may not be able to compute /comparison/:digest1/:digest2 if digest2 is an unrecognized permutation of digest1 by that instance). It therefore seems to me to that implementing this feature represents the most obvious case for describing Sequence Collections across a federated service network, instead of a case with limited / fringe utility.

I recognize that as someone looking at this spec from the outside, I may not be grasping the larger importance of the comparison function or how Sequence Collections are intended to be used. And in deference to the work and internal discussions of the SeqCol team I am not suggesting that this feature become required or an exemplar case for Sequence Collections. However, I think it would be more inclusive of potential adopters with a similar view as my own to simply describe this feature and its use in the decision record, and tone down any language that may appear dismissive of this use case.

nsheff commented 3 months ago

Ok, I can definitely clarify that rationale better, or tone down the dismissive language. It may actually be useful to try to dissuade people from using this, because, in my opinion, using this is a bad move that is less useful than using the other, better features of sequence collections. So, I think the dismissive language here may be warranted and useful and instructive, but I haven't effectively explained why/the alternative.

To understand your use case, one thing I'm confused about: You're looking to compare coordinate spaces:

"do these resources represent variants in the same coordinate space?".

But then wouldn't you use sorted_name_length_pairs? It's exactly for looking for shared coordinate systems. I guess you're saying you don't have sequence names -- but if there are no sequence names, then you don't have a coordinate space at all, and I don't see how you have variant definitions, either. How can you describe variants without sequence name? The variants have to annotate a certain sequence in the collection. So I'm missing your point here. You'd need sequence names, and sorted_sequences won't help you anyway, right?

A few other thoughts:

nsheff commented 3 months ago

I toned down the dismissive language in the ADR :)