Revise decision record: sorted_sequences

ga4gh / refget

GA4GH Refget specifications docs

14 stars 7 forks source link

https://github.com/ga4gh/seqcol-spec/blob/4722c713a9a5692b255f8bdf9ab27c18ca297042/docs/decision_record.md?plain=1#L19-L21

The above text conveys some opinions about the relative unimportance and utility of this feature.

From my perspective, using this feature is important, as it allows us to design federated applications with a reduced overhead on implementations: if a collection of sequences from which a dataset was generated can be universally identified by a GA4GH digest, based purely on sequence content (and not names, lengths, or order of sequences), then a one-time digest is sufficient for answering the question "do these resources represent variants in the same coordinate space?". I think it is much more likely that two implementations across a network will have the same underlying sequence content, but describe sequences in different orders or with different names (and therefore may not be able to compute /comparison/:digest1/:digest2 if digest2 is an unrecognized permutation of digest1 by that instance). It therefore seems to me to that implementing this feature represents the most obvious case for describing Sequence Collections across a federated service network, instead of a case with limited / fringe utility.

I recognize that as someone looking at this spec from the outside, I may not be grasping the larger importance of the comparison function or how Sequence Collections are intended to be used. And in deference to the work and internal discussions of the SeqCol team I am not suggesting that this feature become required or an exemplar case for Sequence Collections. However, I think it would be more inclusive of potential adopters with a similar view as my own to simply describe this feature and its use in the decision record, and tone down any language that may appear dismissive of this use case.

Ok, I can definitely clarify that rationale better, or tone down the dismissive language. It may actually be useful to try to dissuade people from using this, because, in my opinion, using this is a bad move that is less useful than using the other, better features of sequence collections. So, I think the dismissive language here may be warranted and useful and instructive, but I haven't effectively explained why/the alternative.

To understand your use case, one thing I'm confused about: You're looking to compare coordinate spaces:

"do these resources represent variants in the same coordinate space?".

But then wouldn't you use sorted_name_length_pairs? It's exactly for looking for shared coordinate systems. I guess you're saying you don't have sequence names -- but if there are no sequence names, then you don't have a coordinate space at all, and I don't see how you have variant definitions, either. How can you describe variants without sequence name? The variants have to annotate a certain sequence in the collection. So I'm missing your point here. You'd need sequence names, and sorted_sequences won't help you anyway, right?

A few other thoughts:

You're neglecting the complexity of subsets -- what if one collection has an additional sequence added? Then the digests won't match anyway. I think you're overestimating the probability that people have different collections with the same sequence content, but in different order and with different names and no extra or missing sequences in the collection, which are the requirements you'd need to match on sorted_sequences. On the other hand, all this information is immediately provided by the comparison function.
The comparison function can be calculated even if the service only recognizes one of the digests, as long as you have the canonical representation of the other. If you don't have either representation, then that's where you have the problem. For that situation, we would need 'meta-servers' that broker digest requests to servers. But I think even if you're using sorted_sequences, you're still going to be in this situation a lot of the time, due to the above point.

ga4gh / refget

Revise decision record: sorted_sequences #77