Should lengths and names be required properties in every sequence collection ?

tcezard commented 5 months ago

I wanted to summarise one of the outcome of today's call and clarify a comment made in the PRC feedback document:

I suggest that the specification would mandate that at least one 'required' attribute is used but would not define which one it is. Over time, Refget Collections specification would create recommendations for the use of 'required' fields for different domains e.g. for genome archives this might be to have TWO different Refget Collections digests: 'names, lengths, sequences' and 'accessions, lengths, sequences'.

In ADR from 2023-07-12, we decided that the only mandatory properties in a sequence collections would be lengths and names. The argument made today was that by requiring lengths and names to be present, we're potentially forcing these attributes in use cases where they are not relevant or in some case not available. The example given was that of a CRAM file that contains a digest for each sequence but does not contains the length.

The argument in favour of having required field is one of interoperability. Guaranteeing the presence of the two fields helps making different services compatible by always having common grounds.

Reading back the ADR from 2023-07-12, the rational does not feel about how lengths and names should be made mandatory but how sequences should not be made mandatory because it would have prevented the use case of coordinate space to be implemented. I think similar argument can be made about other use-cases we might not have envisioned.

@raskoleinonen, please correct me if I misrepresented your point

@nsheff @sveinugu @andrewyatz please chime in as any change would have to be made relatively soon.

nsheff commented 5 months ago

or in some case not available.

we've discussed this at length in the past; names can always just become 1-N integers, which is basically what you'd do in a computer if you didn't have names. Similarly, lengths can be easily computed from sequences. And if you don't have sequences and also no lengths, then you don't have a sequence collection.

So, there is no situation where names/lengths cannot be computed.

Keep in mind: we're not forcing anyone to use this structure for use case X -- this is just what you would make to compute a digest. In the case of a CRAM file where you have a digest for each sequence, and you want to compute a seqcol digest you could:

use names 1:N
compute lengths of sequences
construct canonical seqcol form
compute digest

So, it is possible. The thing is, you probably wouldn't want to do this, because you'd really want to just use a collection that had an identifier computed with the names from some provider -- then you'd just ignore them. Making names option does not change that you still wouldn't want to do this.

I recall weeks of discussion on this topic where we eventually concluded It is a bigger problem to make names optional.

I think the above, to me, reinforces that it makes a lot of sense to have names be mandatory -- because it also guides the user away from doing something that wouldn't be a great decision from an interoperability perspective.

sveinugu commented 5 months ago

@raskoleinonen I do think it would be useful to get a bit more details on the use case. The way I understood it, the context makes computation of the lengths difficult. Could you explain exactly why? Perhaps one way of doing this could be to look up the level-1 digest of the sequences in some seqcol service to receive the corresponding length array (and possibly names), of course given a point in the future when seqcol is more widely adopted.

@nsheff I have never really liked the idea of just naming the sequences with its index. I don't understand why one would want to do that, as it basically reduces interoperability. You can always just add an index-based names array on the user's side if there is a need for this. There will be no extra information content in the names array, and including it just makes the overall digest less interoperable. One could argue that there is information in the fact that there are no names in the collection that point to anything externally, but this fact could in any case be represented by a single binary value, where the most natural choice to me would be whether the names array is present or not. The way I remember it, agreeing with this decision was to me a compromise, personally a "choose your battles" decision. My intuition has always told me that if you do not have any names, you should not invent any.

The lengths array is to me very different. While it can be always be computed from the sequences array, this is often cumbersome. Basically, I think the lengths array represents a broadly useful summary variable of the "level 3" contents of the sequence collection, i.e. the refget-expanded sequences. Making lengths required was to me a way to give a certain weight to the end-user experience. Here, my intuition tells me that being able to depend on the lengths to be always present can be very useful downstream. In many cases, the lengths themselves can be used to map between different patches of the same reference genome, at least for the primary sequences. Knowing at a glance the size of the sequences of the genome would be very useful in many contexts and I honestly am surprised that this is not already a part of the CRAM headers, as it is in e.g. GFF3. I would think having this available in a CRAM file would be great for e.g. quality control.

So to me, requiring the lengths array to always be present is a way to push some of the cumbersomeness from the end users to the data providers. Hence, it is not really a counter-argument that this is cumbersome. In my mind, it is meant to be a bit cumbersome. This is in my mind a feature of seqcol.

raskoleinonen commented 5 months ago

I suggest that there is two levels of the specification: 'core' and 'community'. The 'core' level describes the scope and format of the sequence collections, the terminology, how digests are computed, and has a list of standard fields with definitions. The decision of which fields are required and which are inherent would be in the 'community' part of the specification.

This would allow the 'core' to be approved without discussing which fields are required and/or inherent in different contexts. My intuition tells me that having any standard fields in the 'core' as required and/or inherent is a significant distraction from approving the 'core' as a GA4GH standard. For example, my opinion is that lengths is neither inherent nor required. I suggest that opinions about field usage should not have a impact on the 'core' specification.

sveinugu commented 5 months ago

@raskoleinonen I believe the overall idea in your suggestion is very much in line with the consensus of the group. I think the main point where we differ is that we have defined the required and inherent properties of the main three fields as part of the 'core' specification, as the choices here have consequences for core interoperability aspects. I don't believe leaving the specification of which of the other fields are required and inherent to a 'community' part is controversial; in fact, this is exactly what we have been thinking. So the question here is not really about the core/community split as such (however the way this is organised/presented can of course be improved), but whether the core specification shall enforce inherent/required properties of names and lengths, which we have been debating at length and ended up with an ADR on, a decision which this issue is now reopening.

sveinugu commented 5 months ago

I can also add that the discussion has mostly been about the names array. As far as I can remember, there have been no real disagreements on the lengths array and I follow @sheff in that without the restrictions on the lengths array, the seqcol specification turns into something very different than what we have have spent a lot of effort into creating. Thinking more about this, not requiring lengths destroys much of the point of the specification for me, as it is very much about harmonising the coordinate system-oriented community with the sequence-oriented community, and the lengths array is all about that.

EDIT: Even though I am not fully happy with requiring the names array, I agree with @nsheff that if having names mandatory is what is needed to also keep the lengths array mandatory, then we made the right decision.

nsheff commented 3 months ago

Just another thought here:

If you have neither names nor lengths, and you just have sequences, then what you want is just the level 1 sequences array digest.

That's going to be part of the main digests.

I guess the question is this: do we want to make it so that you can define a top-level digest with that same content? (just the sequences array?)

If we do this, then there will be 2 digests that refer to that : the top-level and the level 1 digest, since that's basically a sequence collection with only a single attribute (sequences).

The problem here is just that now, nothing is required. So an implementation can't even rely on the fact that there are unique identifiers (names) for the sequences in the collection.

I guess I still argue that for this use case, users should use level 1 digests. I guess there's a pretty major downside to this, though, and that is that these would not be compatible with the APIs in the same way as top-level digests.

The major advantage of doing it this way, though, is that those digests will perfectly match the sequence digests of other collections that do have names/lengths, which is nice.

But I'm open to more discussion on this topic.

nsheff commented 3 months ago

Also: remember, for a particular use case, someone could just say "I'm not using the canonical minimal schema. I'm using my own schema, and in my schema, names and lengths are not required". That's fine. You'd just lose interoperability with other implementations that do require them, that's the cost.

nsheff commented 2 months ago

In recent discussions, we've moved a bit away from even having a "core schema" -- so now the schema is depicted as an example.

I think this should probably solve the issue for all parties, because you can still be "seqcol compatible" even if using a completely different schema -- which means you can define your own set of required/inherent attributes.

This comes at the cost of interoperability, though -- but that seems to be what is wanted. So.

andrewyatz commented 2 months ago

Having re-read this, I believe @nsheff I agree that moving away from a "core schema" is the right thing to do. I would suggest the schema isn't positioned as an example but an exemplar and recommend maintaining the semantic meanings of attributes if someone does diverge from it

sveinugu commented 2 months ago

My fear is that we are throwing baby out with the bath water by removing the concept of the core schema. This is basically what we have been spending years to discuss and have landed on and is an important outcome of that work. To me, the core schema IS the main core contribution of the standard, it is the thing that might fix the issues we have been facing for 20 years and we should not take it lightly to take that out of the standard.

andrewyatz commented 2 months ago

I certainly understand that aspect. However there is broader utility in the seqcol spec beyond sequences where you have a central immutable concept that can be represented as a hash (sequences, container images, annotation releases). It feels like there has to be a way we can describe the core spec whilst not avoiding the major improvement of the schema.

Also just something I've noted. The BAM spec states that @SQ:SN and @SQ:LN headers are both required attributes. CRAM's spec suggests CRAM header blocks are the SAM header with the addition that @SQ:MD5 is required. The original ticket included a line about CRAM files which lack sequence lengths but I am not sure how that is possible.

nsheff commented 2 months ago

I agree with you @sveinugu -- but basically it seems like people don't want interoperability, they want to just solve the one use case they have in the simplest way possible, which by necessity eliminates the ability to use the same thing in other contexts. In the end, people who value the interoperability can still agree upon a shared schema as we have done, and I'm hopeful that if enough momentum builds around that shared schema it could create a community resource where it's worth the extra effort by people outside that central community to buy into the interoperable network built on that core schema... but at the beginning, lacking such a network existing, it's a harder sell. And even if we do build up such a community around an interoperable schema, really, it doesn't really prevent the standard from being useful in a silo, in addition to that .

nsheff commented 3 weeks ago

with the recent changes including the attribute endpoint, this changes things for this issue of core schema.

@sveinugu what's your position now on what should be required/inherent in the core schema?

I guess it seems that taking lengths to no longer required/inherent makes more sense now.

ahwagner commented 1 week ago

Like @nsheff, I agree with @sveinugu that a core schema does seem like the central property for the work and would be a shame to drop. I also like @andrewyatz's idea of an exemplar, it seems like a good approach that doesn't exclude other use cases down the road. More on this later.

Just adding my $0.02 here, but I wonder if the core issue that is being exposed is that there exists exactly one top-level digest property, that inherently supports one view of how collections should be digested. And we are seeing that various participants in this community are exposing different ideas about what attributes are relevant to sequence collections in different contexts. A similar way of thinking might be what motivated the PRC comments @tcezard brought up at the top of this issue.

While addressing this by adding the attribute provides a path for other implementations, it still delegates other views of what sequence collections should be to an ancillary component in the spec. This might give the perception that broad interoperability across these other use cases is an ancillary objective.

A few ways I think this could be addressed are:

Taking the suggestion of the PRC to not specify which properties are required
Providing a mechanism for creating multiple top-level digests that reflect different community interests (and reconsidering type prefixes to allow these to be distinguished from one another)
Creating a core specification and one or more constrained specifications on that core which reflect different community use cases

ga4gh / refget

Should lengths and names be required properties in every sequence collection ? #72