ga4gh / refget

GA4GH Refget specifications docs

https://ga4gh.github.io/refget

14 stars 7 forks source link

What information is included within the string-to-digest? #8

Closed nsheff closed 6 months ago

nsheff commented 3 years ago

We need to confirm the final contents of the string to digest. Right now, I think we had settled on these elements:

name (e.g. chr1)
length
topology
sequence digest (refget digest)

constraints:

Required: name, length, topology (sequence digest may be left blank)
topology must be one of: 'linear' or 'circular'
length must be an integer
name must be a string, subject to the constraints determined in issue #2

Related to this, I initially described this with a JSON-schema that I use in my implementation, but this is a bit more complicated because it encodes the recursion (the fact that the sequence digest element is itself a top-level refgettable element., But, for reference if anyone's interested, it's here:

https://github.com/refgenie/seqcol/blob/master/seqcol/schemas/AnnotatedSequenceList.yaml

Open for discussion.

nsheff commented 3 years ago

One comment that came up in our discussion (thanks @jmarshall) is that perhaps we should relax the constraint of requirement for name, length, and topology.

I originally had everything required, and then dropped 'sequence digest' from required because I had a use case where it would be useful, so I guess the others are still listed as required simply because that's how I started. But now as I think about it more, I think this is right; there are potentially useful reasons to make a "sequence collection" that has nothing but sequence names, or a set of sequences without names, or something like that, and then you'd be able to use just some subset of seqcol functionality that only requires some aspects of the annotation.

So, I think it's right that going forward, there's no reason we can't just make everything optional.

tcezard commented 3 years ago

During the meeting I was questioning the reasoning behind including the topology in the digest. Basic argument is that regardless of the topology stated in the collection the representation of the sequence is still linear and the topology is an attribute of biological sequence that could be held in the metadata rather than the digest.

@rishidev pointed to minutes from last year (2019) meeting which discuss this issue. One point seems to be that the adding the topology in the digest would enable the "recording" that a sequence was analysed as a circular sequence and so it should be recorded in the sequence collection associated. To me this seem like an analysis metadata though

@yfarjoun @andrewyatz Can you recollect the rational behind adding topology in the digest?

tcezard commented 3 years ago

So, I think it's right that going forward, there's no reason we can't just make everything optional.

A side question is: when a field is missing (or empty) does it have to be empty in all the elements of the collection? For example: can I have a collection where the sequence digest is missing in one or a few of the elements?

yfarjoun commented 3 years ago

The issue was (and still is) that the same sequence could be included in a sequence dictionary in either linear or circular topology, but the choice can affect how downstream tools will understand the data.

Originally we thought of putting it into the sequence (refget) itself, but that was deemed inappropriate since it was "metadata" not information about the sequence itself.

Thus, in order to be able to distinguish a dictionary that has MT/linear from MT/circular, we have to include the topology somewhere...it could be in the sequence collection, or it could be in an additional layer that embeds and annotates a sequence collection... we could include an annotation wrapper that embellishes the collection and its elements with arbitrary tags. That would be the most flexible solution and it would not make the topology a special case.

On Wed, Dec 9, 2020 at 11:31 AM Timothee Cezard notifications@github.com wrote:

During the meeting I was questioning the reasoning behind including the topology in the digest. Basic argument is that regardless of the topology stated in the collection the representation of the sequence is still linear and the topology is an attribute of biological sequence that could be held in the metadata rather than the digest.

@rishidev https://github.com/rishidev pointed to minutes from last year (2019) meeting https://docs.google.com/document/d/1h1KauiREYLhJXW-J-vMeb0Fu2CKasrwCkb55XkB4J2U/edit# which discuss this issue. One point seems to be that the adding the topology in the digest would enable the "recording" that a sequence was analysed as a circular sequence and so it should be recorded in the sequence collection associated. To me this seem like an analysis metadata though

@yfarjoun https://github.com/yfarjoun @andrewyatz https://github.com/andrewyatz Can you recollect the rational behind adding topology in the digest?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ga4gh/seqcol-spec/issues/8#issuecomment-741888828, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAU6JUQVPGQZQVUVKC3SFW3ST6Q6LANCNFSM4UTR2OCA .

andrewyatz commented 3 years ago

So, I think it's right that going forward, there's no reason we can't just make everything optional.

A side question is: when a field is missing (or empty) does it have to be empty in all the elements of the collection? For example: can I have a collection where the sequence digest is missing in one or a few of the elements?

I believe that every sequence should have all four elements we've indicated here, or to be more specific it'll have a name, length, checksum and topology is assumed to be linear unless specified otherwise.

sveinugu commented 3 years ago

One possible use case for sequence-less seqcolls:

Say you only have the chromosome and length information for a bunch of reference genomes, which would be the case for instance in a default installation of Galaxy. I think it would then make sense to create a database of these, and making use of seqcol API mostly for the comparison functionality.

Generally, I believe, in line with @nsheff that making elements optional would open up for a range of functionality that would be useful for various "real-life" challenges, especially for issues related to track analysis, which both @nsheff and myself have experience with (e.g. https://doi.org/10.1093/nar/gky474).

sveinugu commented 3 years ago

As to the issue of topology (and other aspects discussed here), I think we really need to provide a clear definition of what a sequence collection is. I see at least two directions here:

A sequence collection is just that: Named and ordered sequences (i.e. character arrays) collected together, and not much more. From this point of view, topology is metadata.
A sequence collection is a model used to represent reference genomes and other very similar concepts that are typically represented by a collection of named ordered sequences. From this point of view, topology is a central aspect, as a linear and a circular version of the same sequence are different models.

Also the second way of thinking could extend towards graph genomes, at least to a set of sequences describing a sample set of personal genomes from which a graph genome can be generated. As mentioned elsewhere in this project (cannot remember where), GRCh38 does really fall into this category due to the alternative sequences. Under this view, a seqcol might also cover a set of coordinate systems without sequences. The obvious difficulty then is to decide on where to draw the line of what is in scope or not.

andrewyatz commented 3 years ago

I personally believe it moves towards the second point of view. That topology is important.

Also the role this will play with graph genomes is an interesting one. This for me depends on what the unit of work of a graph genome will be. Do we want to refer to common builds/paths through a graph of haplotypes or something deeper? I think those are open questions for the moment and this standard looks to resolve the existing issues with reference genomes which will remain linear for some time to come & continue to play a part in many clinical diagnostic pipelines

nsheff commented 3 years ago

We've just had a lot more discussion on this point in the refget call. We have gone back and forth; in the end I think the consensus at the moment leaning toward option 1, that is, not including topology as a fundamental to a sequence collection, and then building the topology metadata information on top of that as an extra layer of information. But, we are still having some debate and we see compelling arguments in both directions.

Arguments for not including topology in the digest are:

topology is not inherent to a sequence; the sequence itself in either linear or circular form is the same, at least as stored in a computer. The digests are really describing the sequences as we store them in the computer.
Is there a real-world case where you'd have a linear and a circular version of the same sequence? MtDNA for example is always circular.
the 'circular' vs 'linear' is a function of a particular analysis, not of a sequence. It is surely useful to annotate sequences with biological information like topology for a particular analysis case, but why would you want that inherent in the digest of the sequence collection.

Arguments for including topology in the digest are:

SAM headers include topology as a field. So, it would be nice to be able to query a server with a digest, and get all the required information back -- including topology. If topology is included as "metadata layer", it could still be returned by the server, but then you're not guaranteed to get the same thing back for a particular digest from different servers, because the topology information was not included in the digest.

nsheff commented 3 years ago

Other metadata like topology?

What other information is there that is of the same class as topology? One thing that came up is a sequence mask. A mask is not inherent in a sequence, but annotates a sequence. The difference here is that a mask annotates individual elements in the sequence and can be represented by regions, for example; the topology is an annotation of the entire sequence.

But another thing that annotates a primary sequence is "primary" vs "secondary". We previously discussed annotating sequence collections with a way to flag which elements are core elements, like the assembled chromosomes in the human genome, and which are secondary, like unassigned contigs or other random sequences. This feels like the same type of information as topology.

nsheff commented 3 years ago

Here's a proposed solution

Sequence collections are ordered lists of sequences/lengths, not including topology or mask or primary/secondary annotation.
We think of topology the same way we'd think of a sequence mask: it's an additional piece of information that lives on top of a sequence collection.
If all you need is a "pure" sequence collection (no topology or mask), you simply use the digest, and you're guaranteed to get what you need. I think this fills 95% of use case, with the simpler version
For use cases that requires an annotation (topology or mask or primary/secondary annotation), then you'd need 2 pieces of information. You'd have to provide the sequence digest, and the annotation information.
We (or someone) could build separate services that provided unique identifiers for 'masked sequence collections' or 'topology-aware sequence collections', which consist of a pure sequence collection and an additional set of information. We could have checksum algorithms that would accommodate this.

andrewyatz commented 3 years ago

Latest solution to this gets my vote.

sveinugu commented 3 years ago

We've just had a lot more discussion on this point in the refget call. We have gone back and forth; in the end I think the consensus at the moment leaning toward option 1, that is, not including topology as a fundamental to a sequence collection, and then building the topology metadata information on top of that as an extra layer of information. But, we are still having some debate and we see compelling arguments in both directions.

Arguments for not including topology in the digest are:

topology is not inherent to a sequence; the sequence itself in either linear or circular form is the same, at least as stored in a computer. The digests are really describing the sequences as we store them in the computer.

I guess this comes down to what the goal of seqcoll is: Are we trying to just describe the sequences as they are currently stored, or do we try to improve upon the current state? As a latecomer, I am not exactly sure of what the goal is, but personally I would think improving on the current state is a worthy goal. So a scenario (which is not very far-fetched) is that some analysis system that do handle circular and linear genomes differently (which all really should) would be able to, from a bunch of digests, be able to grab the correct sequences and preprocess them according to topology, without knowing what they represent in terms of biology. So I would say topology should have been a part of the sequence info from the get-go (e.g. encoded in FASTA), and that we now have a possibility to improve on that lack.

Is there a real-world case where you'd have a linear and a circular version of the same sequence? MtDNA for example is always circular.

Probably not, but I think a more relevant scenario is the one I mention, as the topology information is crucial for managing the sequence correctly, even though many tools don't care about this. Without this, one needs to add more command-line parameter to the tool, and I think we already have plenty of those.

I do however think that we could try to make such a parameter more abstract, in order to support later extensions, such as "sequence_type" or similar.

the 'circular' vs 'linear' is a function of a particular analysis, not of a sequence. It is surely useful to annotate sequences with biological information like topology for a particular analysis case, but why would you want that inherent in the digest of the sequence collection.

I disagree. I don't think it is a function of the particular analysis. However, it is not a function of the sequence if by sequence you mean the contents of a FASTA file. However it is a function of the mathematical model of a DNA (or whatever) sequence, which is really what you need for analysis.

Arguments for including topology in the digest are:

SAM headers include topology as a field. So, it would be nice to be able to query a server with a digest, and get all the required information back -- including topology. If topology is included as "metadata layer", it could still be returned by the server, but then you're not guaranteed to get the same thing back for a particular digest from different servers, because the topology information was not included in the digest.

The centrality of SAM to all things in bioinformatics makes this a heavy argument in my mind.

sveinugu commented 3 years ago

Other metadata like topology?

What other information is there that is of the same class as topology? One thing that came up is a sequence mask. A mask is not inherent in a sequence, but annotates a sequence. The difference here is that a mask annotates individual elements in the sequence and can be represented by regions, for example; the topology is an annotation of the entire sequence.

I'm a bit uncertain what a sequence mask would represent here. I assume it is not the same as the Ns in the sequence (hard mask), but perhaps soft masks, like what was done with repeating regions (https://www.biostars.org/p/98042/)? So the repeating regions is clearly metadata annotation (a track in itself), while the hard mask is a part of the sequence information (I think we all agree on including N's in the sequences). So I would say that topology here is at the level of the hard masks, which we do include.

A side-note of relevance: One of the main issues creating lots of false positives in the analysis of track files is that most common formats (except GFF in theory, but not in practice) do not include mask-type information about which regions of the DNA the track is defined. Most tracks are e.g. not defined in N-rich regions (e.g. centromeres) and other regions with a high number of short repeats. It is very different that a lack of a region is due do technical reasons (e.g. that the current analysis window is within the centromere) or that the lack of a region represents a real lack of some genomic feature. So this is essential information about the track file that should always have been included (in addition to having an identifier of the reference genome of course). It can be estimated from the reference genome, but only the creator of the track would know if there are other constraints that would have made it impossible for a region to appear in certain areas. See e.g. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2438-1.

So I would argue that this is the similar type of scenario where essential information is lacking from the sequence files due to some historical judgements by someone. Topology is one of these types. Primary/secondary is another, as you discuss below:

But another thing that annotates a primary sequence is "primary" vs "secondary". We previously discussed annotating sequence collections with a way to flag which elements are core elements, like the assembled chromosomes in the human genome, and which are secondary, like unassigned contigs or other random sequences. This feels like the same type of information as topology.

So what about sequences of other types than DNA? I assume we are supporting those also, but would it not be a point to know which kind of sequence we are talking about. But instead of having an ever-increasing list of terms for this, one could e.g. denote the possible characters used in the sequence, e.g. acgnt (sorted order) for basic DNA sequences, but there are plenty of other options out there: https://en.wikipedia.org/wiki/Nucleic_acid_notation. This string would be very different for protein sequences (if that is to be supported). It would be extremely useful for a parser to be able to know the character set in advance to check if one can handle the sequence. Also, in case a specific set of characters is in theory supported, the lack of use of those in the sequence does also carry information (similar to the track example above).

EDIT: GFF only supports leaving out bits in the beginning and end, not in the middle of sequences, but I have anyway never seen this used.

nsheff commented 3 years ago

Array-based approach

Introduction

In discussion at the refget meeting, we came up with a few new insights on this. For example, following the proposal above, if we include topology outside of the core sequence collections as a standalone array... this leads to the idea of encoding the names, sequences, and lengths as arrays as well, instead of as properties of a sequence, bundled together. This has a few nice advantages, and some disadvantages as well.

I think a useful way to conceptualize the question is, what is the schema for data included. We considered a few possible schemas. Here I'll try to summarize, it's easiest to show with a simple example of a sequence collection with 3 sequences.

Here, we use an array of objects, each with 3 properties: name, length, sequence. This is the way I've been thinking about it so far.

seqcol = [{'name': 'chrUn_KI270742v1',   'length': '186739',  'sequence': '2f31c013a4a8301deb8ab7ed1ca1cd99'},
 {'name': 'chrUn_GL000216v2',   'length': '176608',  'sequence': '725009a7e3f5b78752b68afa922c090c'},
 {'name': 'chrUn_GL000218v1',   'length': '161147',  'sequence': '1d708b54644c26c7e01c2dad5426d38c'}]

A new proposal is, we can just think of it as 1 object, with 3 properties: an array of names, array of lengths, and array of sequences.

seqcol = {'names': ['chrUn_KI270742v1',   'chrUn_GL000216v2',   'chrUn_GL000218v1'],
  'lengths': ['186739', '176608', '161147'],
  'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99',   '725009a7e3f5b78752b68afa922c090c',   
'1d708b54644c26c7e01c2dad5426d38c']}

Side note: digesting each array separately

This leads to the possibility of digesting these arrays separately. If we digest each one separately it looks like:

{ 'sequences': '8dd93796fa0225e92eb159a8779f1b254776557f748f8bfb',
 'lengths': '501fd98e2fdcc276c47306bd72c9155489ed2b23123ddfa2',
 'names': '7bc90a07cf25f2f64f33baee3d420ad1ae5f442055280d43',
}

Now, this is kind of nice because we get a sequence-only digest, and a name-only digest, etc. Maybe we would only digest the sequences, and not the names/lengths? Or maybe none of them.

Adding in topology

But regardless of whether we digest each array separately, another advantage of using separate arrays is that it seems to be nice for fitting in more information, like topology. So, for example:

{'names': ['chrUn_KI270742v1',   'chrUn_GL000216v2',   'chrUn_GL000218v1'],
  'lengths': ['186739', '176608', '161147'],
  'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99',   '725009a7e3f5b78752b68afa922c090c',
   '1d708b54644c26c7e01c2dad5426d38c'],
   'topologies': ['linear', 'linear', 'linear'],
   'priorities': ['primary', 'primary', 'secondary']}

This feels clean; the seqcol is just a bunch of named arrays, and you use whichever ones you need. Each array must be the same length. If we digest each option individually and then digest them together, then for example, one seqcol in "flat form" could be:

{'topologies': '',
 'sequences': '8dd93796fa0225e92eb159a8779f1b254776557f748f8bfb',
 'lengths': '501fd98e2fdcc276c47306bd72c9155489ed2b23123ddfa2',
 'names': '7bc90a07cf25f2f64f33baee3d420ad1ae5f442055280d43',
 'priorities': ''}

Another could be:

{'topologies': '951c378c80dd043fe97bbb44c22bd7afb6931573905e3a13',
 'sequences': '8dd93796fa0225e92eb159a8779f1b254776557f748f8bfb',
 'lengths': '501fd98e2fdcc276c47306bd72c9155489ed2b23123ddfa2',
 'names': '7bc90a07cf25f2f64f33baee3d420ad1ae5f442055280d43',
 'priorities': ''}

We can immediately tell they have the same sequences/lengths/names, but one has topologies and the other doesn't. And of course, we can recurse to get whatever layer of information we need.

Issues with the array-based approach

This seems appealing to me, but there are a few issues:

It makes all these secondary annotations part of the main checksum, so you're kind of forced to consider these things we want to be optional. So, I kind of miss that the simple, core things, sequence/name/length, can't be specified as a separate unit.
It fixes the "secondary annotation" possibilities, so we'd have to pre-define these. Maybe that's a good thing, maybe not. topologies, priorities -- anything else? masks? I guess this can probably go on forever... topologies and priorities are fairly straightforward because they're just boolean vectors, really -- but masks gets much more complicated, and I'm not sure we want to tackle that now.
I'm not sure I'm sold on the need to digest each array separately. But, we can still use the arrays without that aspect of it.

Solution to issue 1

A solution to problem 1, we can introduce a new structural layer, say, rawseqcol, that holds just the "Big 3": sequences, lengths, and names:

{'rawseqcol': '{'sequences': '8dd93796fa0225e92eb159a8779f1b254776557f748f8bfb',
    'lengths': '501fd98e2fdcc276c47306bd72c9155489ed2b23123ddfa2',
    'names': '7bc90a07cf25f2f64f33baee3d420ad1ae5f442055280d43'}'
 'topologies': '951c378c80dd043fe97bbb44c22bd7afb6931573905e3a13',
 'priorities': ''}

In just 1 layer it looks like:

{'rawseqcol': '5655f27ed3dd7b8bb7c108c4ddc562979f82bc294877988c',
 'topologies': '951c378c80dd043fe97bbb44c22bd7afb6931573905e3a13',
 'priorities': ''}

So, this adds the complexity of an additional layer in the object, but gives the ability to use that 'rawseqcol' digest independently, if you don't want an 'annotated seqcol' with all the other stuff. So now, you'd end up with a checksum for the raw sequence collection (without topologies), and this is what I already implemented in a demo server.

Solution to issue 2

But this doesn't solve the second problem above. How do we define the metadata that sits alongside topology, like 'priorities', and 'masks' and things like this? Where does it stop? is it fixed? An alternative could be to allow arbitrary "tags" with data, so, it would look something like this. At the top level is a split into seqcol and annotation objects:

{'rawseqcol': '5655f27ed3dd7b8bb7c108c4ddc562979f82bc294877988c', 'annotation': '9cxkw...' }

The annotation object groups all that "other stuff" (outside the "Big 3"), and looks like:

{'annotation': [{'name': 'topology', 'value': ['linear', 'linear', 'linear']},
  {'name': 'priority', 'value': ['primary', 'primary', 'primary']}]

It's now flexible, so 'topologies' is not hard-coded into the schema...but you can provide it by specifying the 'name'. So you have to name your object as a variable ("topology", "priority", etc), and then provide its value, instead of hard-coding those possibilities in the schema. So, this is kind of nice because it can be used for any sort of future annotation. But -- it also explodes the possibilities, so now you can make a checksum for whatever kind of notations about your sequence you want, and now we've lost a chance to standardize and nobody's digests will match. It also makes it harder to define the checksum algorithm, since somehow that identifier will need to be included in the digest, instead of just inferred. So, this is starting to really complicate things, and I'm not sure it's worth it.

Discussion

After thinking through these options for several hours, I come back to the core question: what elements are part of the identity of the sequence collection? I think all the things we've discussed here are valuable, and @sveinugu raised some really good points about how important topology and masks and priorities are. But we need to separate out the components that belong in the digest from those that just annotate that digest. This seems like the critical question: what elements are part of the identity of the sequence collection? Because the rest of it, while valuable, seems to me like it should be included somewhere, but does it really need to be digested? That's what we're trying to decide here; not general importance, just what belongs in the unique, global identifier.

I think we don't need to make "Sequence Collections" solve every problem. It can be a very simple thing, that will be easy to implement, easy to use, and very universal... and easy to extend with whatever other information eventually makes sense to extend with. Then we or others can build things that extend it. So, I think it's more important that it be extensible, simple, and re-usable than comprehensive.

My opinion on including each item in the digest

Name

On one hand, name is not much different from topology. The name of the sequence is not inherent in the sequence itself, either. So, what's the rationale for including it? Well, I think the name is really something separate -- it's an identifier of a sequence. So, it seems appropriate to include on those grounds as an identifier. Furthermore, a huge number of analyses critically depend on names, and if they differ, then the analysis results differ. In a sense, by including name, we've already lost the "only include things inherent in a sequence" argument, and we've conceded that sequence collections are not a pure abstract representation of data, but reflect analysis use cases. In that sense, the only argument against including topology is that its use is nowhere near as widespread or as important as name, and it's not part of original fasta format, so we often lack that information.

Lengths

There's an argument that lengths shouldn't be included in the digest, either, since they're redundant with sequences. Providing them is one thing, but we could technically provide them in the API without including them in the digest. Given a set of sequences, the lengths are fixed, so why include them? Well, one reason is that we want sequence collections to be able to represent "Chromsizes" files, or coordinate systems, which consist of just names and lengths (no sequences). So, in this case, you'd have to include the lengths in the digest, because there are no sequences. To make a single digest algorithm be able to handle both 'chromsizes', as well as actual sequences, it seems to make sense to just require lengths to always be included. The alternative would be to come up with a separate digesting algorithm for coordinate systems...

In that case, maybe a coordinate system is names+lengths, and keep sequences separate:

{
  'coordsys': {'names': ['chrUn_KI270742v1', 'chrUn_GL000216v2', 'chrUn_GL000218v1'],
            'lengths': ['186739', '176608', '161147']}
  'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99','725009a7e3f5b78752b68afa922c090c',   
'1d708b54644c26c7e01c2dad5426d38c']
}

Now, we can digest the sequences separately. I suppose this is a possibility. Then, maybe our primary digests could be built from a tuple of (coordsys, seqcol). When I think about it this way, topologies seems to fit in better into the 'coordinate systems' data model, and then I might do something like:

coordsys:
  names: [array]
  lengths: [array]
  topologies: [array]
  priorities: [array]
sequences: [array]

Topologies

I'm still torn on this one. I think sequence names are practically widespread and useful enough to belong in the digest, but on topologies/priorities, I'm not really sold that they are practically useful or widespread enough to justify including in the core definition. These can be handled outside sequence collections. The advantage I see to excluding them is that in our core seqcol identifier set, there will be fewer canonical sequence collections, and so they will be more reusable, because we won't have the problem of "the identifier with the topologies included vs the one without". Because, as great as it would be if all seqcols included the proper topologies, many people will not bother, and the data doesn't always even exist, leading to multiple sets of near-equivalent identifiers in widespread use (which is basically where we are now). But -- including them means they can be used, at least, and maybe that opens some new use cases, without too much of a cost. So, I don't know.

Priorities

If we include topologies, I don't see why we can't also include priorities.

Masks

If we include priorities, I want to include masks. But masks feel like a much bigger issue, and I'm afraid this causes the problem to balloon out of control. Therefore, I want to set the cutoff much smaller, and then deal with these other issues in the next project.

Final thoughts

I like the simplicity of the independent arrays. Therefore, I would probably choose representing a sequence collection like this:

seqcol = {'names': ['chrUn_KI270742v1', 'chrUn_GL000216v2',   'chrUn_GL000218v1'],
  'lengths': ['186739', '176608', '161147'],
  'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99','725009a7e3f5b78752b68afa922c090c',   
'1d708b54644c26c7e01c2dad5426d38c']}

lengths is required.
names and sequences are optional. topologies and priorities -- maybe.

I would probably not re-digest everything, so we create the string-to-digest by concatenated array items with one delimiter, and then arrays with another. The order would be lengths;names;sequencedigests, so for example (here shown with arbitrary delimiters):

186739,176608,161147;chrUn_KI270742v1,chrUn_GL000216v2,chrUn_GL000218v1;2f31c013a4a8301deb8ab7ed1ca1cd99,725009a7e3f5b78752b68afa922c090c,1d708b54644c26c7e01c2dad5426d38c

Side note: one nice thing about digesting each one independently is that we only need 1 delimiter.

Then, we start another, separate project surrounding extending that unit with additional information (topologies/priorities if not included, and masks, or anything else).

andrewyatz commented 3 years ago

To add a few items from the refget meeting today

Splitting into separate arrays

Broad agreement that this is the right way to approach the problem and offers the most flexibility to ourselves and implementors to bring in new data concerning sequences (considering the list has grown to include topologies and priorities and masks it's likely this will continue to grow).

Attribute importance

A sliding scale was proposed of attribute importance roughly running like

sequence digest
name/length
topology
priority
masks

Digest algorithms

A few solutions were suggested but concerns were raised over the absence/presence of key fields and the flexibility/inflexibility of the digest i.e. how do we ensure digests do not need to be recalculated all the time. Suggests to avoid this included

Namespaces which denoted the algorithm version used and having it as a fixed set of fields
Namespaces which denoted the fields used and allowing any order to be used (though likely this will could cause implementation issues)
Using multiplication of checksum digits to create collection level identifiers (multiplying checksums is not dependent upon order)
Namespacing the digests with what they were and therefore the algorithm can add/remove fields as needed

nsheff commented 3 years ago

Here's an approach. Build a string-to-digest like this:

3,lengths,152,252,424,names,chr1,chr2,chr3,sequences,bc90a07cf2,aee3d420,72c915234c

The first number is the number of items in the array. All arrays must have that number of items. Before each delimited set of items is the identifier for the category (lengths, names, sequences). These must be in alphabetical order.

Now to make an identifier without sequence digests, you'd do:

3,lengths,152,252,424,names,chr1,chr2,chr3

To add in topology, you'd do:

3,lengths,152,252,424,names,chr1,chr2,chr3,topologies,linear,linear,linear

The advantages of this approach is that it's flexible with respect to what we can put in the string-to-digest, so it's backwards compatible if we add something in the future, since there's no affect on the digest to leave one of the possible arrays out.

nsheff commented 3 years ago

With a second delimiter, we could avoid the need to put the number of items in the array. This is a more flexible approach as it would allow arrays to contain different numbers of items. Maybe not necessary for this particular use case, but it makes it more universal.

lengths,152,252,424>names,chr1,chr2,chr3>topologies,linear,linear,linear

lengths>152>252>424,names>chr1>chr2>chr3,topologies>linear>linear>linear

tcezard commented 3 years ago

Couple of questions or comments

We will need to have a list of keywords to specify each fields: what would these be?
- sequences
- names
- lengths
- topologies
- priorities
- masks
Do we still want/need to have mandatory fields or are they all optionals as long as one is present ?
If we still require some to be provided what are those? Previously we mentioned lengths and names as required and the top post also mentions topology.

sveinugu commented 3 years ago

Some comments after the last meeting (which seems to have turned into more of a novel!):

@nsheff I really like the recursive seqcol implementation, and I believe we are almost there now.
Also, I think I agree on not including a string to signify which arrays are included (such as 'lns', for lenghts, names, sequences) in the top-level identifier. For any use-cases that would need this, it should be stored alongside the identifier as metadata.
@tcezard. In my mind, the important question regarding which arrays to support in the standard is whether it makes sense to include such information when using sequence collections digests as identifiers in some sort of applied service. specifying which arrays to include makes it possible to tailor the uniqueness property of the digests to a particular use case. A simple way to think of such use is to ask whether it makes sense to sort some kind of data/metadata into buckets defined by a subset of the possible arrays (i.e., that all the data linked to sequence collections with identical digest, as defined by a set of arrays, are put in the same bucket).

Example: As has been argued previously, one natural application would be to use seqcols to replace 'chromlen' files for track analysis (simple tabular files that include the sequence names and lengths). The current implementation would remove the need for such files completely, as their content would be directly available from a single API call (recursion level 1). For such type of use, one could with the current idea be able to put all track files made from a single major version of the human reference genome (e.g. GRCh38) into one bucket, regardless of the patch version used to generate the track. This is because patch updates only update the sequence content but not the chromosome coordinates, i.e. names and lengths). This is how the genome reference metadata fields are typically categorized today, for instance in genome browsers.

EDIT: On further thought, I don't think the above is actually true. I believe some patches add alternative sequences. Also, placing previously unplaced sequences within a chromosome will remove those unplaced sequences from the full list, thus changing the digest. So in order to provide the required buckets using seqcol, one would need to define collections with only the core chromosomes. This is what is really meant by specifying e.g. "GRCh38" as the reference genome of a track file, as one typically does not care about the non-chromosome sequences (except perhaps MT). Supporting track files is something I believe is a very important use case for seqcol. It is also relevant for this (more or less randomly selected) new publication: https://f1000research.com/articles/10-268. My complete suggestion below would solve this issue.

Based on the "bucketing" notion, I have tried to think a bit about the current suggested array types (and some new ones), and possible natural combinations of them, as follows:

A. sequence

This is obviously essential. Being able to bucket data/metadata according to the sequence content, regardless of the sequence names, is obviously useful, especially in the context of the UCSC vs. rest of the world divide on chromosome names. We can probably come up with many examples of this, and you probably already have (I did not double-check with the use case document ).

B. names + lengths

I have already explained the usefulness of this for track file analysis.

C. sequence + names (+ length)

This would be the most common way of bucketing e.g. reference genomes. Adding lengths (or other directly sequence-derived arrays) would not change the bucketing ability of the digest, other than defining a second set of buckets with the same properties (the only difference being whether lengths were added to the digest calculation or not). In such a context, being able to easily see the difference between the two types directly from the digest might be a plus, but one could also simply store this info as metadata, filter out one of the digest types as part of data import, etc. The same logic applies to the sequence + lengths option.

D. names (labels?) only

I feel that there should be some use cases for this, but I have a hard time coming up with scenarios where adding either the sequences or the lengths (or both) would not be a better choice. (Btw: should we rather refer to names as "labels"?).

E. lengths only

Don't think so.

A-E. core rules

From the above considerations of the 3 most "core" arrays, we could define the following two core rules:

i) Either sequences or names (labels?) or both must be present ii) lengths must always be present. (This would remove the "double bucketing" issue for at least the three core arrays.)

Rule i) and ii) results in 3 possible combinations of core arrays.

Rule ii) also has as a consequence that one would always be able to prepare for a full sequence download at the level of the first recursion level, for instance by preparing some data structures at the appropriate lengths, or selecting API for sequence retrieval based on the size (e.g. allowing small sequence downloads from the seqcoll API, but switching to another API for larger sequences).

Assuming these two core rules, let's continue with the other possible arrays (which would then never be found without the core arrays, something which makes total sense to me):

F. topologies

So we have discussed this back and forth for a long time. There are scenarios where such information is crucial for an algorithm, but one could also in those cases just require topology to be added as additional metadata. As topology is unlikely to change in different versions of the same genome reference (other than such information being present or not), it might not seem very useful to add topology (in addition to the core arrays) when defining the buckets. It will only work as a flag on whether topology info is being taken into consideration or not. However: I believe adding the possibility will not hurt much in the current implementation idea and it will have as a consequence that topology information would become more easily available and would thus, in practice, also be used more for algorithms/services that can make use of it (assuming that adding topology info is often done through an optional tool parameter that most people just skip, which I believe is a reasonable assumption). Hence, just including topology as an optional array in the standard would most probably result in a slight increase in overall data/result quality in the life science field, which is not a bad argument, IMO.

G. priorities (sequence groups?)

I believe this would be same/similar to what in the NCBI assembly database is referred to as "Sequence role", in combination with "Assembly Unit", see e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_assembly_report.txt (from https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26). This basically represents a subdivision of assembly sequences into smaller groups:

In e.g. the human genome, the assembled-molecule/Primary Assembly combo contains the chromosomes, the assembled-molecule/non-nuclear combo contains the MT sequence, the alt-scaffold role contains the alternate sequences, and the unplaced-scaffold and unlocalized-scaffold roles are the unassembled sequences.

I believe such subdivisions are important information to have available in a machine-readable way, and having this information in a seqcol array is an easy way to support this. However, for identifier-based bucketing, adding such a subgroup category array would only work as a flag on whether such information is available or not, across all sequences, which is of limited use. In contrast with the topologies array, adding the subgroup info to a sequence would probably not change how that sequence is handled by itself. The most natural usage scenario is to make use of such subgroups for filtering. However, in practice this is typically done prior to analysis, so that the sequence collection used in the analysis step would no longer match the full collection from which it came. Such filtering leads to genomes such as the b38 genome (containing no alternate sequences, I believe), used for mapping, variant calling, etc. However, one could quite easily support identifying many of these sub-genomes by carefully defining a subgroup array and then automatically calculate digests for the sequences that are included in each subgroup, or in combinations of them:

e.g. if one divides human sequences into: chroms, MT, alt, unplaced, I can at least see use for the following combos: chroms + MT, chroms + alt, alt + unplaced (=decoy?). For four subgroups, there are 16 combinations, so there is in that case not much extra overhead in calculating all combinations.

The problem is obviously the general applicability of such categorization across sequence collections. For instance, what about protein sequences? However, I believe the advantages, especially in light of the FAIR principles, are so great that I think we should explore this path.

H. masks (sequence gaps?)

One can define many possible types of masks, but we have not (in the meetings I have been present, at least) delineated this much. There is at least one type of mask I would argue would be very useful as an array, namely a mask specifying the gaps in the assembly, perhaps in combination with contig placement (such as CONTIG in https://www.ncbi.nlm.nih.gov/nuccore/NC_000001.11). For DNA, one could also derive such gap info directly from the 'N' characters (which I believe would include also smaller gaps). There would be similar masks derivable from other alphabets, e.g. proteins, where gaps will be encoded differently (I believe, by a period character).

Example: There are situations where one would be interested in such assembly gap information without needing access to the sequence itself, for instance when randomizing the position of feature regions (such randomization should avoid the assembly gaps, e.g. the centromeres/telomeres). Since almost no track files really include basic information of which parts of the genome where the tracks are defined, having assembly gap information as part of the reference genome digest would be an easy way to attach such info to track files. This would improve the analysis of tracks tremendously, as the lack of a feature region in a loci also carries information, but only if such a lack is not due to a gap in the assembly. See e.g. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2438-1. Putting the track files into buckets with the same set of gaps in the assembly (independently of the base pairs) would be useful to find out which files can be analyzed together under the same assumptions. This is a use-case that is directly relevant for me.

I. alphabet

I would like to add 'alphabet' to the list of arrays. In my mind, this is not the same as the summary of the characters used in the sequence, but the possible characters that could have been used. Similar to the assembly gaps use case above, the fact that a character is not used in a sequence position is a separate piece of information given that we know the character was allowed in the first place. Hence, the alphabet (defined in this way) adds information to what is inherent in the sequence itself, and one can easily imagine sequences that are the same but are defined in relation to different alphabets. Having the alphabet available in recursion level 1 (together with the lengths) would also help if one needs to pre-register some data structure before the sequences are downloaded. Also, even if this is not the main argument, having the alphabet available would be a simple way to differentiate between e.g. DNA and protein sequences, without having to add a sequence type array/metadata field (which is btw also be a possibility, of course, but I think I will stop here...).

F-I. rules for non-core arrays

If all these non-core arrays (and more) are supported, there will be a large number of possible combinations of them. In this case, given 3 core configurations and 4 extra arrays, there are 3*2^4=48 total combinations. If one has all this information available, calculating the digest for all of the combinations seems a bit over-the-top, especially if in combination with different subgroups. Hence, it is probably smart to place the extra arrays into groups where all arrays in a group are required to be present, at least if one is generating digests for various combinations of arrays. Here is a suggestion of such rules:

iii) topology, alphabet and sequence gaps are optional, but must be added together as a "sequence model" entity. iv)sequence groups are optional, but if added, digests should be created for subcollections of sequences with a particular value in the sequence groups array, as well as for all combinations of these. v) Digests for all allowed combinations of arrays according to the above rules should be calculated (in combination with the subcollection calculations).

Conclusion

In the case of 4 sequence group values, there are 3 (core) 2^2 (extra) 2^4 (seq groups)= 192 combinations to be calculated for each full sequence collection. This of course represents a bit of overhead, but we are still not talking about much data.

Using the above rules, one would for instance automatically have a resolvable identifier for the "chromlen" data of core human chromosomes in GRCh38, including sequence model information, something which would be incredibly useful for genomic track analysis tools/scripts. Instead of having to manually find and add such data using relevant command line parameters, one would just need to specify the seqcol digest, and the tool could automatically do the rest directly using the seqcol API. For me personally, this would solve some major implementation issues in an elegant way.

So what so you think? There is a huge number of details to discuss here, but there is also a discussion to be had on a more philosophical level on what kind of problems seqcol should solve. I am, as you can easily conclude from the above, in the "we have an opportunity to be heroes" camp!

And sorry for the length of this post...

nsheff commented 2 years ago

Above is a lengthy discussion on what to include in the array-to-digest. This thread happened as we transitioned to the array-based digest method (also addressed in issue #10). Anyway, the new array-based approach changed how we conceptualized what is included in the sequence collection, and made it much more flexible. We sort of see it as being able to hold whatever arrays a user wants, with a few controlled vocabulary terms. It gives a lot of flexibility to the user, and also reshapes the question here a bit, to: which arrays should be mandatory? This is addressed in pending ADR in PR #18.

We haven't really finalized this, so I wanted to finish this. It's clear from discussion that sequences should be optional, and lengths should be mandatory. But what about names? Should names be mandatory?

A corollary is whether and where we enforce uniqueness. It's clear to me that we don't enforce uniqueness for lengths. I guess we also don't enforce uniqueness for sequence. But what about for names?

I think @daviesrob had some thoughts here as to the mandatory-ness and uniqueness constraints of the names array. Can we come to a conclusion here, now that we have increased clarity on what a seqcol is?

andrewyatz commented 2 years ago

Making names mandatory

From memory the argument here is that a collection could be composed only of names and lengths. We also said it was likely a collection would define either names of sequences to stand alongside the lengths. I also think we suggested someone might just duplicate the checksums in sequences to provide names if we made names mandatory.

If the above is a good recollection then I would support making names/sequences required but only require one of those in a given collection. This also seems to align with a number of @sveinugu comments about length alone would not be a valid collection.

Uniqueness of names

The comment about sequences being non-unique from memory is because two independent sequences could create the same checksum but have different associated metadata. Is there a reason why a sequence collection would be invalid if a collection had a total duplication of an entry? Sequence, name & length were just duplicated. I personally think if we can answer that question (perhaps we have done last week) then the uniqueness of names would be an easy thing to answer. If you can't duplicate an entry in a collection then names can be unique

nsheff commented 2 years ago

I think you're right, Andy. I like the concept of "total duplication".

Is there a reason why a sequence collection would be invalid if a collection had a total duplication of an entry? Sequence, name & length were just duplicated. I personally think if we can answer that question (perhaps we have done last week) then the uniqueness of names would be an easy thing to answer. If you can't duplicate an entry in a collection then names can be unique

Maybe to just restate this question in another way: Which of the following data structures is a sequence collection?

a ordered set
(ordered) list

Since our entries are split out into separate arrays, It will be easier to implement if we allow total duplication of an entry, because we must use lists under the hood to accommodate duplication within an array. However, if we don't need total duplication allowed for some reason, then implementing a check to prevent it may give us some nice computational properties down the line, such as being able to assume that the concatenation of name + sequence is unique, or that name must be unique.

I suppose we could enforce disallowing total duplication of an entry by simply requiring and enforcing that names be unique.

daviesrob commented 2 years ago

Unique names are required for some use-cases (e.g. SAM and VCF). Unless we can think of a really good case where having duplicated entries is necessary, I think enforcing name uniqueness would give us some very handy properties.

nsheff commented 2 years ago

Looking over my notes from our previous discussion on this (2021-08-25) we had I think approached consensus that names should be required, and should be unique. If you don't have a names array, you should just generate one with numbers 1...x as the sequence identifiers. This provides uniquely identifiable sequences.

I can't really think of a situation where this would bite us. You could always stick your non-unique identifiers into some other custom array if you needed that.

So:

names are mandatory, and must be included to provide uniquely identifiable sequences
names need to be unique.
This, in turn, will disallow total duplication of an entry.

sveinugu commented 2 years ago

names are mandatory, and must be included to provide uniquely identifiable sequences

I am a bit uncertain about requiring names in that it removes one of the more straightforward applications of the sequence collections standard: to provide a single unique digest for a collection of sequences, given only the sequences themselves. So extracting the lengths from the sequences is simple and uniquely defined, there is no choice do be done for this. However, having to generate a names array with unique values (when this information is not present) can be done in a multitude of ways, with indexing from 0...n being one and duplicating the sequences array being another. Each of these will result in a different digest for the sequence collection. As a consequence we loose one of the simplest use cases where one can uniquely and consistently create one (and only one) digest from a bunch of sequences, given that there is no other metadata.

This BTW is also an argument for providing an order-invariant digest algorithm, as having to enforce an order into an unordered set of sequences is also a process where there are several canonical ordering schemes to choose from, each of which will result in a different digest for the same set of sequences.

In fact, defining a canonical ordering scheme to be able to uniquely generate a single order-invariant digest, given order is missing/not important (which I argue for in https://github.com/ga4gh/seqcol-spec/issues/5#issuecomment-1011061095) seems similar to defining a canonical naming scheme to be able to uniquely generate a single naming-invariant digest, given names are missing/not important. Even more, it seems the choice of defining a sequence collection as an ordered list of sequences opened up the need for also defining a canonical ordering scheme in cases where order is not defined. Similarly, making the choice of requiring the names column opens up the need for defining a canonical naming scheme in cases where names are not defined.

For me, if we are unable to provide a deterministic function:

f([list of sequences] => unique digest

then I suspect that we have over-engineered things.

I have no objections against requiring unique names, but if the only argument is to disallow total duplication, I believe we might be adding unnecessary constraints.

sveinugu commented 2 years ago

In conclusion, my suggestion is to require lengths and one or both of sequences and names. If sequences are allowed to have duplicates (which I think should be the case), then total duplication of an entry would by consequence be allowed. However, it makes sense to me to require names to be internally unique, so that one can depend on the names to uniquely identify individual sequences. Then, in scenarios where total duplications should not happen, one might decide to enforce this by adding an ad hoc requirement that the names array is present. However, I don't think it should be required by the standard itself.

nsheff commented 2 years ago

it removes one of the more straightforward applications of the sequence collections standard

Well, if we define "default names", then this application is not removed. Your argument is right, but only if we don't specify the approved way to construct default names.

the only argument is to disallow total duplication, I believe we might be adding unnecessary constraints.

Could be, but there are multiple arguments:

disallow total duplication (as you mention). this does have some value.
for compatibility with SAM and VCF (Rob mentioned)
it provides a natural, easy-to-understand, easy-to-implement solution to the canonical ordering question: simply order by sequence names.
it enforces the common way we think of things, that the names are identifiers. I think this could actually prevent some downstream bugs and issues.

daviesrob commented 2 years ago

One way out of the problem, where you have sequences but no names might be to derive the names from the sequence checksums. For example, this python script will read in a list of checksums, and make names by combining the first six characters with a counter to deal with clashes. The counter values are set based on an alphabetical sorting of the checksums, preserving the original order for duplicates.

#!/usr/bin/python3

import fileinput
from collections import defaultdict

list = []
for line in fileinput.input():
    list.append(line.strip())

seen = {}
count = defaultdict(int)
for hash in sorted(list):
    prefix = hash[0:6]
    count[prefix] += 1
    seen.setdefault(hash, count[prefix])

for hash in list:
    prefix = hash[0:6]
    print(f"{prefix}-{seen[hash]:06d} {hash}")
    seen[hash] += 1

So this input:

6aef897c3d6ff0c78aff06ac189178dd
f98db672eb0993dcfdabafe2a882905c
9f26945f9f9b3f865c9ebe953cbbc1a9
3210fecf1eb92d5489da4346b3fddc6e
a811b3dc9fe66af729dc0dddf7fa4f13
9f26945414693fe0a2399cf82e707e03
5691468a67c7e7a7b5f2a3a683792c29
cc044cc2256a1141212660fb07b6171e
cc044cc2256a1141212660fb07b6171e
c67955b5f7815a9a1edfaa15893d3616
9f26945414693fe0a2399cf82e707e03
6c198acf68b5af7b9d676dfdd531b5de
9f2694147561b361091791b9010cd28b
c0eeee7acfdaf31b770a509bdaa6e51a

gets these names:

6aef89-000001 6aef897c3d6ff0c78aff06ac189178dd
f98db6-000001 f98db672eb0993dcfdabafe2a882905c
9f2694-000004 9f26945f9f9b3f865c9ebe953cbbc1a9
3210fe-000001 3210fecf1eb92d5489da4346b3fddc6e
a811b3-000001 a811b3dc9fe66af729dc0dddf7fa4f13
9f2694-000002 9f26945414693fe0a2399cf82e707e03
569146-000001 5691468a67c7e7a7b5f2a3a683792c29
cc044c-000001 cc044cc2256a1141212660fb07b6171e
cc044c-000002 cc044cc2256a1141212660fb07b6171e
c67955-000001 c67955b5f7815a9a1edfaa15893d3616
9f2694-000003 9f26945414693fe0a2399cf82e707e03
6c198a-000001 6c198acf68b5af7b9d676dfdd531b5de
9f2694-000001 9f2694147561b361091791b9010cd28b
c0eeee-000001 c0eeee7acfdaf31b770a509bdaa6e51a

This has the property that sorting by name gives the same ordering as if you sorted by sequence checksum, which might be useful, e.g. for generating a canonical order:

3210fe-000001 3210fecf1eb92d5489da4346b3fddc6e
569146-000001 5691468a67c7e7a7b5f2a3a683792c29
6aef89-000001 6aef897c3d6ff0c78aff06ac189178dd
6c198a-000001 6c198acf68b5af7b9d676dfdd531b5de
9f2694-000001 9f2694147561b361091791b9010cd28b
9f2694-000002 9f26945414693fe0a2399cf82e707e03
9f2694-000003 9f26945414693fe0a2399cf82e707e03
9f2694-000004 9f26945f9f9b3f865c9ebe953cbbc1a9
a811b3-000001 a811b3dc9fe66af729dc0dddf7fa4f13
c0eeee-000001 c0eeee7acfdaf31b770a509bdaa6e51a
c67955-000001 c67955b5f7815a9a1edfaa15893d3616
cc044c-000001 cc044cc2256a1141212660fb07b6171e
cc044c-000002 cc044cc2256a1141212660fb07b6171e
f98db6-000001 f98db672eb0993dcfdabafe2a882905c

sveinugu commented 2 years ago

Could be, but there are multiple arguments:

disallow total duplication (as you mention). this does have some value.

Yes, it does have some value, although I am uncertain of how important this is.

for compatibility with SAM and VCF (Rob mentioned)

This is a very good argument. The difference between saying "seqcol is compatible with SAM and VCF" and "seqcol is compatible with SAM and VCF given that a names array was used when generating the digest" is large and I think I am sold by this argument alone.

it provides a natural, easy-to-understand, easy-to-implement solution to the canonical ordering question: simply order by sequence names.

It does in the cases where the sequence names are already given and/or the sequences already have a natural ordering. However, if one wants to create a name-invariant seqcol digest (if names are not available or are unimportant) and the order of the sequences are not given, one still needs to first order the sequences according to some deterministic algorithm. The fact that a names array derived from the indices of the sequences can be then used to order the sequences is a circular tautology that does not help a bit in this scenario. Basing the names array on the sequence digests is a better solution, however it feels more like an ad hoc solution to a particular edge case issue than a general algorithm for naming/ordering. I will describe a suggestion below that I believe will solve this and most of the other core issues we are discussing (hint: it is a further development of my most resent idea for solving the ordering issue...)

it enforces the common way we think of things, that the names are identifiers. I think this could actually prevent some downstream bugs and issues.

I definitely agree that enforcing unique identifiers would be very useful downstream. However, I feel the above is really only an argument for requiring names to be unique IF they are present, but not an argument for enforcing that the names column as a whole is mandatory. One can argue that it would be an advantage in many typical implementation scenarios to know a unique 'names' column will always be present. However, if unique keys are needed, one can always just generate them by adding indexes, etc.

However, in conclusion, I agree that there are strong arguments for making the names column mandatory. But then I would argue that we should also define a canonical way of generating names if they are not present, and I will provide a suggestion for such an algorithm below that I believe will provide a bunch of advantages.

Edit: I will try to write out my suggestion tomorrow. Too late for me today.

sveinugu commented 2 years ago

So I have now tried to write out my idea in full (and then some).

(Warning: novel-length comment follows! This can be a nice read over the weekend, under a blanket by the fireside with a cup of tea...)

Before going into the details, I would like to take a step back and provide some thoughts as to what we are actually trying to accomplish with the seqcol standard, the way I see it (with the reservation that I was not part of this project from the start and might have misunderstood some fundamentals):

1. Create a persistent, globally unique identifier for a collection of sequences as defined in a repo of some sort, with a particular focus on reference genome repos.

2. That the identifier should be a deterministic and unique digest of the sequence collection contents, where the contents are defined in some (more or less) standard data structure.

3. Provide a standard for an endpoint for fetching a sequence collection given the identifier

4. Provide a comparison function that can return a useful overview of important similarities and differences between the contents of two sequence collections

5. Allow for use of the above in scenarios where the exact sequences are not important, e.g. to allow the creation of identifiers for coordinate systems that contain names and lengths, but not the actual sequences. The most prominent use case of this is as metadata for genome position annotation files (e.g. BED/SAM/VCF files) [which is what I would call "genomic track" files, as argued for here and onwards, but that is a digression].

That is more or less my impression of our goals. The last goal seems to be the least understood of the ones in the list, where I assume I have contributed significantly to the confusion by arguing at length and in great detail in issue https://github.com/ga4gh/seqcol-spec/issues/5 why I believe that an identifier that refer to a coordinate system should be based on an unordered data structure. One lesson drawn from this discussion is that it seems we are actually talking about three different types of identifiers:

A. Identifiers that match one-to-one to with a particular collection of sequences where the sequence content matters (goal 1 above). This is useful for database interoperability ("how to fascilitate comparison of the contents of different reference genome repos?"), for keeping data provenance ("what were the actual sequences used to generate this file?"), and a bunch of other scenarios.
B. Identifiers that match one-to-one to a particular chromSizes (or similarly named) file, as used in e.g. UCSC tools or BedTools, in which the sequences are ordered. This would e.g. be practically useful to not have to maintain/include a particular chromSizes file when analyzing e.g. a BED file in a command line tool. Instead of providing a chromSizes file, which creates extra hassle in e.g. workflow scenarios, one could instead replace a chromSizes file with an identifier given as a string parameter to the tool.
C. Identifiers that match one-to-one to a particular coordinate system in its mathematical sense, where the order of the sequences do not matter. Such identifiers would be useful in at least some contexts, such as when building a file tree for a collection of BED files based on different reference genomes that are variations of the same major genome version (e.g. hg38). In this scenario, identifiers of type C could be used as the names of directories containing files that can be analysed together since they are based on the same coordinate system in its mathematical sense, even if the underlying reference genomes the BED files are based on have ordered the sequences differently.

(Btw: this will be more than a rehash of the unordered/ordered discussion, just be patient with me...)

Identifier type C might seem the least important of the lot, and in one sense I agree with this. However, it is the only identifier type that can be used to fully represent the "common shared denominator" of several sequence collections if these are differently ordered. Identifiers of type B can easily be replaced by identifiers of type A in scenarios where you are only analyzing e.g. a single BED file (and where you in any case should preserve the A type identifier to maintain provenance). However, in scenarios where you want to identify a shared coordinate system for several files (as in a Genome Browser), such an identifier should really be of type C, I believe. The alternative is to use an identifier of type B (or A) and say, in a sense, that the ordering of sequences in B (or A) is the canonical ordering and that all other sequence orders must be converted to follow B, even if there are no good reason to pick any particular order over another.

At this point you might be thinking that, yes, having to use type A or B instead of C might be a nuisance, but it should not keep you awake at nights, so to speak. We need to focus on A and possibly B. And if this was all that there was to this, I would properly agree.

However, there is a fourth type of identifier that we have not discussed in a principled way in the seqcol meetings:

D. Identifier that match one-to-one to a sequence record within a sequence collection.

This is really what the names column is used for, and one of the issues plaguing Bioinformatics has been the inability of the community to settle on a standard for this. Instead, we still have to work around the chr1 vs 1 debacle. [Btw. there was a tongue-in-cheek sentence somewhere in the seqcol GitHub repo or web site that we were finally going to fix this, but I cannot find it now. Have we reduced our ambition level? :)] So the Refget digest is obviously a step in the right direction, but the reason why we still argue for making a name column mandatory must be that the sequence digest alone does not cover all the needs. I believe the main reason for this is the use of e.g. chr1 to refer to the coordinate system that is derived by the appropriate version of the reference genome (hence goal 5 above). However, if the reference genome identifier is missing or imprecise, it is impossible to know what exactly is meant by chr1. As a conclusion, we aim to provide the identifier of type A to solve this (Goal 1), so that the combination of the seqcol identifier and the name chr1 is a precise and globally unique identifier of the sequence!

But we have had persistent unique sequence collection identifiers all along for genome assemblies, like the GenBank assembly accession: GCA_000001405.29 (which refers to hg38 patch 14 that was released a week ago!). I believe one main thought behind seqcol identifier type A is to replace such repository-dependent accession numbers with universal content-dependent digests. But there has been nothing stopping the community from embracing annotating e.g. BED files with the corresponding assembly accession numbers. However, imprecise identifiers like hg38 are still used, especially for genome annotation data (tracks). For instance, in the Galaxy framework, there is only one metadata field for reference genomes and it is by default based on UCSC naming (i.e. hg38). I believe the reasons for this are at least two-fold:

1. users provide whatever metadata is required by tools, with the UCSC Genome Browser being a prime example of this.

2. That hg38 catches something the accession numbers (or a seqcol identifier A) do not

Regarding point 2, I believe hg38 is in practice used to categorize files that follow the coordinate system derived from the lengths of the (often only core) chromosomes of the latest version of the human reference genome, where the names follow the UCSC standard (to simplify interoperability). In addition, hg38 is humanly readable (and easily writeable, I still have to double-check every time I writeGRCh38). Moreover, I have argued at length that in most cases hg38 logically refers to an unordered coordinate system (identifier type C instead of B), but in the current argument this is in any case a minor point.

My though has been all along that if we want people to replace hg38 with a seqcol identifier, we should make sure that it solves the same problem. However, I am now slowly coming to realize that a seqcol identifier will in any case not fill the same role as hg38, at least not an identifier of type A. In practice, we aim for a seqcol identifier to replace the assembly accession numbers in current use, but such identifiers are in any case underutilized in the genome annotation/track file analysis scenarios described above. I have argued for at least providing the possibility for an order-invariant seqcol digest algorithm in order to make it possible to design a seqcol-compliant repository based on type C identifiers (as there will be at least one user of this, which is myself). But even if I will find such an unordered identifier very useful in my tool development, it has little practical chance to topple the hg38 dominance for analysis of genome annotation/track files for linear reference genomes, and I believe the only real possibility for seqcol identifier dominance in the user community (=bioinformaticians) going forward is to be sure seqcol ids are adopted for graph genome analysis scenarios (which we have hardly talked about).

So this long rant is really questioning whether we are really trying to solve the correct questions related to goal 5 (coordinate systems). So if the real issue is that the names used to identify sequences are imprecise and incompatible (i.e. chr1 and 1), perhaps we should add as another goal for seqcol the following:

6. Define content-derived globally persistent and unique identifiers for single sequences within sequence collections (let's call these seqids).

If each element within for example a BED file is annotated with a persistent identifier to the sequence it refers to, one can make sense of the data without requiring that the BED file itself is annotated with a precise enough reference genome identifier. I believe GFF, SAM and VCF have all gone this route, allowing the sequences to be precisely specified within the headers inside the file itself. For a BED file (maintaining compatibility by not adding any headers) this would mean exchanging the chr1 column with a seqid digest. For compatibility, I believe this would actually be the best choice, given that there are machine-actionable ways of relating this identifier with chr1 or 1 or whatever. Every elemental annotation unit (e.g. a ChIP-seq peak) with be uniquely related the sequence from which it came, but with no need to change the file format.

So could not the refget digest be used as a sequence id? Well, I don't think so, as the main advantage of chr1 is that it is dependent on the length of the sequence only and not the exact sequence contents, allowing comparison of data e.g. across minor patches of the reference genome. Such an identifier would also be dependent on e.g. topology, as one wants to analyse regions on circular sequences differently than regions on a linear sequence. As I have argued for elsewhere, making the identifier dependent also on assembly gaps might be useful in certain scenarios. The point being that for analysis scenarios where chr1 is used today, we could precisely replace this with a seqid that is a digest of a set of seqcol arrays for a single sequence, possibly including the sequence array, but in analysis scenarios probably not, so that the seqid instead refers to a coordinate system.

So what if we could support the following workflow?:

I. Input file i: A legacy hg38 BED file with UCSC-type chromosome names.

II. Input file ii: Properly annotated seqcol-compliant BED file based on the seqcol:af236b sequence collection (identifier type A), where all regions are annotated with type D identifiers (another digest type, e.g. seqid:64ga09) in the names column, uniquely referencing the sequence within the collection seqcol:af236b.

III. Tool X is a BED file conversion tool that unifies the input files to use sequence names that precisely identify the coordinate system in the particular context needed for the analysis. For file i, it translates chr1 into the matching seqid identifier, say seqid:3119ga, but this one refers to a sequence collection element that represents a coordinate system (i.e. , where the sequence array is dropped). This conversion might require a chromSizes file or might use some other logic or user input to map towards the correct seqcol/seqid identifiers. Note that the coordinate system records do not need to be hosted in any repo, as the digest can be directly generated from a chromSizes files or a full seqcol record. Similar conversion is done for file ii, however since the seqcol identifiers of the sequences have already been specified, there is a simple one-to-one match between this and the related coordinate system identifier (in this example, the digest of the coordinate system derived from sequence item seqid:64ga09 by removing the sequences array value might also be seqid:3119ga). The output of tool X are two new BED files with the appropriate seqid identifiers in the names column that should match between the files given that they refer to the same concept of a sequence.

IV. Tool Y is an analysis tool that might or might not take advantage of the features that the seqid identifiers provides. Standard and legacy software that takes BED files as input should work with no changes required.

V. References to seqids in any output files or other results can be translated back to e.g. chr1 through APIs, libraries, or similar.

So how would this be implemented? Here's my suggestion:

names	seqids	lengths	sequences	topologies
chr1	seqid:64ga09	150000000	refget:a39a48	linear
chr2	seqid:fa01482	120000000	refget:001f16	linear
chr3	seqid:7cc309	110000000	refget:911b00	linear

So the seqids column are generated as a digest of the per-sequence values in the other arrays except for the names column. We should require that the seqids are unique within a collection, which I believe in most bases will not be a problem. In cases where there is full duplication for all arrays excepts names, one would need to add another array to split them. In most cases I suspect there will be a natural such array that should be included. If not, one can always, as a standardized last resort, add a duplicate_index array that, for each total duplication, counts from 0 (or 1) to the number of identical copies, e.g.

names	seqids	lengths	topologies	duplicate_index
chr1	seqid:64ga09	150000000	linear	0
chr2	seqid:fa01482	120000000	linear	0
alt_1	seqid:7cc309	12345	linear	0
alt_2	seqid:47g880	12345	linear	1

In addition to the advantages of including globally unique and persistent per-sequence identifiers that can be customized to almost any need, there are also a bunch of advantageous technical properties of this solution:

Since the seqids array contains digests of the other arrays except the names array, the top-level digest can be created by digesting the names and seqid arrays only.
seqids are globally unique and can be used to compare individual sequences in different sequence collections, regardless of their names. In the comparison API, we would get extra counts of all the overlaps where everything but the name is equal.
Addition to the last point: for seqids to be comparable across sequence collections, these must include the exact same arrays and follow the same vocabulary and logic. By this argument we might want to introduce a concept of subtypes or templates or similar that can be adopted in a standard way in a particular subgroup of seqcol repos, say across reference genome repos. Other arrays and vocabularies can be standardized for e.g. graph genome seqcol repos, or whatnot. We could define some of these, while others are free to define their own, if needed.
The names and seqids arrays provide a simple map to convert between different names, e.g. from chr1 to seqid:64ga09 in one seqcol and from seqid:64ga09 to 1 in another. This requires that the names array also are unique within a collection, but I don't think anyone has any trouble with this requirement.
As the seqids array is generated as part of the digest algorithms, it is not "mandatory" in the sense we have discussed it with names, but it will still always be there and can be used as unique identifiers to the sequence records. The names array then does not need to be mandatory or we can say that in case a names array is not provided, the generated seqid array will be used also for names. Since the lengths array can also be derived from the sequences, we are able provide a deterministic function from a list of sequences to a seqcol digests, as I requested above.
We have a simple solution to generating the unordered digests I have argued for, namely to sort the arrays by the seqids array prior to calculating the top-level digest. It does not really matter much whether this is useful or not, as it will be simple to add this to the spec given my suggestion, and there are plenty of other arguments for my idea independent of the order issue. Thus, we can also provide a deterministic function from an unordered set of sequences to a seqcol digest.
Having globally unique identifiers for sequences (where equality comparisons are dependent on a particular template that standardizes the arrays and vocabularies) allows for network-style analysis, including semantic web searches etc, at the level of individual sequences.
As illustrated above in the BED analysis example, seqids are usable in all scenarios and file formats that already support a sequence name. There is also no loss in information replacing a sequence name with the seqid, as the reverse mapping is easily available.
The seqids might also be advantageous to use in the refget reverse lookup context (I have not followed this discussion closely). There will in any case be a one-to-many mapping between each refget digests and the corresponding (non-coordinate-system) seqid digests that might be taken advantage of in certain scenarios (I do not have any examples at the top of my head).
In the above writeup I at least aimed to add namespaces to all identifiers, so that they all follow the CURIE format. This adds compatibility with identifiers.org and n2t.net and is generally a good idea, I believe. This is of course independent of the rest of my idea, and I would in any case argue for providing CURIE identifiers. With the current idea CURIES makes it easier to combine different types of identifiers in a readable way. For simplicity and readability, I would thus argue that the namespaces should be always be included, also in the seqids array, even though the namespace part is redundant. I am of course always open for arguments in other directions. I believe this is also an issue where the input from the rest of GA4GH is relevant.
We have (at least theoretically) solved the chr1 vs 1 debacle once and for all!
11 arguments should be enough but if you want another one: the idea is somewhat elegant, isn't it?

EDIT: 13. I forgot to add that total duplications are not by definition possible with this idea.

nsheff commented 2 years ago

Thanks for writing this @sveinugu. I can't claim to follow your arguments completely, but I think I understand some of the main points. After reading through this and Rob's function suggestion above, I can add my thoughts:

My main conclusion is that: I think the canonical way to assign sequence names should be simply number them from 1 to X, and that's it.

My position comes from these basic points:

refget digests are unique, computable identifiers for sequences. When we have sequences and need unique identifiers, we should use refget digests.
'names' are human identifiers of sequences. they usually have some information, like chromosome number or something else, and are easy for a human to grok, unlike refget identifiers.
The cost of names being easy to grok and created for humans is that they may not be globally unique or deterministic. That's OK -- that's why we have the refget digests. the point of the names is for humans to refer to their sequences within a limited context, so they don't have to be globally unique. they just have make sense to the human who is using them. So the idea of two identifiers, one globally unique and computer-generated (refget digest), and the other locally unique (within a collection) and human oriented, makes sense to me.

I pose that names should be required to be unique within a collection, and should also be mandatory. In this case, the possible scenarios of what a user may have with names/sequences are:

We have sequences and names. This is no problem, the typical use case.
We have names but no sequences. This is the case of a coordinate system. Also no problem, since sequences are optional; names are required and must be unique, which is true of a coordinate system.
We have sequences but no names. Here's where we're having issues.
We have neither sequences nor names. Then, we have no sequence collection, right?

So, the issue is: what do you do when a human hasn't provided names for sequences?

Well:

I don't see the logic in creating another deterministic sequence identifier (seqid) to accommodate sequenceless entries. This is already accomplished by length, which is the only valuable information that would go into creating such an identifier anyway. This is a major complexity increase with, in my opinion, no benefit.
I don't understand the advantage of a function to generate unique identifiers based on sequence digests, because the point of the 'names' is that it is a human-friendly identifier.

I still think the best approach is to say "if you don't have names, you just number them" and that's it. In the event that a human has sequences but no names, then the names don't really matter. They are arbitrary and so we just number them to give them a unique identifier so the seqcol protocols can function correctly.

@sveinugu argues against this above here:

However, if one wants to create a name-invariant seqcol digest (if names are not available or are unimportant) and the order of the sequences are not given, one still needs to first order the sequences according to some deterministic algorithm.

But to me, this is missing the point. The sequences must be in some order. There's no such thing as a list of sequences that do not have an order, that's not the way computers or lists work. The seqcol is not handling order at all.

sveinugu commented 2 years ago

So I was asked in the meeting to provide a use case, and the best use case I have is my own. Even though this is a very specific use case, my thinking is that if it is possible to generalize the seqcol a bit in a principled manner to cater for my use case with no or only marginal costs to the existing functionality, I believe we would also probably cater for a range of other use cases that we haven't thought of yet.

So my core use case is in the context of developing tools for statistical analysis of BED-like files (which I call genomic track files). One of the more typical questions I would like to answer is this:

"Given two (or more) track files representing sets of intervals along a reference genome (say, two or more files of ChIP-seq peaks), are these sets of intervals co-localized (e.g. overlapping) more than expected by chance?"

The following Bioinformatics paper describes the implementational and methodological considerations for such type of investigations: https://doi.org/10.1093/bioinformatics/bty835

The most important consideration regards the question of what one means by "expected by chance", i.e. what is the assumed null model for the investigation? Statistical significance measures whether the observations of the data are more extreme than what is expected by the null model, so the realism of the assumptions that make up the null model are of uttermost importance. Too simplistic null models will result in a large number of false positives, where the statistical test will reject the null model and thus indicate a biologically relevant relation when there are none. In brief: If the assumptions that make up the null model are false (unrealistic), the statistical test is not answering the question we want it to answer.

So one type of statistical tests deployed to answer the question of co-localization are based upon randomization of the dataset(s) by shuffling the intervals to random positions along the coordinate system derived from the reference genome. So if the null model assumes uniform shuffling to all parts of the coordinate system, some of these randomized intervals will probably be moved to the centromere or other areas of the genome where they would not appear experimentally. Thus, the statistical test will in this case assume that some intervals "by chance" will appear in centromeres, while in fact all of the ChIP-seq peaks in the track data files by the very nature of the experimental and data processing methodology will be avoiding the centromeres. In this case, the ChIP-seq files will in fact be co-localizing (outside of the centromeres) more than expected by chance (which here assumes that peaks might as well appear in the centromeres as elsewhere). The null model will be correctly rejected by the test, however, the results are of no use. The only thing we have tested is, simply put, whether the data files contains ChIP-seq peaks, which is something we knew in advance.

So this long story is just to provide the background for why the gaps in the assembly (which includes the centromeres) are very important for this kind of analysis. Other restrictions on the randomization of the data should also be taken into account, but I will leave that out of this discussion and focus only on the assembly gaps. (Btw: @nsheff I believe one reason why we have differing views on the nature of coordinate systems is that your statistical methodologies, at least LOLA, are based on another way to restrict the null model, i.e. by having the user define a set of regions that define the universe for the analysis). The conclusion is this: If I want to represent the coordinate system as a sequence collection (and I do), such sequence collections would include at least the lengths and the assembly gaps, as well as some identifier (names, or possibly seqids if we decide to support this).

This represents an additional level of complexity compared to the way we have previously defined a coordinate system. For given that we know the exact sequence collection used to generate each file of ChIP-seq peaks, if these are all different patches of the same version of the reference genome (say: hg38), and even if we ignore the actual sequence contents, the assembly gap information would still differ (the patches are indeed filling gaps) for the same sequence across the seqcols defined by the data files. However, if we were to also include the the sequence array , the variation in seqcol content would be greater for each sequence across seqcols. To conclude: including only the assembly gaps represents a middle ground between a full sequence collection (that includes the sequence array) and a coordinate system as we have defined it (without the sequence array).

Furthermore, the coordinate system assumed by the null model should really be defined as a "largest common denominator" of the coordinate systems implied by each the files. One might, for instance, base the null model assumptions on a coordinate system with assembly gaps that represents a mean of the assembly gap information derived across all the files. Thus, the resulting coordinate system might be a "synthetic" one defined as a middle ground for analysis. In the same vein of thought, if the sequences differ in order across the per-file sequence collections, a "middle ground" coordinate system should have no order. Furthermore, if the names differ, the "middle ground" coordinate system should really not have any names, only name-invariant sequence identifiers.

Granted, there are ways of avoiding this complexity by e.g. specifying a canonical order or naming based on a specific sequence collection, but in that case the mathematically most correct solution will be unavailable due to limitations in the standard . Or, of course, one could decide to not follow the sequence collections standard at all. This will however limit adoption of the standard, in my case by one less developer.

In addition, specifying a particular sequence collection that defines canonical ordering or naming might have non-optimal consequences. For instance, in the domain, most BED-like files follow the non-standard UCSC-type names ('chr1' not '1'), even though the files themselves might have been generated using FASTA files based on standard naming. In such cases, one would probably as a data processing step carry out a string replacement operation to comply with the UCSC de facto standard. As a consequence, if a sequence collection identifier is attached to the file, it will end up out of sync with the actual names used in the file. Another alternative would be to define and use ad hoc "crossover" seqcols (where everything but the names stay the same), but these would probably not be registered in any seqcol repos. Lastly, if one wants to follow the sequence collection standard to the letter, one can opt for not carrying out such naming conversions in the data processing steps. However, this will reintroduce the need for the analysis tools to convert the names to follow a more or less arbitrarily selected canonical naming scheme (e.g. UCSC style). Avoiding such complexities was, I believe, one of the reasons why the UCSC-style names instead became the de facto standard at the level of data generation. As a conclusion, I believe it will greatly increase adoption of the standard in this domain if we provide a simple canonical way to automatically carry out such conversions (either between naming schemes or towards machine-operable identifiers, or both). Similar arguments can be made (and have been made by me previously) against canonical ordering, however I think the issues related to canonical naming are better at illustrating my concerns.

To summarize: I believe what my example mainly illustrates is that there will be many possible definitions of what is meant by a sequence in a sequence collection that are not covered by using the refget identifier to refer to a sequence. The advantage of the array-based solution is that it opens up for specifying whichever arrays are relevant in each context. However, providing limitations on ordering or naming will limit the landscape of potential applications that is opened up by the array solution. In my mind (and I'm sure you agree) I think we should weigh the benefits (as we percieve them) against the cost in functionality, complexity and/or time to overcome such limits in the standard itself. If the costs are very low, I believe we should by default move towards the most general standard.

  Based on the feedbacks from the meeting, I believe the consensus is that the costs are still too high, and I agree, but they are in my mind only slightly too high. So, I have an idea for how to further lower the costs and provide even more benefits, but that will have to wait until my next post.

sveinugu commented 2 years ago

So here’s the latest version of my idea:

First, here is a table-based representation of our current solution. In this example, I have added an assembly_gaps array with digests encoded according to some algorithm, as needed in the use case I described in my last post:

	names	lengths	sequences	assembly_gaps
seqcol:74dd74	d217f2	955d23	28b8b8	ffdd43
	chr1	150000000	refget:a39a48	gaps:d29f28
	chr2	120000000	refget:001f16	gaps:99f651
	alt123	100000	refget:a8f104	gaps:c5620c
	alt234	100000	refget:36bf12	gaps:c5620c

So one thing that can easily be done is to convert the array digests into array identifiers, let’s call them arrayids. The only thing needed to be done is to include the array names when calculating the array digests, so that two arrays with the same contents but different array names will not get the same id:

arrayid:5d10c1	arrayid:10f664	arrayid:fd047f	arrayid:771cd3
names	lengths	sequences	assembly_gaps
chr1	150000000	refget:a39a48	gaps:d29f28
chr2	120000000	refget:001f16	gaps:99f651
alt123	100000	refget:a8f104	gaps:c5620c
alt234	100000	refget:36bf12	gaps:c5620c

The arrayids will be globally unique, meaning that the same thing (array contents with a name) will always have the same id independently of in which seqcol it is used. In addition, the fact that the array names are unique within a seqcol (which is a technical requirement of our current solution) will make the arrayids also locally unique. If not, arrayids would have been globally, but not locally unique, and as such could not be used to identify a particular array within a seqcol (there could have been duplicate arrayids in a seqcol).

So similarly, one can easily also define row-wise globally unique identifiers as a digest of all the values in a row, including array names. Let's call these identifiers seqids, as above, but this time these identifiers are defined at the same meta-level as the arrayids, thus outside the table itself.

seqcol:f7a8c2	arrayid:5d10c1	arrayid:10f664	arrayid:fd047f	arrayid:771cd3
	names	lengths	sequences	assembly_gaps
seqid:64fa09	chr1	150000000	refget:a39a48	gaps:d29f28
seqid:fa0148	chr2	120000000	refget:001f16	gaps:99f651
seqid:3100f3	alt123	100000	refget:a8f104	gaps:c5620c
seqid:590f6a	alt234	100000	refget:36bf12	gaps:c5620c

Since these identifiers are globally unique, they can be deterministically sorted and thus give rise to order-invariant sequence collections (with unique identifiers).

In order for the seqids to also function as locally unique identifiers, one would need some additional constraint to make sure that totally duplicated rows are not allowed. Here is the constraint I suggest we require:

Totally duplicated rows are not allowed within a seqcol.

So, in contrast to my last suggestion, where I suggested we require that there should be no duplications excluding the names array, this new requirement will not pose any constraint on the majority of use cases, because:

The values of the names array must be unique, if present.

Thus, adding a names array will provide the uniqueness that is required by the first rule. As several people have argued: most sequence collections will have unique names. In most cases these two rules would represent the same level of constraints as a rule that a unique names array is mandatory.

However, formulating the constraints with these two rules in addition allows for name-invariant sequence collections by just removing the names array, given that this will not introduce any totally duplicated rows. This is the case in the current example, where removing the names array will still not create any totally duplicated rows (due to the sequences array):

seqcol:dd8a17	arrayid:10f664	arrayid:fd047f	arrayid:771cd3
	lengths	sequences	assembly_gaps
seqid:3f76aa	150000000	refget:a39a48	gaps:d29f28
seqid:880c11	120000000	refget:001f16	gaps:99f651
seqid:caa71d	100000	refget:a8f104	gaps:c5620c
seqid:633382	100000	refget:36bf12	gaps:c5620c

So the take-home message is this: This idea allows for globally and locally unique as well as fully configurable sequence identifiers (where the configuration is based on which arrays are included in the seqcol) without any practical negative consequences for the large majority of use cases. It also allows for both names-invariant and order-invariant sequence collection identifiers, based on the rule that there should be no totally duplicated rows in a seqcol. The idea also allows for globally and locally unique array identifiers, which is probably also useful.

sveinugu commented 2 years ago

Sidenote 1: on total duplications for egde case scenarios (added here for logical completeness, can be skipped):

The way I see it, for a seqidto be globally unique means that the same concept of a sequence will have the same identifier, independent of the seqcol within which it is found, or its placement within that seqcol.

If we from the last table want to remove also the sequences array, the last two rows will become total duplicates. In this case, the user will need to figure out a way to add uniqueness which is not dependent on the placement of the row within the seqcol. Adding an array with values that point to an external global reference point would solve the issue, as would removing all but one of the duplicated rows.

Hence, there exists a possible third requirement:

The values in a row must not be dependent on any arbitrary placement of the row within the sequence collection.

Here, dependency is meant in the statistical sense. If there are external variables that explains the values, which is the case for most sequence names (e.g. 'chr1' or '1' is the largest chromosome in the human genome), then the requirement still holds.

The main consequence of this rule is that about the only thing one should not do is this:

Given a set of unnamed sequences with no apparent order, one should not use the arbitrary index of the sequences as the default name, which was suggested above. This will ruin the global uniqueness and hence also the possibility for order-invariant seqcol ids (sorting of seqids will no longer be deterministic).

The natural solutions given one has a set of unordered sequences is either: 1) remove any duplicates and create a seqcol that is invariant of both names and order, or 2) create an external registry of names (or a FASTA file) and use those to define the names array. If this is done before creating the seqcol, then the duplicates are no longer "the same concept of a sequence" in the view of the seqcol standard.

(Edit, removed the possibility of canonical sorting, as this will not solve the problem. Total duplicates are per definition "the same concept of a sequence" and should always have the same identifier.)

However, as has been argued by others:

This is a very edge case scenario, and I don't believe it needs to enter into the standard itself. It is a problem for advanced users with use cases that are out of scope of the standard, and they will probably figure something out anyway.

sveinugu commented 2 years ago

Sidenotes on alternative seqcol digest algorithms that I do not propose (can definitely be skipped)

Sidenote 2a:

The suggested redefinition of arrayids to include array names has as a consequence that one could define the seqcol digest based on only the arrayids and not the array names. This will have the same degree of information content as the current definition. However, for convenience, I believe the seqcol digests should in any case also include the array names (as we are currently doing), even if these are already included as basis for the arrayid digests. The array names are in a sense redundant, but this redundancy also does not cost much and might be beneficial.

Sidenote 2b:

Given that seqids digests are now auto-calculated alongside the arrayids. If so, one could possibly define a seqcol digest only based on the seqids instead of the arrayids. However, this will introduce two different digests/identifiers of the same entity (one based on arrays/columns, the other based on rows). I see no need for this. The current array-based digests should fit the bill.

nsheff commented 2 years ago

I understand the use case above about statistics -- but, to me, this is solved by the gaps array, not by a redefining a row-wise identifier.

I'm trying to condense all this down into what you're actually recommending, and I think it's this: You advocate for:

names should be unique, but only if present
names array should not be mandatory

If we do it like that, this satisfies you, correct? Because it seems to me you can then build all the things you are talking about here on top of the seqcol implementation. So, the follow-up question is: are you sure there's no way for you to build what you describe under the constraint that names are mandatory?

Where you specified a need for nameless collections was here:

Furthermore, if the names differ, the "middle ground" coordinate system should really not have any names, only name-invariant sequence identifiers. ... Granted, there are ways of avoiding this complexity by e.g. specifying a canonical order or naming based on a specific sequence collection, but in that case the mathematically most correct solution will be unavailable due to limitations in the standard

I believe in this case you could simply name the sequences according to the way you've defined seqid. Create your row-wise sequence representations, including the length and the gaps identifier, and digest those to create the row-wise identifier seqid. Now, just set the names of the sequences to that. If I understand correctly, this solves the problem and also gives you the "mathematically most correct solution". It is also much simpler than the approach of defining seqid rownames or arrayid colnames, etc.

So, correct me if I'm wrong, but I believe I have understood your use case and your argument, and I still think it is possible to solve that even with maintaining the constraint that names be both unique and mandatory. I think given the computational benefits and simplifications we gain downstream from assuming names to be unique and present, it is worth enforcing this constraint.

nsheff commented 2 years ago

It strikes me that my proposal above is kind of a mix between what @daviesrob suggested and what @sveinugu suggested. Rob's suggestion was default names based on a the refget digest, but this fails if you lack a sequences array, and also doesn't satisfy the uniqueness constraints that sveinung needs. Sveinung's proposal generates unique row identifiers that satisfies his needs, but put them into a new concept of seqid instead of into the names array.

So, just combine the two -- generate the seqid, but don't put it in a separate column, just put it in the names array. After all, we've defined them as a "unique, mandatory identifier for a sequence" and isn't that exactly what the seqid you propose is?

sveinugu commented 2 years ago

It strikes me that my proposal above is kind of a mix between what @daviesrob suggested and what @sveinugu suggested. Rob's suggestion was default names based on a the refget digest, but this fails if you lack a sequences array, and also doesn't satisfy the uniqueness constraints that sveinung needs. Sveinung's proposal generates unique row identifiers that satisfies his needs, but put them into a new concept of seqid instead of into the names array.

So, just combine the two -- generate the seqid, but don't put it in a separate column, just put it in the names array. After all, we've defined them as a "unique, mandatory identifier for a sequence" and isn't that exactly what the seqid you propose is?

First of all, my idea is not anymore to put the seqid into a separate column, but to provide these in a higher level view at the same level as the current array digests (just row-wise instead of column-wise). So the columns/arrays would stay the same as you envision them. However, instead of adding a restriction on the names column so that names can be used as unique local identifiers, I suggest to make the standard even more general by placing the identifiers at the machine-interpretable level which is were I believe they belong. Then, we can keep the names array semantically clean as only providing human readable content. So basically we will be saying that in view of the standard, we don't care whether you name a sequence "chr1", "1", or don't provide a name at all. Instead we provide this mechanism for uniquely identifying a sequence in a machine-operable way, together with a mechanism for dynamically defining the scope of what you would like your identifiers to refer to (by adding/removing arrays). And if you really want to use the names as local identifiers, we will guarantee that they are unique if present.

Also, I might not have conveyed the full argument for my seqid suggestion, it is not just for my use case, but I think in general having a way to uniquely identify a row in a sequence collection would be highly useful in many other scenarios (not least within the implementations themselves). One such possibility is to use the seqid for naming conversion. Say that you want to convert from a "chr1..." based sequence collection to a "1..."-based. So given that the arrays are the same, you could easily create name-invariant sequence collections for both (by removing the names array). Then the seqids of these two "middle ground" sequence collections would match and could be used to map sequences from one collection to the other, independent of order and naming.

So following your arguments, you could say that one in that case could just define a new names column for these middle-ground sequence collections and it would technically work. However I feel we would then pass an opportunity to make the sequence collections generally useful to a range of use cases that we will not have a full overview of (but which I am sure we can brainstorm some examples of). What we would leave out is then:

General machine-operable mechanism for providing row-wise globally unique identifiers that can be configured in scope (by adding/removing arrays)
As such identifiers are part of the standard and must be implemented, one can depend on them downstream in a machine-operable way
These identifiers will be globally unique, while names identifiers will only be locally unique.
Very semantically clean way to provide name-invariant sequence collections (just leave the names array out)

I believe what is needed for the implementations to support this is simply to add an additional endpoint variant that provides the top-level view in a row-wise fashion with seqids, as an alternative to the current column-wise one. API-wise, this can be done by simply adding a parameter row_wise (or similar) to the top-level endpoint. The rest would be the same.

andrewyatz commented 2 years ago

I also might be mis-understanding the proposal here but I feel to a point there is a circular argument developing here since we come back to names and lengths being attributes of sequences which could be used as an identifier. In addition we now have gaps. As a genome provider I don't believe we actually know where those gaps are. This maybe back in the original AGPs but as we move more towards T2T based assemblies, these gaps are not going to be a unique identifier for elements within the collection.

Perhaps the bit I cannot quite fathom is:

One source of differences are the way providers have post-processed sequence upon submission
Another source of differences are the unique names assigned to sequences within a collection
Sequence order may also be different meaning this is another source of differences

As a user of a genome/sequence collection I can't control any of the above three sources of differences. Also I don't see how any system eliminates this when it is partly based on those major sources of differences. I also admit this might be me just not quite getting the thrust of the case for support for seqids or how they're being generated

nsheff commented 2 years ago

there is a circular argument developing here since we come back to names and lengths being attributes of sequences which could be used as an identifier.

Right... correct me if I'm wrong, but the seqid concept is in fact identical to the original idea that I called "Annotated Sequence Digests", which is actually how I originally implemented sequence collections. The term "annotated sequence digest" is meant to convey the difference from an original refget sequence digest, which digests the sequence only, while an Annotated Sequence Digest (ASD) digests the sequence with its annotations (other metadata columns).

Here's the ADR for where and why we decided not to do it that way: https://github.com/ga4gh/seqcol-spec/pull/14/files

I think @sveinugu is proposing to include both types of digests in the sequence collection, which is a bit different from the original discussion, where we were deciding on one or the other. It would basically combine the old and new ways of computing the digests. So, a level 1 secol in total would look something like this (ASDs == seqids):

{
  "lengths": "4925cdbd780a71e332d13145141863c1",
  "names": "ce04be1226e56f48da55b6c130d45b94",
  "sequences": "3b379221b4d6ea26da26cec571e5911c",
  "ASDs": ['ca1cd99...',   '22c090c...',  '426d38c...']
}

As Sveinung proposed, these could be served via different endpoints. Some thoughts:

The ASDs included like this will have the backwards compatibility problems that we were trying to avoid -- but only for the ASD entries, the array digests stay clean. Thus, this will only cause problems for analyses that require the ASDs. What this means is, if you add another array, all your ASDs change, but the array digests don't.
Including the ASDs like this adds no value to the uniqueness of the level 0 digest. In fact, including ASDs at this level will break the current algorithm because they are not singletons, but arrays, which is currently not allowed at this level. This brings up the point that these digests could be computed and stored, but not used to determine the level 0 digest. I would suggest that, if we include ASDs at level 1, we not use them in the calculation of the layer 0 digest. So, ASDs/seqids will need to become a special name at this level that is treated differently from the others.
Including ASDs will break the structure that each layer 1 element is a singleton, representing an array of the same length. This will break the nice property that the layer 1 object size is independent of the size of the collection. The 2-endpoints idea solves this; either making them mutually exclusive, as sveinung suggested, or having an ednpoint that returns the level 1 collection with ASDs and another that returns it without ASDs.
I believe most applications will not use these ASDs. But, it may not be too difficult to include them as an option for use cases that would use them.
I don't believe there would be any problem adding these ASDs on top of a basic sequence collections implementation for a use case that required them, provided we somehow allow for non-digest-affecting collection elements. But, this could all be handled outside the seqcol spec as an implementation detail, I think.

I think the big-picture point here is that we have to distinguish between these two things:

what elements and structure is used to compute the digests.
what elements and structure is available for retrieval via an endpoint.

These are not the same thing. For example, we once discussed removing the length from the digest calculation, because it doesn't confer any additional identifiability, given the sequences. We decided it makes more sense to keep it in, for consistency, because the sequence array is not always present. Even if we had decided to remove it from the digest calculation, the server could still store and return those values. In this case, we are facing a similar situation. From the perspective of the digest calculation, the ASDs add no uniqueness. The only value they could add is as a return value from the server, a reshaping of existing data. There are probably lots of other reshapings of existing data that could be useful for various specific downstream analyses. Should we build all of them into the spec? Or should we just make it possible to enact all of these things easily on top of the spec, for the use cases that would require them?

What would we need to do to enable these ASDs within the current spec?

Allow a server to return additional layer 1 elements that are not included in the digest algorithm. Actually, you could already do this.

The cost of making this a part of the specification would be:

required implementation of the ASD endpoint (or changing the seqcol endpoint to include ASDs)
required duplicate digest computation, since we have to digest them in both ways
required new tooling to differentiate non-digested vs digested elements stored with a seqcol
required specification for what property names are reserved for non-digested elements, or some way to specify that.

I do think this is useful. I do not know if it's worth making this a universal requirement. I also do not think requiring the names array will break the ability to do the ASDs. I think the implementation of ASDs can simply generate numeric indexes to satisfy the requirement for the underlying sequence collection, and then simply ignore the names array for ASD digests.

sveinugu commented 2 years ago

I also might be mis-understanding the proposal here but I feel to a point there is a circular argument developing here since we come back to names and lengths being attributes of sequences which could be used as an identifier. In addition we now have gaps. As a genome provider I don't believe we actually know where those gaps are. This maybe back in the original AGPs but as we move more towards T2T based assemblies, these gaps are not going to be a unique identifier for elements within the collection.

So I am not really competent when it comes to the technology, process and file/data formats related to assemblies. My probably naïve view of assembly gaps is in its simplest form the N characters of the sequence. So for simplicity, in my use case I would say that two coordinate systems (and sequences) are the same if they have the same Ns, without regards to the other characters.

sveinugu commented 2 years ago

Right... correct me if I'm wrong, but the seqid concept is in fact identical to the original idea that I called "Annotated Sequence Digests", which is actually how I originally implemented sequence collections. The term "annotated sequence digest" is meant to convey the difference from an original refget sequence digest, which digests the sequence only, while an Annotated Sequence Digest (ASD) digests the sequence with its annotations (other metadata columns).

Yes, I believe you are right. I was not part of those discussions, but I came to the same idea through a bunch of trial and error trying to provide solutions for some limitations of the array-based variant, namely the ability to provide order-invariant and now also name-invariant sequence collections. I also like the name “Annotated Sequence Digests” (which was new to me now).

Here's the ADR for where and why we decided not to do it that way: https://github.com/ga4gh/seqcol-spec/pull/14/files

I think @sveinugu is proposing to include both types of digests in the sequence collection, which is a bit different from the original discussion, where we were deciding on one or the other. It would basically combine the old and new ways of computing the digests. So, a level 1 secol in total would look something like this (ASDs == seqids):
{
  "lengths": "4925cdbd780a71e332d13145141863c1",
  "names": "ce04be1226e56f48da55b6c130d45b94",
  "sequences": "3b379221b4d6ea26da26cec571e5911c",
  "ASDs": ['ca1cd99...',   '22c090c...',  '426d38c...']
}

So my proposal would be to add an additional, independent object which is just the array of the ASD, which will take no part in the existing structure.

As Sveinung proposed, these could be served via different endpoints. Some thoughts:

The ASDs included like this will have the backwards compatibility problems that we were trying to avoid -- but only for the ASD entries, the array digests stay clean. Thus, this will only cause problems for analyses that require the ASDs. What this means is, if you add another array, all your ASDs change, but the array digests don't.

I agree, but the way I think of that is that it is really an advantage. Adding an array is really redefining the conceptual model of what you consider to be an Annotated Sequence, and thus it’s digest should change correspondingly. Thinking of the digest as a globally unique identifier, if the concept of a sequence that you refer to change, the identifier should also change to reflect that. For instance removing the names array will generate a concept of the sequence which is dependent only on its functional characteristics, and the ASDs of such a collection will be machine-operable identifiers that can function as a replacement of the humanly readable, but error-prone, "chr1, chr2,…” or “1, 2,…” names.

Including the ASDs like this adds no value to the uniqueness of the level 0 digest. In fact, including ASDs at this level will break the current algorithm because they are not singletons, but arrays, which is currently not allowed at this level. This brings up the point that these digests could be computed and stored, but not used to determine the level 0 digest. I would suggest that, if we include ASDs at level 1, we not use them in the calculation of the layer 0 digest. So, ASDs/seqids will need to become a special name at this level that is treated differently from the others.

Or not include them in this structure at all, but instead provide an alternative and independent object/endpoint.

Including ASDs will break the structure that each layer 1 element is a singleton, representing an array of the same length. This will break the nice property that the layer 1 object size is independent of the size of the collection. The 2-endpoints idea solves this; either making them mutually exclusive, as sveinung suggested, or having an ednpoint that returns the level 1 collection with ASDs and another that returns it without ASDs.

I believe most applications will not use these ASDs. But, it may not be too difficult to include them as an option for use cases that would use them.

I agree.

I don't believe there would be any problem adding these ASDs on top of a basic sequence collections implementation for a use case that required them, provided we somehow allow for non-digest-affecting collection elements. But, this could all be handled outside the seqcol spec as an implementation detail, I think.

I think the big-picture point here is that we have to distinguish between these two things:

what elements and structure is used to compute the digests.

what elements and structure is available for retrieval via an endpoint.

I agree. My idea is to add the ASDs as an endpoint output only (2). But a mandatory one. The reason for this requirement is that, if we assume most reference genome providers will adopt the sequence collection standard, then having the ASDs also available from all of them will open up use cases to researchers and developers that will not available otherwise, even if one designs one own functionality on top of the sequence collections standard (unless one manages to convince all of them to implement the same additional functionality). So for me, this is a pragmatic issue more than a theoretical one. So one can easily implement search/indexing tools externally, but only if these ASD identifiers are available from the databases. Otherwise, one would need to replicate the full database contents in order to calculate these digests.

(snip) There are probably lots of other reshapings of existing data that could be useful for various specific downstream analyses. Should we build all of them into the spec? Or should we just make it possible to enact all of these things easily on top of the spec, for the use cases that would require them?

I believe the row-wise digests is in a special category and not comparable to “lots of other reshapings”. It is the main logical alternative to the array-based digests.

What would we need to do to enable these ASDs within the current spec?

Allow a server to return additional layer 1 elements that are not included in the digest algorithm. Actually, you could already do this.

The cost of making this a part of the specification would be:

required implementation of the ASD endpoint (or changing the seqcol endpoint to include ASDs)

required duplicate digest computation, since we have to digest them in both ways

I agree.

required new tooling to differentiate non-digested vs digested elements stored with a seqcol

required specification for what property names are reserved for non-digested elements, or some way to specify that.

I still don’t get this part.

I do think this is useful. I do not know if it's worth making this a universal requirement. I also do not think requiring the names array will break the ability to do the ASDs. I think the implementation of ASDs can simply generate numeric indexes to satisfy the requirement for the underlying sequence collection, and then simply ignore the names array for ASD digests.

Using numeric indexes as ‘names’ makes the ASD dependent on the order of the sequence within the collection, which breaks the possibility to use this to identify equality of Annotated Sequences between collections. Instead I propose that names are not mandatory, but that the ASDs are, so one could always use ASDs to uniquely identify a row within a sequence.

EDIT: i misread you. I see now that you explicitly suggest to ignore the names array when generating the ASDs. I believe there is value to both including and excluding the names array when generating ASDs, which is why I suggest that ASDs are always created from all arrays, but that the sequence collection standard also allows the removal of the names array altogether.

nsheff commented 1 year ago

Defined now in: https://github.com/ga4gh/seqcol-spec/issues/50

nsheff commented 1 year ago

Now that we have the inherent attribute, we have defined on a per-schema basis, which attributes should be included in the string to digest.