Closed: nsheff closed this issue 6 months ago.
One comment that came up in our discussion (thanks @jmarshall) is that perhaps we should relax the constraint of requirement for name, length, and topology.
I originally had everything required, then dropped 'sequence digest' from the required list because I had a use case where that was useful; I suppose the others are still listed as required simply because that's how I started. But thinking about it more, I believe this is right: there are potentially useful reasons to make a "sequence collection" that has nothing but sequence names, or a set of sequences without names, or something like that, and then you'd be able to use just the subset of seqcol functionality that requires only that aspect of the annotation.
So, I think it's right that going forward, there's no reason we can't just make everything optional.
During the meeting I was questioning the reasoning behind including the topology in the digest. The basic argument is that, regardless of the topology stated in the collection, the representation of the sequence is still linear; topology is an attribute of the biological sequence that could be held in the metadata rather than in the digest.
@rishidev pointed to minutes from last year's (2019) meeting, which discuss this issue. One point seems to be that adding the topology to the digest would enable "recording" that a sequence was analysed as a circular sequence, so it should be recorded in the associated sequence collection. To me this seems like analysis metadata, though.
@yfarjoun @andrewyatz Can you recollect the rationale behind adding topology to the digest?
> So, I think it's right that going forward, there's no reason we can't just make everything optional.
A side question is: when a field is missing (or empty) does it have to be empty in all the elements of the collection? For example: can I have a collection where the sequence digest is missing in one or a few of the elements?
The issue was (and still is) that the same sequence could be included in a sequence dictionary in either linear or circular topology, but the choice can affect how downstream tools will understand the data.
Originally we thought of putting it into the sequence (refget) itself, but that was deemed inappropriate since it was "metadata" not information about the sequence itself.
Thus, in order to be able to distinguish a dictionary that has MT/linear from MT/circular, we have to include the topology somewhere...it could be in the sequence collection, or it could be in an additional layer that embeds and annotates a sequence collection... we could include an annotation wrapper that embellishes the collection and its elements with arbitrary tags. That would be the most flexible solution and it would not make the topology a special case.
> So, I think it's right that going forward, there's no reason we can't just make everything optional.
>
> A side question is: when a field is missing (or empty) does it have to be empty in all the elements of the collection? For example: can I have a collection where the sequence digest is missing in one or a few of the elements?
I believe that every sequence should have all four elements we've indicated here; or, to be more specific, it will have a name, length, and checksum, and topology is assumed to be linear unless specified otherwise.
One possible use case for sequence-less seqcolls:
Say you only have the chromosome and length information for a bunch of reference genomes, which would be the case for instance in a default installation of Galaxy. I think it would then make sense to create a database of these, making use of the seqcol API mostly for its comparison functionality.
Generally, I believe, in line with @nsheff, that making elements optional would open up a range of functionality useful for various "real-life" challenges, especially for issues related to track analysis, which both @nsheff and I have experience with (e.g. https://doi.org/10.1093/nar/gky474).
As to the issue of topology (and other aspects discussed here), I think we really need to provide a clear definition of what a sequence collection is. I see at least two directions here:
1. A sequence collection is just that: named and ordered sequences (i.e. character arrays) collected together, and not much more. From this point of view, topology is metadata.
2. A sequence collection is a model used to represent reference genomes and other very similar concepts that are typically represented by a collection of named, ordered sequences. From this point of view, topology is a central aspect, as a linear and a circular version of the same sequence are different models.
Also, the second way of thinking could extend towards graph genomes, at least to a set of sequences describing a sample set of personal genomes from which a graph genome can be generated. As mentioned elsewhere in this project (I cannot remember where), GRCh38 really does fall into this category due to the alternative sequences. Under this view, a seqcol might also cover a set of coordinate systems without sequences. The obvious difficulty then is to decide where to draw the line of what is in scope or not.
I personally lean towards the second point of view: that topology is important.
Also, the role this will play with graph genomes is an interesting one. For me, this depends on what the unit of work of a graph genome will be. Do we want to refer to common builds/paths through a graph of haplotypes, or something deeper? I think those are open questions for the moment; this standard looks to resolve the existing issues with reference genomes, which will remain linear for some time to come and continue to play a part in many clinical diagnostic pipelines.
We've just had a lot more discussion on this point in the refget call. We have gone back and forth; in the end I think the consensus at the moment is leaning toward option 1, that is, not including topology as fundamental to a sequence collection, and then building the topology metadata on top of that as an extra layer of information. But we are still having some debate, and we see compelling arguments in both directions.
Arguments for not including topology in the digest are:
Arguments for including topology in the digest are:
What other information is there that is of the same class as topology? One thing that came up is a sequence mask. A mask is not inherent in a sequence, but annotates a sequence. The difference here is that a mask annotates individual elements in the sequence and can be represented by regions, for example; the topology is an annotation of the entire sequence.
But another thing that annotates a primary sequence is "primary" vs "secondary". We previously discussed annotating sequence collections with a way to flag which elements are core elements, like the assembled chromosomes in the human genome, and which are secondary, like unassigned contigs or other random sequences. This feels like the same type of information as topology.
Here's a proposed solution
Latest solution to this gets my vote.
> We've just had a lot more discussion on this point in the refget call. We have gone back and forth; in the end I think the consensus at the moment is leaning toward option 1, that is, not including topology as fundamental to a sequence collection, and then building the topology metadata on top of that as an extra layer of information. But we are still having some debate, and we see compelling arguments in both directions.
>
> Arguments for not including topology in the digest are:
>
> - topology is not inherent to a sequence; the sequence itself in either linear or circular form is the same, at least as stored in a computer. The digests are really describing the sequences as we store them in the computer.
I guess this comes down to what the goal of seqcol is: are we trying to just describe the sequences as they are currently stored, or do we try to improve upon the current state? As a latecomer, I am not exactly sure of what the goal is, but personally I would think improving on the current state is a worthy goal. So a scenario (which is not very far-fetched) is that some analysis system that does handle circular and linear genomes differently (as all really should) would be able to, given a bunch of digests, grab the correct sequences and preprocess them according to topology, without knowing what they represent biologically. So I would say topology should have been part of the sequence info from the get-go (e.g. encoded in FASTA), and we now have a possibility to remedy that lack.
> - Is there a real-world case where you'd have a linear and a circular version of the same sequence? MtDNA for example is always circular.
Probably not, but I think a more relevant scenario is the one I mention, as the topology information is crucial for managing the sequence correctly, even though many tools don't care about this. Without this, one needs to add more command-line parameters to the tool, and I think we already have plenty of those.
I do however think that we could try to make such a parameter more abstract, in order to support later extensions, such as "sequence_type" or similar.
> - the 'circular' vs 'linear' is a function of a particular analysis, not of a sequence. It is surely useful to annotate sequences with biological information like topology for a particular analysis case, but why would you want that inherent in the digest of the sequence collection.
I disagree; I don't think it is a function of the particular analysis. It is not a function of the sequence, if by sequence you mean the contents of a FASTA file. However, it is a function of the mathematical model of a DNA (or whatever) sequence, which is really what you need for analysis.
> Arguments for including topology in the digest are:
>
> - SAM headers include topology as a field. So, it would be nice to be able to query a server with a digest, and get all the required information back -- including topology. If topology is included as "metadata layer", it could still be returned by the server, but then you're not guaranteed to get the same thing back for a particular digest from different servers, because the topology information was not included in the digest.
The centrality of SAM to all things in bioinformatics makes this a heavy argument in my mind.
> Other metadata like topology?
>
> What other information is there that is of the same class as topology? One thing that came up is a sequence mask. A mask is not inherent in a sequence, but annotates a sequence. The difference here is that a mask annotates individual elements in the sequence and can be represented by regions, for example; the topology is an annotation of the entire sequence.
I'm a bit uncertain what a sequence mask would represent here. I assume it is not the same as the Ns in the sequence (a hard mask), but perhaps soft masks, like what is done with repeating regions (https://www.biostars.org/p/98042/)? The repeating regions are clearly metadata annotation (a track in themselves), while the hard mask is part of the sequence information (I think we all agree on including Ns in the sequences). So I would say that topology here is at the level of the hard masks, which we do include.
A side-note of relevance: one of the main issues creating lots of false positives in the analysis of track files is that most common formats (except GFF in theory, but not in practice) do not include mask-type information about which regions of the DNA the track is defined on. Most tracks are e.g. not defined in N-rich regions (e.g. centromeres) and other regions with a high number of short repeats. It matters greatly whether the lack of a region is due to technical reasons (e.g. that the current analysis window is within the centromere) or represents a real lack of some genomic feature. So this is essential information about the track file that should always have been included (in addition to an identifier of the reference genome, of course). It can be estimated from the reference genome, but only the creator of the track would know whether there are other constraints that would have made it impossible for a region to appear in certain areas. See e.g. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2438-1.
So I would argue that this is a similar type of scenario, where essential information is lacking from the sequence files due to historical judgements by someone. Topology is one of these types. Primary/secondary is another, as you discuss below:
> But another thing that annotates a primary sequence is "primary" vs "secondary". We previously discussed annotating sequence collections with a way to flag which elements are core elements, like the assembled chromosomes in the human genome, and which are secondary, like unassigned contigs or other random sequences. This feels like the same type of information as topology.
So what about sequences of types other than DNA? I assume we are supporting those too, but would it not be useful to know which kind of sequence we are talking about? Instead of having an ever-increasing list of terms for this, one could e.g. denote the possible characters used in the sequence, e.g. acgnt (in sorted order) for basic DNA sequences; there are plenty of other options out there: https://en.wikipedia.org/wiki/Nucleic_acid_notation. This string would be very different for protein sequences (if those are to be supported). It would be extremely useful for a parser to know the character set in advance, to check whether it can handle the sequence. Also, in case a specific set of characters is in theory supported, the lack of use of those characters in the sequence also carries information (similar to the track example above).
EDIT: GFF only supports leaving out bits at the beginning and end, not in the middle of sequences, but I have never seen this used anyway.
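The character-set idea above could be implemented very simply. A sketch (the function name `alphabet` and the lower-casing choice are my assumptions, not anything agreed):

```python
def alphabet(seq):
    """Return the sorted, lower-cased character set of a sequence,
    e.g. 'acgnt' for a basic DNA sequence containing Ns.
    Hypothetical helper, not part of any spec."""
    return ''.join(sorted(set(seq.lower())))
```

A parser could compare this string against the notations it supports before attempting to load the sequence.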
In discussion at the refget meeting, we came up with a few new insights on this. For example, following the proposal above, if we include topology outside of the core sequence collections as a standalone array... this leads to the idea of encoding the names, sequences, and lengths as arrays as well, instead of as properties of a sequence, bundled together. This has a few nice advantages, and some disadvantages as well.
I think a useful way to conceptualize the question is: what is the schema for the data included? We considered a few possible schemas. Here I'll try to summarize; it's easiest to show with a simple example of a sequence collection with 3 sequences.
Here, we use an array of objects, each with 3 properties: name, length, sequence. This is the way I've been thinking about it so far.
```python
seqcol = [{'name': 'chrUn_KI270742v1', 'length': '186739', 'sequence': '2f31c013a4a8301deb8ab7ed1ca1cd99'},
          {'name': 'chrUn_GL000216v2', 'length': '176608', 'sequence': '725009a7e3f5b78752b68afa922c090c'},
          {'name': 'chrUn_GL000218v1', 'length': '161147', 'sequence': '1d708b54644c26c7e01c2dad5426d38c'}]
```
A new proposal is, we can just think of it as 1 object, with 3 properties: an array of names, array of lengths, and array of sequences.
```python
seqcol = {'names': ['chrUn_KI270742v1', 'chrUn_GL000216v2', 'chrUn_GL000218v1'],
          'lengths': ['186739', '176608', '161147'],
          'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99', '725009a7e3f5b78752b68afa922c090c',
                        '1d708b54644c26c7e01c2dad5426d38c']}
```
This leads to the possibility of digesting these arrays separately. If we digest each one separately it looks like:
```python
{'sequences': '8dd93796fa0225e92eb159a8779f1b254776557f748f8bfb',
 'lengths': '501fd98e2fdcc276c47306bd72c9155489ed2b23123ddfa2',
 'names': '7bc90a07cf25f2f64f33baee3d420ad1ae5f442055280d43'}
```
Now, this is kind of nice because we get a sequence-only digest, and a name-only digest, etc. Maybe we would only digest the sequences, and not the names/lengths? Or maybe none of them.
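For concreteness, per-array digesting could look like this sketch. The join-with-comma serialization, the truncated SHA-512, and the function name `digest_array` are all illustrative assumptions of mine, not an agreed algorithm:

```python
import hashlib

def digest_array(items, delimiter=','):
    # Join items with a delimiter and hash the result. Truncation to 48
    # hex characters mimics the length of the example digests above, but
    # both the hash function and the delimiter are assumptions.
    joined = delimiter.join(items).encode('utf-8')
    return hashlib.sha512(joined).hexdigest()[:48]

seqcol = {'names': ['chrUn_KI270742v1', 'chrUn_GL000216v2', 'chrUn_GL000218v1'],
          'lengths': ['186739', '176608', '161147'],
          'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99',
                        '725009a7e3f5b78752b68afa922c090c',
                        '1d708b54644c26c7e01c2dad5426d38c']}

# One digest per attribute array, e.g. a sequences-only digest.
array_digests = {attr: digest_array(values) for attr, values in seqcol.items()}
```

Any subset of the arrays could be digested this way, which is the appeal: a sequences-only digest, a names-only digest, and so on.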
But regardless of whether we digest each array separately, another advantage of using separate arrays is that it seems to be nice for fitting in more information, like topology. So, for example:
```python
{'names': ['chrUn_KI270742v1', 'chrUn_GL000216v2', 'chrUn_GL000218v1'],
 'lengths': ['186739', '176608', '161147'],
 'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99', '725009a7e3f5b78752b68afa922c090c',
               '1d708b54644c26c7e01c2dad5426d38c'],
 'topologies': ['linear', 'linear', 'linear'],
 'priorities': ['primary', 'primary', 'secondary']}
```
This feels clean; the seqcol is just a bunch of named arrays, and you use whichever ones you need. Each array must be the same length. If we digest each array individually and then digest them together, then, for example, one seqcol in "flat form" could be:
```python
{'topologies': '',
 'sequences': '8dd93796fa0225e92eb159a8779f1b254776557f748f8bfb',
 'lengths': '501fd98e2fdcc276c47306bd72c9155489ed2b23123ddfa2',
 'names': '7bc90a07cf25f2f64f33baee3d420ad1ae5f442055280d43',
 'priorities': ''}
```
Another could be:
```python
{'topologies': '951c378c80dd043fe97bbb44c22bd7afb6931573905e3a13',
 'sequences': '8dd93796fa0225e92eb159a8779f1b254776557f748f8bfb',
 'lengths': '501fd98e2fdcc276c47306bd72c9155489ed2b23123ddfa2',
 'names': '7bc90a07cf25f2f64f33baee3d420ad1ae5f442055280d43',
 'priorities': ''}
```
We can immediately tell they have the same sequences/lengths/names, but one has topologies and the other doesn't. And of course, we can recurse to get whatever layer of information we need.
This seems appealing to me, but there are a few issues:
A solution to problem 1: we can introduce a new structural layer, say `rawseqcol`, that holds just the "Big 3": sequences, lengths, and names:
```python
{'rawseqcol': {'sequences': '8dd93796fa0225e92eb159a8779f1b254776557f748f8bfb',
               'lengths': '501fd98e2fdcc276c47306bd72c9155489ed2b23123ddfa2',
               'names': '7bc90a07cf25f2f64f33baee3d420ad1ae5f442055280d43'},
 'topologies': '951c378c80dd043fe97bbb44c22bd7afb6931573905e3a13',
 'priorities': ''}
```
In just 1 layer it looks like:
```python
{'rawseqcol': '5655f27ed3dd7b8bb7c108c4ddc562979f82bc294877988c',
 'topologies': '951c378c80dd043fe97bbb44c22bd7afb6931573905e3a13',
 'priorities': ''}
```
So, this adds the complexity of an additional layer in the object, but gives the ability to use that 'rawseqcol' digest independently, if you don't want an 'annotated seqcol' with all the other stuff. So now, you'd end up with a checksum for the raw sequence collection (without topologies), and this is what I already implemented in a demo server.
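The layered scheme lends itself to a recursive implementation: digest each array, then digest the resulting dict of digests, and so on up the levels. A minimal sketch, where the serialization format, the truncated SHA-512, and the function names are assumptions of mine rather than anything decided:

```python
import hashlib

def digest(s):
    # Illustrative digest function (truncated SHA-512); not the agreed algorithm.
    return hashlib.sha512(s.encode('utf-8')).hexdigest()[:48]

def digest_level(obj):
    """Recursively collapse a nested seqcol structure into one digest:
    arrays are joined and digested; dicts are serialized as sorted
    key=digest pairs and digested again."""
    if isinstance(obj, list):
        return digest(','.join(obj))
    if isinstance(obj, dict):
        flat = {k: digest_level(v) for k, v in obj.items()}
        return digest(';'.join(f'{k}={v}' for k, v in sorted(flat.items())))
    return obj  # already a digest string

annotated = {'rawseqcol': {'names': ['chr1'], 'lengths': ['1000'],
                           'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99']},
             'topologies': ['linear']}
top_digest = digest_level(annotated)
```

The same function works at any recursion level, so a server could hand back either the top-level digest or the intermediate `rawseqcol` digest, depending on what the client asked for.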
But this doesn't solve the second problem above. How do we define the metadata that sits alongside topology, like 'priorities', and 'masks' and things like this? Where does it stop? Is it fixed? An alternative could be to allow arbitrary "tags" with data, so it would look something like this. At the top level is a split into `rawseqcol` and `annotation` objects:
```python
{'rawseqcol': '5655f27ed3dd7b8bb7c108c4ddc562979f82bc294877988c', 'annotation': '9cxkw...'}
```
The annotation object groups all that "other stuff" (outside the "Big 3"), and looks like:
```python
{'annotation': [{'name': 'topology', 'value': ['linear', 'linear', 'linear']},
                {'name': 'priority', 'value': ['primary', 'primary', 'primary']}]}
```
It's now flexible, so 'topologies' is not hard-coded into the schema; you provide it by specifying the 'name'. So you have to name your object as a variable ("topology", "priority", etc.) and then provide its value, instead of hard-coding those possibilities in the schema. This is kind of nice because it can be used for any sort of future annotation. But it also explodes the possibilities: now you can make a checksum for whatever kind of annotations about your sequence you want, and we've lost a chance to standardize, so nobody's digests will match. It also makes it harder to define the checksum algorithm, since somehow that identifier will need to be included in the digest, instead of just inferred. So, this is starting to really complicate things, and I'm not sure it's worth it.
After thinking through these options for several hours, I come back to the core question: what elements are part of the identity of the sequence collection? I think all the things we've discussed here are valuable, and @sveinugu raised some really good points about how important topology, masks, and priorities are. But we need to separate the components that belong in the digest from those that just annotate it. The rest, while valuable, should be included somewhere, but does it really need to be digested? That's what we're trying to decide here; not general importance, just what belongs in the unique, global identifier.
I think we don't need to make "Sequence Collections" solve every problem. It can be a very simple thing, that will be easy to implement, easy to use, and very universal... and easy to extend with whatever other information eventually makes sense to extend with. Then we or others can build things that extend it. So, I think it's more important that it be extensible, simple, and re-usable than comprehensive.
On one hand, `name` is not much different from `topology`. The name of the sequence is not inherent in the sequence itself, either. So, what's the rationale for including it? Well, I think the name is really something separate: it's an identifier of a sequence. So, it seems appropriate to include it on those grounds, as an identifier. Furthermore, a huge number of analyses critically depend on names, and if they differ, then the analysis results differ. In a sense, by including name, we've already lost the "only include things inherent in a sequence" argument and conceded that sequence collections are not a pure abstract representation of data, but reflect analysis use cases. In that sense, the only argument against including topology is that its use is nowhere near as widespread or as important as name, and it's not part of the original FASTA format, so we often lack that information.
There's an argument that lengths shouldn't be included in the digest, either, since they're redundant with sequences. Providing them is one thing, but we could technically provide them in the API without including them in the digest. Given a set of sequences, the lengths are fixed, so why include them? Well, one reason is that we want sequence collections to be able to represent "Chromsizes" files, or coordinate systems, which consist of just names and lengths (no sequences). So, in this case, you'd have to include the lengths in the digest, because there are no sequences. To make a single digest algorithm be able to handle both 'chromsizes', as well as actual sequences, it seems to make sense to just require lengths to always be included. The alternative would be to come up with a separate digesting algorithm for coordinate systems...
In that case, maybe a coordinate system is names+lengths, and keep sequences separate:
```python
{'coordsys': {'names': ['chrUn_KI270742v1', 'chrUn_GL000216v2', 'chrUn_GL000218v1'],
              'lengths': ['186739', '176608', '161147']},
 'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99', '725009a7e3f5b78752b68afa922c090c',
               '1d708b54644c26c7e01c2dad5426d38c']}
```
Now, we can digest the sequences separately. I suppose this is a possibility. Then maybe our primary digests could be built from a tuple of `(coordsys, seqcol)`. When I think about it this way, topologies seem to fit better into the 'coordinate systems' data model, and then I might do something like:
```yaml
coordsys:
  names: [array]
  lengths: [array]
  topologies: [array]
  priorities: [array]
sequences: [array]
```
I'm still torn on this one. I think sequence names are practically widespread and useful enough to belong in the digest, but on topologies/priorities, I'm not really sold that they are practically useful or widespread enough to justify including in the core definition. These can be handled outside sequence collections. The advantage I see to excluding them is that in our core seqcol identifier set, there will be fewer canonical sequence collections, and so they will be more reusable, because we won't have the problem of "the identifier with the topologies included vs the one without". Because, as great as it would be if all seqcols included the proper topologies, many people will not bother, and the data doesn't always even exist, leading to multiple sets of near-equivalent identifiers in widespread use (which is basically where we are now). But -- including them means they can be used, at least, and maybe that opens some new use cases, without too much of a cost. So, I don't know.
If we include topologies, I don't see why we can't also include priorities.
If we include priorities, I want to include masks. But masks feel like a much bigger issue, and I'm afraid this causes the problem to balloon out of control. Therefore, I want to set the cutoff much smaller, and then deal with these other issues in the next project.
I like the simplicity of the independent arrays. Therefore, I would probably choose to represent a sequence collection like this:

```python
seqcol = {'names': ['chrUn_KI270742v1', 'chrUn_GL000216v2', 'chrUn_GL000218v1'],
          'lengths': ['186739', '176608', '161147'],
          'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99', '725009a7e3f5b78752b68afa922c090c',
                        '1d708b54644c26c7e01c2dad5426d38c']}
```
`lengths` is required. `names` and `sequences` are optional. `topologies` and `priorities` -- maybe. I would probably not re-digest everything, so we create the string-to-digest by concatenating array items with one delimiter, and arrays with another. The order would be lengths;names;sequence digests, so for example (here shown with arbitrary delimiters):
```
186739,176608,161147;chrUn_KI270742v1,chrUn_GL000216v2,chrUn_GL000218v1;2f31c013a4a8301deb8ab7ed1ca1cd99,725009a7e3f5b78752b68afa922c090c,1d708b54644c26c7e01c2dad5426d38c
```
Side note: one nice thing about digesting each one independently is that we only need 1 delimiter.
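The string-to-digest construction above could be sketched as follows. The delimiters are just the arbitrary ones from the example, and the helper name `string_to_digest` is hypothetical:

```python
def string_to_digest(seqcol, item_delim=',', array_delim=';'):
    """Concatenate array items with one delimiter and arrays with
    another, in the fixed order lengths;names;sequences. Omitted
    (optional) arrays are simply skipped."""
    order = ['lengths', 'names', 'sequences']
    return array_delim.join(item_delim.join(seqcol[k]) for k in order if k in seqcol)

seqcol = {'names': ['chrUn_KI270742v1', 'chrUn_GL000216v2', 'chrUn_GL000218v1'],
          'lengths': ['186739', '176608', '161147'],
          'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99',
                        '725009a7e3f5b78752b68afa922c090c',
                        '1d708b54644c26c7e01c2dad5426d38c']}
```

Note that with this fixed-order scheme, a collection with names but no sequences produces the same kind of string, just with one fewer array.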
Then, we start another, separate project surrounding extending that unit with additional information (topologies/priorities if not included, and masks, or anything else).
To add a few items from the refget meeting today:
Broad agreement that this is the right way to approach the problem; it offers the most flexibility to ourselves and implementors to bring in new data concerning sequences (considering the list has grown to include topologies, priorities, and masks, it's likely this will continue to grow).
A sliding scale was proposed of attribute importance roughly running like
A few solutions were suggested, but concerns were raised over the absence/presence of key fields and the flexibility/inflexibility of the digest, i.e. how do we ensure digests do not need to be recalculated all the time. Suggestions to avoid this included:
Here's an approach. Build a string-to-digest like this:
```
3,lengths,152,252,424,names,chr1,chr2,chr3,sequences,bc90a07cf2,aee3d420,72c915234c
```
The first number is the number of items in the array. All arrays must have that number of items. Before each delimited set of items is the identifier for the category (lengths, names, sequences). These must be in alphabetical order.
Now to make an identifier without sequence digests, you'd do:
```
3,lengths,152,252,424,names,chr1,chr2,chr3
```
To add in topology, you'd do:
```
3,lengths,152,252,424,names,chr1,chr2,chr3,topologies,linear,linear,linear
```
The advantage of this approach is that it's flexible with respect to what we can put in the string-to-digest, so it's backwards compatible if we add something in the future, since leaving one of the possible arrays out has no effect on the digest.
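A sketch of this encoding, assuming the comma delimiter and alphabetical category ordering described above (the function name `encode_arrays` is mine):

```python
def encode_arrays(arrays):
    """Build the string-to-digest: item count first, then each category
    name followed by its items, with categories in alphabetical order.
    All fields are comma-delimited, as in the examples above."""
    n = len(next(iter(arrays.values())))
    if any(len(v) != n for v in arrays.values()):
        raise ValueError('all arrays must have the same number of items')
    parts = [str(n)]
    for name in sorted(arrays):
        parts.append(name)
        parts.extend(arrays[name])
    return ','.join(parts)
```

Because categories are named and sorted, adding a new array (e.g. topologies) extends the string without perturbing how the existing arrays are encoded.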
With a second delimiter, we could avoid the need to put the number of items in the array. This is a more flexible approach as it would allow arrays to contain different numbers of items. Maybe not necessary for this particular use case, but it makes it more universal.
```
lengths,152,252,424>names,chr1,chr2,chr3>topologies,linear,linear,linear
```
or
```
lengths>152>252>424,names>chr1>chr2>chr3,topologies>linear>linear>linear
```
Couple of questions or comments
Some comments after the last meeting (which seems to have turned into more of a novel!):
@nsheff I really like the recursive seqcol implementation, and I believe we are almost there now.
Also, I think I agree on not including a string to signify which arrays are included (such as 'lns', for lengths, names, sequences) in the top-level identifier. For any use cases that would need this, it should be stored alongside the identifier as metadata.
@tcezard: In my mind, the important question regarding which arrays to support in the standard is whether it makes sense to include such information when using sequence collection digests as identifiers in some sort of applied service. Specifying which arrays to include makes it possible to tailor the uniqueness property of the digests to a particular use case. A simple way to think of such use is to ask whether it makes sense to sort some kind of data/metadata into buckets defined by a subset of the possible arrays (i.e., all the data linked to sequence collections with identical digests, as defined by a set of arrays, are put in the same bucket).
Example: as has been argued previously, one natural application would be to use seqcols to replace 'chromlen' files for track analysis (simple tabular files that include the sequence names and lengths). The current implementation would remove the need for such files completely, as their content would be directly available from a single API call (recursion level 1). For this type of use, the current idea would let one put all track files made from a single major version of the human reference genome (e.g. GRCh38) into one bucket, regardless of the patch version used to generate the track. This is because patch updates only update the sequence content, not the chromosome coordinates (i.e. names and lengths). This is how the genome reference metadata fields are typically categorized today, for instance in genome browsers.
EDIT: On further thought, I don't think the above is actually true. I believe some patches add alternative sequences. Also, placing previously unplaced sequences within a chromosome will remove those unplaced sequences from the full list, thus changing the digest. So in order to provide the required buckets using seqcol, one would need to define collections with only the core chromosomes. This is what is really meant by specifying e.g. "GRCh38" as the reference genome of a track file, as one typically does not care about the non-chromosome sequences (except perhaps MT). Supporting track files is something I believe is a very important use case for seqcol. It is also relevant for this (more or less randomly selected) new publication: https://f1000research.com/articles/10-268. My complete suggestion below would solve this issue.
A. sequence
This is obviously essential. Being able to bucket data/metadata according to the sequence content, regardless of the sequence names, is clearly useful, especially in the context of the UCSC vs. rest-of-the-world divide on chromosome names. We can probably come up with many examples of this, and you probably already have (I did not double-check with the use case document).
B. names + lengths
I have already explained the usefulness of this for track file analysis.
C. sequence + names (+ lengths)
This would be the most common way of bucketing e.g. reference genomes. Adding lengths (or other directly sequence-derived arrays) would not change the bucketing ability of the digest, other than defining a second set of buckets with the same properties (the only difference being whether lengths were added to the digest calculation or not). In such a context, being able to easily see the difference between the two types directly from the digest might be a plus, but one could also simply store this info as metadata, filter out one of the digest types as part of data import, etc. The same logic applies to the sequence + lengths option.
D. names (labels?) only
I feel that there should be some use cases for this, but I have a hard time coming up with scenarios where adding either the sequences or the lengths (or both) would not be a better choice. (Btw: should we rather refer to names as "labels"?).
E. lengths only
Don't think so.
A-E. core rules
From the above considerations of the 3 most "core" arrays, we could define the following two core rules:
i) Either sequences or names (labels?) or both must be present.
ii) lengths must always be present. (This would remove the "double bucketing" issue for at least the three core arrays.)
Rules i) and ii) result in 3 possible combinations of core arrays.
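The two core rules could be sketched as a simple validation function (a sketch of the proposal above, not spec language; the array names are assumed):

```python
def satisfies_core_rules(seqcol):
    # Rule i: at least one of 'sequences' and 'names' must be present.
    # Rule ii: 'lengths' must always be present.
    rule_i = "sequences" in seqcol or "names" in seqcol
    rule_ii = "lengths" in seqcol
    return rule_i and rule_ii

# The 3 allowed combinations of core arrays:
assert satisfies_core_rules({"names": ["chr1"], "lengths": [248956422]})
assert satisfies_core_rules({"sequences": ["d1"], "lengths": [248956422]})
assert satisfies_core_rules({"names": ["chr1"], "sequences": ["d1"], "lengths": [248956422]})
# Disallowed: lengths alone (rule i), or missing lengths (rule ii).
assert not satisfies_core_rules({"lengths": [248956422]})
assert not satisfies_core_rules({"names": ["chr1"], "sequences": ["d1"]})
```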
Rule ii) also has as a consequence that one would always be able to prepare for a full sequence download at the first recursion level, for instance by preparing some data structures at the appropriate lengths, or selecting an API for sequence retrieval based on size (e.g. allowing small sequence downloads from the seqcol API, but switching to another API for larger sequences).
Assuming these two core rules, let's continue with the other possible arrays (which would then never be found without the core arrays, something which makes total sense to me):
F. topologies
So we have discussed this back and forth for a long time. There are scenarios where such information is crucial for an algorithm, but one could also in those cases just require topology to be added as additional metadata. As topology is unlikely to change between different versions of the same genome reference (other than such information being present or not), it might not seem very useful to add topology (in addition to the core arrays) when defining the buckets; it would only work as a flag on whether topology info is being taken into consideration or not. However, I believe adding the possibility will not hurt much in the current implementation idea, and it would have the consequence that topology information becomes more easily available and would thus, in practice, also be used more by algorithms/services that can make use of it (assuming that adding topology info is often done through an optional tool parameter that most people just skip, which I believe is a reasonable assumption). Hence, just including topology as an optional array in the standard would most probably result in a slight increase in overall data/result quality in the life science field, which is not a bad argument, IMO.
G. priorities (sequence groups?)
I believe this would be same/similar to what in the NCBI assembly database is referred to as "Sequence role", in combination with "Assembly Unit", see e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_assembly_report.txt (from https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26). This basically represents a subdivision of assembly sequences into smaller groups:
In e.g. the human genome, the assembled-molecule/Primary Assembly combo contains the chromosomes, the assembled-molecule/non-nuclear combo contains the MT sequence, the alt-scaffold role contains the alternate sequences, and the unplaced-scaffold and unlocalized-scaffold roles are the unassembled sequences.
I believe such subdivisions are important information to have available in a machine-readable way, and having this information in a seqcol array is an easy way to support this. However, for identifier-based bucketing, adding such a subgroup category array would only work as a flag on whether such information is available or not across all sequences, which is of limited use. In contrast with the topologies array, adding the subgroup info to a sequence would probably not change how that sequence is handled by itself. The most natural usage scenario is to make use of such subgroups for filtering. However, in practice this is typically done prior to analysis, so that the sequence collection used in the analysis step would no longer match the full collection from which it came. Such filtering leads to genomes such as the b38 genome (containing no alternate sequences, I believe), used for mapping, variant calling, etc. However, one could quite easily support identifying many of these sub-genomes by carefully defining a subgroup array and then automatically calculating digests for the sequences that are included in each subgroup, or in combinations of them:
E.g. if one divides human sequences into chroms, MT, alt, and unplaced, I can at least see use for the following combos: chroms + MT, chroms + alt, alt + unplaced (= decoy?). For four subgroups, there are 16 combinations, so there is in that case not much extra overhead in calculating all combinations.
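The subgroup combinations can be enumerated mechanically; a quick sketch (the subgroup names are just the ones used above):

```python
from itertools import combinations

def subgroup_combos(groups):
    # All non-empty subsets of the subgroups; each subset could get its
    # own precomputed digest. There are 2**n subsets for n groups,
    # 2**n - 1 of them non-empty.
    result = []
    for r in range(1, len(groups) + 1):
        result.extend(combinations(sorted(groups), r))
    return result

combos = subgroup_combos(["chroms", "MT", "alt", "unplaced"])
assert len(combos) == 2**4 - 1          # 15 non-empty (16 counting the empty set)
assert ("MT", "chroms") in combos       # chroms + MT
assert ("alt", "unplaced") in combos    # alt + unplaced (= decoy?)
```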
The problem is obviously the general applicability of such categorization across sequence collections. For instance, what about protein sequences? However, I believe the advantages, especially in light of the FAIR principles, are so great that I think we should explore this path.
H. masks (sequence gaps?)
One can define many possible types of masks, but we have not (in the meetings where I have been present, at least) delineated this much. There is at least one type of mask I would argue would be very useful as an array, namely a mask specifying the gaps in the assembly, perhaps in combination with contig placement (such as CONTIG in https://www.ncbi.nlm.nih.gov/nuccore/NC_000001.11). For DNA, one could also derive such gap info directly from the 'N' characters (which I believe would also include smaller gaps). There would be similar masks derivable from other alphabets, e.g. proteins, where gaps are encoded differently (I believe by a period character).
Example: There are situations where one would be interested in such assembly gap information without needing access to the sequence itself, for instance when randomizing the position of feature regions (such randomization should avoid the assembly gaps, e.g. the centromeres/telomeres). Since almost no track files include basic information about which parts of the genome the tracks are defined on, having assembly gap information as part of the reference genome digest would be an easy way to attach such info to track files. This would improve the analysis of tracks tremendously, as the lack of a feature region in a locus also carries information, but only if such a lack is not due to a gap in the assembly. See e.g. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2438-1. Putting the track files into buckets with the same set of gaps in the assembly (independently of the base pairs) would be useful to find out which files can be analyzed together under the same assumptions. This is a use case that is directly relevant for me.
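For DNA, the 'N'-derived gap mask mentioned above could be computed along these lines (a sketch, using 0-based half-open intervals as an assumed convention):

```python
import re

def assembly_gaps(sequence):
    # Gap mask derived from runs of 'N' in the sequence itself. Real
    # assembly gap records carry more detail (gap type, contig placement)
    # than can be recovered from the sequence alone.
    return [(m.start(), m.end()) for m in re.finditer(r"N+", sequence.upper())]

assert assembly_gaps("ACGTNNNNACGTNACG") == [(4, 8), (12, 13)]
assert assembly_gaps("ACGTACGT") == []
```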
I. alphabet
I would like to add 'alphabet' to the list of arrays. In my mind, this is not the same as the summary of the characters used in the sequence, but the possible characters that could have been used. Similar to the assembly gaps use case above, the fact that a character is not used in a sequence position is a separate piece of information given that we know the character was allowed in the first place. Hence, the alphabet (defined in this way) adds information to what is inherent in the sequence itself, and one can easily imagine sequences that are the same but are defined in relation to different alphabets. Having the alphabet available at recursion level 1 (together with the lengths) would also help if one needs to pre-register some data structure before the sequences are downloaded. Also, even if this is not the main argument, having the alphabet available would be a simple way to differentiate between e.g. DNA and protein sequences, without having to add a sequence type array/metadata field (which is btw also a possibility, of course, but I think I will stop here...).
F-I. rules for non-core arrays
If all these non-core arrays (and more) are supported, there will be a large number of possible combinations of them. In this case, given 3 core configurations and 4 extra arrays, there are 3 × 2^4 = 48 total combinations. If one has all this information available, calculating the digest for all of the combinations seems a bit over-the-top, especially in combination with different subgroups. Hence, it is probably smart to place the extra arrays into groups where all arrays in a group are required to be present, at least if one is generating digests for various combinations of arrays. Here is a suggestion of such rules:
iii) topology, alphabet and sequence gaps are optional, but must be added together as a "sequence model" entity.
iv) sequence groups are optional, but if added, digests should be created for subcollections of sequences with a particular value in the sequence groups array, as well as for all combinations of these.
v) Digests for all allowed combinations of arrays according to the above rules should be calculated (in combination with the subcollection calculations).
Conclusion
In the case of 4 sequence group values, there are 3 (core) × 2^2 (extra) × 2^4 (seq groups) = 192 combinations to be calculated for each full sequence collection. This of course represents a bit of overhead, but we are still not talking about much data.
Using the above rules, one would for instance automatically have a resolvable identifier for the "chromlen" data of core human chromosomes in GRCh38, including sequence model information, something which would be incredibly useful for genomic track analysis tools/scripts. Instead of having to manually find and add such data using relevant command line parameters, one would just need to specify the seqcol digest, and the tool could automatically do the rest directly using the seqcol API. For me personally, this would solve some major implementation issues in an elegant way.
So what do you think? There is a huge number of details to discuss here, but there is also a discussion to be had on a more philosophical level about what kind of problems seqcol should solve. I am, as you can easily conclude from the above, in the "we have an opportunity to be heroes" camp!
And sorry for the length of this post...
Above is a lengthy discussion on what to include in the array-to-digest. This thread happened as we transitioned to the array-based digest method (also addressed in issue #10). Anyway, the new array-based approach changed how we conceptualized what is included in the sequence collection, and made it much more flexible. We sort of see it as being able to hold whatever arrays a user wants, with a few controlled vocabulary terms. It gives a lot of flexibility to the user, and also reshapes the question here a bit, to: which arrays should be mandatory? This is addressed in pending ADR in PR #18.
We haven't really finalized this, so I wanted to wrap it up. It's clear from discussion that sequences should be optional, and lengths should be mandatory. But what about names? Should names be mandatory?
A corollary is whether and where we enforce uniqueness. It's clear to me that we don't enforce uniqueness for lengths. I guess we also don't enforce uniqueness for sequence. But what about for names?
I think @daviesrob had some thoughts here as to the mandatory-ness and uniqueness constraints of the names array. Can we come to a conclusion here, now that we have increased clarity on what a seqcol is?
From memory, the argument here is that a collection could be composed only of names and lengths. We also said it was likely a collection would define either names or sequences to stand alongside the lengths. I also think we suggested someone might just duplicate the checksums in sequences to provide names if we made names mandatory.
If the above is a good recollection then I would support making names/sequences required, but only requiring one of those in a given collection. This also seems to align with a number of @sveinugu's comments that lengths alone would not be a valid collection.
The comment about sequences being non-unique is, from memory, because two independent sequences could create the same checksum but have different associated metadata. Is there a reason why a sequence collection would be invalid if a collection had a total duplication of an entry? Sequence, name & length were just duplicated. I personally think if we can answer that question (perhaps we did last week) then the uniqueness of names would be an easy thing to answer. If you can't duplicate an entry in a collection then names can be unique.
I think you're right, Andy. I like the concept of "total duplication".
Is there a reason why a sequence collection would be invalid if a collection had a total duplication of an entry? Sequence, name & length were just duplicated. I personally think if we can answer that question (perhaps we did last week) then the uniqueness of names would be an easy thing to answer. If you can't duplicate an entry in a collection then names can be unique.
Maybe to just restate this question in another way: Which of the following data structures is a sequence collection?
Since our entries are split out into separate arrays, it will be easier to implement if we allow total duplication of an entry, because we must use lists under the hood to accommodate duplication within an array. However, if we don't need total duplication allowed for some reason, then implementing a check to prevent it may give us some nice computational properties down the line, such as being able to assume that the concatenation of name + sequence is unique, or that name must be unique.
I suppose we could enforce disallowing total duplication of an entry by simply requiring and enforcing that names be unique.
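A sketch of what the "total duplication" check amounts to with array-based collections (the array names here are assumptions, not spec terms):

```python
def has_total_duplication(names, lengths, sequences):
    # An entry is the tuple of values at one index across all arrays;
    # total duplication means some entry occurs more than once.
    entries = list(zip(names, lengths, sequences))
    return len(entries) != len(set(entries))

def names_unique(names):
    return len(names) == len(set(names))

# Requiring unique names rules out total duplication as a side effect:
assert has_total_duplication(["chr1", "chr1"], [100, 100], ["d1", "d1"])
assert not names_unique(["chr1", "chr1"])
# ...while duplicated sequence content with distinct names stays legal.
assert not has_total_duplication(["chrA", "chrB"], [100, 100], ["d1", "d1"])
```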
Unique names are required for some use-cases (e.g. SAM and VCF). Unless we can think of a really good case where having duplicated entries is necessary, I think enforcing name uniqueness would give us some very handy properties.
Looking over my notes from our previous discussion on this (2021-08-25) we had I think approached consensus that names should be required, and should be unique. If you don't have a names array, you should just generate one with numbers 1...x as the sequence identifiers. This provides uniquely identifiable sequences.
I can't really think of a situation where this would bite us. You could always stick your non-unique identifiers into some other custom array if you needed that.
So:
names are mandatory, and must be included to provide uniquely identifiable sequences
I am a bit uncertain about requiring names in that it removes one of the more straightforward applications of the sequence collections standard: to provide a single unique digest for a collection of sequences, given only the sequences themselves. Extracting the lengths from the sequences is simple and uniquely defined; there is no choice to be made there. However, generating a names array with unique values (when this information is not present) can be done in a multitude of ways, with indexing from 0...n being one and duplicating the sequences array being another. Each of these will result in a different digest for the sequence collection. As a consequence, we lose one of the simplest use cases, where one can uniquely and consistently create one (and only one) digest from a bunch of sequences, given that there is no other metadata.
This BTW is also an argument for providing an order-invariant digest algorithm, as having to enforce an order into an unordered set of sequences is also a process where there are several canonical ordering schemes to choose from, each of which will result in a different digest for the same set of sequences.
In fact, defining a canonical ordering scheme to be able to uniquely generate a single order-invariant digest, given order is missing/not important (which I argue for in https://github.com/ga4gh/seqcol-spec/issues/5#issuecomment-1011061095) seems similar to defining a canonical naming scheme to be able to uniquely generate a single naming-invariant digest, given names are missing/not important. Even more, it seems the choice of defining a sequence collection as an ordered list of sequences opened up the need for also defining a canonical ordering scheme in cases where order is not defined. Similarly, making the choice of requiring the names column opens up the need for defining a canonical naming scheme in cases where names are not defined.
For me, if we are unable to provide a deterministic function:
f([list of sequences]) => unique digest
then I suspect that we have over-engineered things.
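A deterministic f of this kind is easy to sketch if one sorts the per-sequence digests before combining them (illustrative only; the hashing and serialization choices here are assumptions, not the spec's algorithm):

```python
import hashlib

def seqcol_digest_from_sequences(sequence_digests):
    # Sorting removes order dependence, so any permutation of the same
    # multiset of sequence digests yields the same collection digest.
    canonical = ",".join(sorted(sequence_digests))
    return hashlib.sha512(canonical.encode()).hexdigest()[:24]

a = seqcol_digest_from_sequences(["d-chr1", "d-chr2", "d-chrM"])
b = seqcol_digest_from_sequences(["d-chrM", "d-chr1", "d-chr2"])
assert a == b  # one and only one digest for a given bag of sequences
```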
I have no objections against requiring unique names, but if the only argument is to disallow total duplication, I believe we might be adding unnecessary constraints.
In conclusion, my suggestion is to require lengths and one or both of sequences and names. If sequences are allowed to have duplicates (which I think should be the case), then total duplication of an entry would by consequence be allowed. However, it makes sense to me to require names to be internally unique, so that one can depend on the names to uniquely identify individual sequences. Then, in scenarios where total duplications should not happen, one might decide to enforce this by adding an ad hoc requirement that the names array is present. However, I don't think it should be required by the standard itself.
it removes one of the more straightforward applications of the sequence collections standard
Well, if we define "default names", then this application is not removed. Your argument is right, but only if we don't specify the approved way to construct default names.
the only argument is to disallow total duplication, I believe we might be adding unnecessary constraints.
Could be, but there are multiple arguments:
One way out of the problem, where you have sequences but no names, might be to derive the names from the sequence checksums. For example, this Python script reads in a list of checksums and makes names by combining the first six characters with a counter to deal with clashes. The counter values are set based on an alphabetical sorting of the checksums, preserving the original order for duplicates.
```python
#!/usr/bin/python3
import fileinput
from collections import defaultdict

# Read one checksum per line, preserving input order.
hashes = []
for line in fileinput.input():
    hashes.append(line.strip())

# Assign each distinct checksum its (1-based) position in an alphabetical
# sort of all checksums sharing the same 6-char prefix; duplicates of a
# checksum reserve the immediately following positions.
seen = {}
count = defaultdict(int)
for h in sorted(hashes):
    prefix = h[0:6]
    count[prefix] += 1
    seen.setdefault(h, count[prefix])

# Emit names in the original input order; duplicate checksums get
# successive counter values.
for h in hashes:
    prefix = h[0:6]
    print(f"{prefix}-{seen[h]:06d} {h}")
    seen[h] += 1
```
So this input:
6aef897c3d6ff0c78aff06ac189178dd
f98db672eb0993dcfdabafe2a882905c
9f26945f9f9b3f865c9ebe953cbbc1a9
3210fecf1eb92d5489da4346b3fddc6e
a811b3dc9fe66af729dc0dddf7fa4f13
9f26945414693fe0a2399cf82e707e03
5691468a67c7e7a7b5f2a3a683792c29
cc044cc2256a1141212660fb07b6171e
cc044cc2256a1141212660fb07b6171e
c67955b5f7815a9a1edfaa15893d3616
9f26945414693fe0a2399cf82e707e03
6c198acf68b5af7b9d676dfdd531b5de
9f2694147561b361091791b9010cd28b
c0eeee7acfdaf31b770a509bdaa6e51a
gets these names:
6aef89-000001 6aef897c3d6ff0c78aff06ac189178dd
f98db6-000001 f98db672eb0993dcfdabafe2a882905c
9f2694-000004 9f26945f9f9b3f865c9ebe953cbbc1a9
3210fe-000001 3210fecf1eb92d5489da4346b3fddc6e
a811b3-000001 a811b3dc9fe66af729dc0dddf7fa4f13
9f2694-000002 9f26945414693fe0a2399cf82e707e03
569146-000001 5691468a67c7e7a7b5f2a3a683792c29
cc044c-000001 cc044cc2256a1141212660fb07b6171e
cc044c-000002 cc044cc2256a1141212660fb07b6171e
c67955-000001 c67955b5f7815a9a1edfaa15893d3616
9f2694-000003 9f26945414693fe0a2399cf82e707e03
6c198a-000001 6c198acf68b5af7b9d676dfdd531b5de
9f2694-000001 9f2694147561b361091791b9010cd28b
c0eeee-000001 c0eeee7acfdaf31b770a509bdaa6e51a
This has the property that sorting by name gives the same ordering as if you sorted by sequence checksum, which might be useful, e.g. for generating a canonical order:
3210fe-000001 3210fecf1eb92d5489da4346b3fddc6e
569146-000001 5691468a67c7e7a7b5f2a3a683792c29
6aef89-000001 6aef897c3d6ff0c78aff06ac189178dd
6c198a-000001 6c198acf68b5af7b9d676dfdd531b5de
9f2694-000001 9f2694147561b361091791b9010cd28b
9f2694-000002 9f26945414693fe0a2399cf82e707e03
9f2694-000003 9f26945414693fe0a2399cf82e707e03
9f2694-000004 9f26945f9f9b3f865c9ebe953cbbc1a9
a811b3-000001 a811b3dc9fe66af729dc0dddf7fa4f13
c0eeee-000001 c0eeee7acfdaf31b770a509bdaa6e51a
c67955-000001 c67955b5f7815a9a1edfaa15893d3616
cc044c-000001 cc044cc2256a1141212660fb07b6171e
cc044c-000002 cc044cc2256a1141212660fb07b6171e
f98db6-000001 f98db672eb0993dcfdabafe2a882905c
Could be, but there are multiple arguments:
- disallow total duplication (as you mention). this does have some value.
Yes, it does have some value, although I am uncertain of how important this is.
- for compatibility with SAM and VCF (Rob mentioned)
This is a very good argument. The difference between saying "seqcol is compatible with SAM and VCF" and "seqcol is compatible with SAM and VCF given that a names array was used when generating the digest" is large, and I think I am sold by this argument alone.
- it provides a natural, easy-to-understand, easy-to-implement solution to the canonical ordering question: simply order by sequence names.
It does in the cases where the sequence names are already given and/or the sequences already have a natural ordering. However, if one wants to create a name-invariant seqcol digest (if names are not available or are unimportant) and the order of the sequences is not given, one still needs to first order the sequences according to some deterministic algorithm. The fact that a names array derived from the indices of the sequences can then be used to order the sequences is circular and does not help a bit in this scenario. Basing the names array on the sequence digests is a better solution; however, it feels more like an ad hoc solution to a particular edge case than a general algorithm for naming/ordering. I will describe a suggestion below that I believe will solve this and most of the other core issues we are discussing (hint: it is a further development of my most recent idea for solving the ordering issue...)
- it enforces the common way we think of things, that the names are identifiers. I think this could actually prevent some downstream bugs and issues.
I definitely agree that enforcing unique identifiers would be very useful downstream. However, I feel the above is really only an argument for requiring names to be unique IF they are present, but not an argument for enforcing that the names column as a whole is mandatory. One can argue that it would be an advantage in many typical implementation scenarios to know a unique 'names' column will always be present. However, if unique keys are needed, one can always just generate them by adding indexes, etc.
However, in conclusion, I agree that there are strong arguments for making the names column mandatory. But then I would argue that we should also define a canonical way of generating names when they are not present, and I will provide below a suggestion for such an algorithm that I believe will provide a bunch of advantages.
Edit: I will try to write out my suggestion tomorrow. Too late for me today.
So I have now tried to write out my idea in full (and then some).
(Warning: novel-length comment follows! This can be a nice read over the weekend, under a blanket by the fireside with a cup of tea...)
Before going into the details, I would like to take a step back and provide some thoughts as to what we are actually trying to accomplish with the seqcol standard, the way I see it (with the reservation that I was not part of this project from the start and might have misunderstood some fundamentals):
1. Create a persistent, globally unique identifier for a collection of sequences as defined in a repo of some sort, with a particular focus on reference genome repos.
2. That the identifier should be a deterministic and unique digest of the sequence collection contents, where the contents are defined in some (more or less) standard data structure.
3. Provide a standard for an endpoint for fetching a sequence collection given the identifier
4. Provide a comparison function that can return a useful overview of important similarities and differences between the contents of two sequence collections
5. Allow for use of the above in scenarios where the exact sequences are not important, e.g. to allow the creation of identifiers for coordinate systems that contain names and lengths, but not the actual sequences. The most prominent use case of this is as metadata for genome position annotation files (e.g. BED/SAM/VCF files) [which is what I would call "genomic track" files, as argued for here and onwards, but that is a digression].
That is more or less my impression of our goals. The last goal seems to be the least understood of the ones in the list, where I assume I have contributed significantly to the confusion by arguing at length and in great detail in issue https://github.com/ga4gh/seqcol-spec/issues/5 why I believe that an identifier that refers to a coordinate system should be based on an unordered data structure. One lesson drawn from this discussion is that it seems we are actually talking about three different types of identifiers:
A. Identifiers that match one-to-one with a particular collection of sequences where the sequence content matters (goal 1 above). This is useful for database interoperability ("how to facilitate comparison of the contents of different reference genome repos?"), for keeping data provenance ("what were the actual sequences used to generate this file?"), and a bunch of other scenarios.
B. Identifiers that match one-to-one with a particular chromSizes (or similarly named) file, as used in e.g. UCSC tools or BedTools, in which the sequences are ordered. This would e.g. be practically useful to avoid having to maintain/include a particular chromSizes file when analyzing e.g. a BED file in a command line tool. Instead of providing a chromSizes file, which creates extra hassle in e.g. workflow scenarios, one could replace it with an identifier given as a string parameter to the tool.
C. Identifiers that match one-to-one with a particular coordinate system in its mathematical sense, where the order of the sequences does not matter. Such identifiers would be useful in at least some contexts, such as when building a file tree for a collection of BED files based on different reference genomes that are variations of the same major genome version (e.g. hg38). In this scenario, identifiers of type C could be used as the names of directories containing files that can be analysed together, since they are based on the same coordinate system in its mathematical sense, even if the underlying reference genomes the BED files are based on have ordered the sequences differently.
(Btw: this will be more than a rehash of the unordered/ordered discussion, just be patient with me...)
Identifier type C might seem the least important of the lot, and in one sense I agree with this. However, it is the only identifier type that can fully represent the "common shared denominator" of several sequence collections if these are differently ordered. Identifiers of type B can easily be replaced by identifiers of type A in scenarios where you are only analyzing e.g. a single BED file (and where you in any case should preserve the A type identifier to maintain provenance). However, in scenarios where you want to identify a shared coordinate system for several files (as in a genome browser), such an identifier should really be of type C, I believe. The alternative is to use an identifier of type B (or A) and say, in a sense, that the ordering of sequences in B (or A) is the canonical ordering and that all other sequence orders must be converted to follow B, even if there is no good reason to pick any particular order over another.
At this point you might be thinking that, yes, having to use type A or B instead of C might be a nuisance, but it should not keep you awake at night, so to speak. We need to focus on A and possibly B. And if this was all there was to this, I would probably agree.
However, there is a fourth type of identifier that we have not discussed in a principled way in the seqcol meetings:
This is really what the names column is used for, and one of the issues plaguing bioinformatics has been the inability of the community to settle on a standard for this. Instead, we still have to work around the chr1 vs 1 debacle. [Btw, there was a tongue-in-cheek sentence somewhere in the seqcol GitHub repo or web site that we were finally going to fix this, but I cannot find it now. Have we reduced our ambition level? :)] So the Refget digest is obviously a step in the right direction, but the reason why we still argue for making a names column mandatory must be that the sequence digest alone does not cover all the needs. I believe the main reason for this is the use of e.g. chr1 to refer to the coordinate system derived from the appropriate version of the reference genome (hence goal 5 above). However, if the reference genome identifier is missing or imprecise, it is impossible to know what exactly is meant by chr1. As a conclusion, we aim to provide the identifier of type A to solve this (goal 1), so that the combination of the seqcol identifier and the name chr1 is a precise and globally unique identifier of the sequence!
But we have had persistent unique sequence collection identifiers all along for genome assemblies, like the GenBank assembly accession: GCA_000001405.29 (which refers to hg38 patch 14, released a week ago!). I believe one main thought behind seqcol identifier type A is to replace such repository-dependent accession numbers with universal content-dependent digests. But there has been nothing stopping the community from annotating e.g. BED files with the corresponding assembly accession numbers. Still, imprecise identifiers like hg38 remain in use, especially for genome annotation data (tracks). For instance, in the Galaxy framework, there is only one metadata field for reference genomes and it is by default based on UCSC naming (i.e. hg38). I believe the reasons for this are at least two-fold:
1. users provide whatever metadata is required by tools, with the UCSC Genome Browser being a prime example of this.
2. hg38 catches something the accession numbers (or a seqcol identifier of type A) do not.
Regarding point 2, I believe hg38 is in practice used to categorize files that follow the coordinate system derived from the lengths of the (often only core) chromosomes of the latest version of the human reference genome, where the names follow the UCSC standard (to simplify interoperability). In addition, hg38 is humanly readable (and easily writeable; I still have to double-check every time I write GRCh38). Moreover, I have argued at length that in most cases hg38 logically refers to an unordered coordinate system (identifier type C instead of B), but in the current argument this is in any case a minor point.
My thought has been all along that if we want people to replace `hg38` with a `seqcol` identifier, we should make sure that it solves the same problem. However, I am now slowly coming to realize that a `seqcol` identifier will in any case not fill the same role as `hg38`, at least not an identifier of type A. In practice, we aim for a `seqcol` identifier to replace the assembly accession numbers in current use, but such identifiers are in any case underutilized in the genome annotation/track file analysis scenarios described above. I have argued for at least providing the possibility of an order-invariant `seqcol` digest algorithm in order to make it possible to design a seqcol-compliant repository based on type C identifiers (as there will be at least one user of this, which is myself). But even if I will find such an unordered identifier very useful in my tool development, it has little practical chance to topple the `hg38` dominance for analysis of genome annotation/track files for linear reference genomes, and I believe the only real possibility for `seqcol` identifier dominance in the user community (=bioinformaticians) going forward is to make sure `seqcol` ids are adopted for graph genome analysis scenarios (which we have hardly talked about).
So this long rant is really questioning whether we are really trying to solve the correct questions related to goal 5 (coordinate systems). So if the real issue is that the names used to identify sequences are imprecise and incompatible (i.e. `chr1` and `1`), perhaps we should add as another goal for seqcol the following:

6. Define content-derived globally persistent and unique identifiers for single sequences within sequence collections (let's call these `seqids`).
If each element within, for example, a BED file is annotated with a persistent identifier for the sequence it refers to, one can make sense of the data without requiring that the BED file itself be annotated with a precise enough reference genome identifier. I believe GFF, SAM and VCF have all gone this route, allowing the sequences to be precisely specified within the headers inside the file itself. For a BED file (maintaining compatibility by not adding any headers) this would mean exchanging the `chr1` column with a `seqid` digest. For compatibility, I believe this would actually be the best choice, given that there are machine-actionable ways of relating this identifier to `chr1` or `1` or whatever. Every elemental annotation unit (e.g. a ChIP-seq peak) will be uniquely related to the sequence from which it came, but with no need to change the file format.
So could the refget digest not be used as a sequence id? Well, I don't think so, as the main advantage of `chr1` is that it is dependent on the length of the sequence only and not the exact sequence contents, allowing comparison of data e.g. across minor patches of the reference genome. Such an identifier would also be dependent on e.g. topology, as one wants to analyse regions on circular sequences differently than regions on a linear sequence. As I have argued for elsewhere, making the identifier dependent also on assembly gaps might be useful in certain scenarios. The point being that for analysis scenarios where `chr1` is used today, we could precisely replace this with a `seqid` that is a digest of a set of seqcol arrays for a single sequence, possibly including the `sequence` array (but in analysis scenarios probably not, so that the `seqid` instead refers to a coordinate system).
So what if we could support the following workflow?

I. Input file i: A legacy `hg38` BED file with UCSC-type chromosome names.

II. Input file ii: A properly annotated seqcol-compliant BED file based on the `seqcol:af236b` sequence collection (identifier type A), where all regions are annotated with type D identifiers (another digest type, e.g. `seqid:64ga09`) in the names column, uniquely referencing the sequence within the collection `seqcol:af236b`.
III. Tool X is a BED file conversion tool that unifies the input files to use sequence names that precisely identify the coordinate system in the particular context needed for the analysis. For file i, it translates `chr1` into the matching `seqid` identifier, say `seqid:3119ga`, but this one refers to a sequence collection element that represents a coordinate system (i.e., where the sequence array is dropped). This conversion might require a `chromSizes` file or might use some other logic or user input to map towards the correct `seqcol`/`seqid` identifiers. Note that the coordinate system records do not need to be hosted in any repo, as the digest can be generated directly from a `chromSizes` file or a full `seqcol` record. A similar conversion is done for file ii; however, since the `seqcol` identifiers of the sequences have already been specified, there is a simple one-to-one match between this and the related coordinate system identifier (in this example, the digest of the coordinate system derived from sequence item `seqid:64ga09` by removing the `sequences` array value might also be `seqid:3119ga`). The output of tool X is two new BED files with the appropriate `seqid` identifiers in the `names` column that should match between the files, given that they refer to the same concept of a sequence.
IV. Tool Y is an analysis tool that might or might not take advantage of the features that the `seqid` identifiers provide. Standard and legacy software that takes BED files as input should work with no changes required.
V. References to `seqids` in any output files or other results can be translated back to e.g. `chr1` through APIs, libraries, or similar.
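As a sketch of what Tool X's conversion in step III could look like, here is a minimal Python illustration. The digest scheme, chromosome lengths, and resulting `seqid` values are all hypothetical; this only shows the idea of deriving a name-invariant coordinate-system identifier from a `chromSizes`-style mapping.

```python
import hashlib

def coord_seqid(length: int) -> str:
    # Hypothetical per-sequence coordinate-system digest: derived from the
    # length only (no sequence contents), so a chromSizes file suffices.
    return "seqid:" + hashlib.sha256(f"length={length}".encode()).hexdigest()[:6]

# Hypothetical chromSizes mapping (values are illustrative, not real hg38 lengths).
chrom_sizes = {"chr1": 150000000, "chr2": 120000000}

def convert_bed_line(line: str) -> str:
    # Replace the legacy name column with the matching coordinate-system seqid.
    name, rest = line.split("\t", 1)
    return coord_seqid(chrom_sizes[name]) + "\t" + rest

converted = convert_bed_line("chr1\t100\t200")
```

Two files converted this way would end up with identical identifiers for sequences that share length, regardless of their original naming scheme.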
So how would this be implemented? Here's my suggestion:
names | seqids | lengths | sequences | topologies |
---|---|---|---|---|
chr1 | seqid:64ga09 | 150000000 | refget:a39a48 | linear |
chr2 | seqid:fa01482 | 120000000 | refget:001f16 | linear |
chr3 | seqid:7cc309 | 110000000 | refget:911b00 | linear |
So the `seqids` column is generated as a digest of the per-sequence values in the other arrays, except for the names column. We should require that the `seqids` are unique within a collection, which I believe in most cases will not be a problem. In cases where there is full duplication across all arrays except `names`, one would need to add another array to split them. In most cases I suspect there will be a natural such array that should be included. If not, one can always, as a standardized last resort, add a `duplicate_index` array that, for each total duplication, counts from 0 (or 1) to the number of identical copies, e.g.
names | seqids | lengths | topologies | duplicate_index |
---|---|---|---|---|
chr1 | seqid:64ga09 | 150000000 | linear | 0 |
chr2 | seqid:fa01482 | 120000000 | linear | 0 |
alt_1 | seqid:7cc309 | 12345 | linear | 0 |
alt_2 | seqid:47g880 | 12345 | linear | 1 |
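To make the generation concrete, here is a minimal Python sketch of the scheme just described, assuming a simple truncated SHA-256 digest (not the actual seqcol algorithm) and the illustrative arrays from the table above:

```python
import hashlib

def digest(*values: str) -> str:
    # Illustrative truncated digest; not the official seqcol algorithm.
    return hashlib.sha256("\x1f".join(values).encode()).hexdigest()[:6]

# Per-sequence arrays; names are deliberately excluded from the seqid digest.
names      = ["chr1", "chr2", "alt_1", "alt_2"]
lengths    = ["150000000", "120000000", "12345", "12345"]
topologies = ["linear", "linear", "linear", "linear"]

seqids = [digest(l, t) for l, t in zip(lengths, topologies)]
# alt_1 and alt_2 are total duplicates outside of names, so they collide.

# Standardized last resort: a duplicate_index array that counts copies.
counts: dict = {}
duplicate_index = []
for s in seqids:
    duplicate_index.append(counts.get(s, 0))
    counts[s] = counts.get(s, 0) + 1

seqids = [digest(l, t, str(d))
          for l, t, d in zip(lengths, topologies, duplicate_index)]
```

After folding in `duplicate_index`, the `seqids` are unique within the collection.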
In addition to the advantages of including globally unique and persistent per-sequence identifiers that can be customized to almost any need, there are also a bunch of advantageous technical properties of this solution:
Since the `seqids` array contains digests of the other arrays except the `names` array, the top-level digest can be created by digesting the `names` and `seqid` arrays only.
`seqids` are globally unique and can be used to compare individual sequences in different sequence collections, regardless of their names. In the comparison API, we would get extra counts of all the overlaps where everything but the name is equal.
Addition to the last point: for `seqids` to be comparable across sequence collections, these must include the exact same arrays and follow the same vocabulary and logic. By this argument we might want to introduce a concept of `subtypes` or `templates` or similar that can be adopted in a standard way in a particular subgroup of seqcol repos, say across reference genome repos. Other arrays and vocabularies can be standardized for e.g. graph genome seqcol repos, or whatnot. We could define some of these, while others are free to define their own, if needed.
The `names` and `seqids` arrays provide a simple map to convert between different names, e.g. from `chr1` to `seqid:64ga09` in one seqcol and from `seqid:64ga09` to `1` in another. This requires that the `names` array is also unique within a collection, but I don't think anyone has any trouble with this requirement.
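As a sketch, this name conversion could look like the following in Python, with hypothetical `seqid` values standing in for real digests:

```python
# Two seqcols that name the same underlying sequences differently; the shared
# seqids (hypothetical values) bridge the two naming schemes.
ucsc_seqcol    = {"names": ["chr1", "chr2"], "seqids": ["seqid:64ga09", "seqid:fa0148"]}
ensembl_seqcol = {"names": ["1", "2"],       "seqids": ["seqid:64ga09", "seqid:fa0148"]}

name_to_seqid = dict(zip(ucsc_seqcol["names"], ucsc_seqcol["seqids"]))
seqid_to_name = dict(zip(ensembl_seqcol["seqids"], ensembl_seqcol["names"]))

def translate(name: str) -> str:
    # Convert a UCSC-style name to the Ensembl-style name via the shared seqid.
    return seqid_to_name[name_to_seqid[name]]
```

This relies only on both arrays being unique within each collection, as required above.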
As the `seqids` array is generated as part of the digest algorithm, it is not "mandatory" in the sense we have discussed with `names`, but it will still always be there and can be used as a unique identifier for the sequence records. The `names` array then does not need to be mandatory, or we can say that in case a `names` array is not provided, the generated `seqid` array will be used also for `names`. Since the `lengths` array can also be derived from the sequences, we are able to provide a deterministic function from a list of sequences to a `seqcol` digest, as I requested above.
We have a simple solution for generating the unordered digests I have argued for, namely to sort the arrays by the `seqids` array prior to calculating the top-level digest. It does not really matter much whether this is useful or not, as it will be simple to add this to the spec given my suggestion, and there are plenty of other arguments for my idea independent of the order issue. Thus, we can also provide a deterministic function from an unordered set of sequences to a `seqcol` digest.
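A minimal sketch of this order-invariant variant, assuming a simple truncated SHA-256 digest rather than the actual seqcol algorithm, and hypothetical seqid values:

```python
import hashlib

def digest(values) -> str:
    # Illustrative truncated digest; not the official seqcol algorithm.
    return hashlib.sha256("\x1f".join(values).encode()).hexdigest()[:8]

def unordered_seqcol_digest(names, seqids) -> str:
    # Sort the rows by seqid before digesting, so the result does not
    # depend on the input order of the sequences.
    rows = sorted(zip(names, seqids), key=lambda row: row[1])
    return digest([value for row in rows for value in row])

a = unordered_seqcol_digest(["chr1", "chr2"], ["seqid:64ga09", "seqid:fa0148"])
b = unordered_seqcol_digest(["chr2", "chr1"], ["seqid:fa0148", "seqid:64ga09"])
```

The same collection content in a different order yields the same digest, which is the deterministic set-to-digest function argued for above.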
Having globally unique identifiers for sequences (where equality comparisons are dependent on a particular template that standardizes the arrays and vocabularies) allows for network-style analysis, including semantic web searches etc, at the level of individual sequences.
As illustrated above in the BED analysis example, `seqids` are usable in all scenarios and file formats that already support a sequence name. There is also no loss of information in replacing a sequence name with the `seqid`, as the reverse mapping is easily available.
The `seqids` might also be advantageous to use in the refget reverse lookup context (I have not followed this discussion closely). There will in any case be a one-to-many mapping between each `refget` digest and the corresponding (non-coordinate-system) `seqid` digests that might be taken advantage of in certain scenarios (I do not have any examples at the top of my head).
In the above writeup I at least aimed to add namespaces to all identifiers, so that they all follow the CURIE format. This adds compatibility with identifiers.org and n2t.net and is generally a good idea, I believe. This is of course independent of the rest of my idea, and I would in any case argue for providing CURIE identifiers. With the current idea, CURIEs make it easier to combine different types of identifiers in a readable way. For simplicity and readability, I would thus argue that the namespaces should always be included, also in the `seqids` array, even though the namespace part is redundant. I am of course always open to arguments in other directions. I believe this is also an issue where input from the rest of GA4GH is relevant.
We have (at least theoretically) solved the `chr1` vs `1` debacle once and for all!
11 arguments should be enough, but if you want another one: 12. The idea is somewhat elegant, isn't it?
EDIT: 13. I forgot to add that total duplications are by definition not possible with this idea.
Thanks for writing this @sveinugu. I can't claim to follow your arguments completely, but I think I understand some of the main points. After reading through this and Rob's function suggestion above, I can add my thoughts:
My main conclusion is that: I think the canonical way to assign sequence names should be simply number them from 1 to X, and that's it.
My position comes from these basic points:
I propose that names should be required to be unique within a collection, and should also be mandatory. In this case, the possible scenarios of what a user may have with names/sequences are:
So, the issue is: what do you do when a human hasn't provided names for sequences?
Well:
`length`, which is the only valuable information that would go into creating such an identifier anyway. This is a major complexity increase with, in my opinion, no benefit. I still think the best approach is to say "if you don't have names, you just number them" and that's it. In the event that a human has sequences but no names, then the names don't really matter. They are arbitrary, and so we just number them to give them a unique identifier so the seqcol protocols can function correctly.
@sveinugu argues against this above here:
> However, if one wants to create a name-invariant seqcol digest (if names are not available or are unimportant) and the order of the sequences are not given, one still needs to first order the sequences according to some deterministic algorithm.
But to me, this is missing the point. The sequences must be in some order. There's no such thing as a list of sequences that does not have an order; that's not the way computers or lists work. The seqcol is not handling order at all.
So I was asked in the meeting to provide a use case, and the best use case I have is my own. Even though this is a very specific use case, my thinking is that if it is possible to generalize the seqcol a bit in a principled manner to cater for my use case with no or only marginal costs to the existing functionality, I believe we would also probably cater for a range of other use cases that we haven't thought of yet.
So my core use case is in the context of developing tools for statistical analysis of BED-like files (which I call genomic track files). One of the more typical questions I would like to answer is this:
"Given two (or more) track files representing sets of intervals along a reference genome (say, two or more files of ChIP-seq peaks), are these sets of intervals co-localized (e.g. overlapping) more than expected by chance?"
The following Bioinformatics paper describes the implementational and methodological considerations for such type of investigations: https://doi.org/10.1093/bioinformatics/bty835
The most important consideration regards the question of what one means by "expected by chance", i.e. what is the assumed null model for the investigation? Statistical significance measures whether the observations of the data are more extreme than what is expected under the null model, so the realism of the assumptions that make up the null model is of utmost importance. Too simplistic null models will result in a large number of false positives, where the statistical test will reject the null model and thus indicate a biologically relevant relation when there is none. In brief: if the assumptions that make up the null model are false (unrealistic), the statistical test is not answering the question we want it to answer.
So one type of statistical test deployed to answer the question of co-localization is based upon randomization of the dataset(s) by shuffling the intervals to random positions along the coordinate system derived from the reference genome. So if the null model assumes uniform shuffling to all parts of the coordinate system, some of these randomized intervals will probably be moved to the centromere or other areas of the genome where they would not appear experimentally. Thus, the statistical test will in this case assume that some intervals "by chance" will appear in centromeres, while in fact all of the ChIP-seq peaks in the track data files, by the very nature of the experimental and data processing methodology, will be avoiding the centromeres. In this case, the ChIP-seq files will in fact be co-localizing (outside of the centromeres) more than expected by chance (which here assumes that peaks might as well appear in the centromeres as elsewhere). The null model will be correctly rejected by the test; however, the results are of no use. The only thing we have tested is, simply put, whether the data files contain ChIP-seq peaks, which is something we knew in advance.
So this long story is just to provide the background for why the gaps in the assembly (which include the centromeres) are very important for this kind of analysis. Other restrictions on the randomization of the data should also be taken into account, but I will leave that out of this discussion and focus only on the assembly gaps. (Btw: @nsheff I believe one reason why we have differing views on the nature of coordinate systems is that your statistical methodologies, at least LOLA, are based on another way to restrict the null model, i.e. by having the user define a set of regions that define the universe for the analysis.) The conclusion is this: if I want to represent the coordinate system as a sequence collection (and I do), such sequence collections would include at least the `lengths` and the assembly `gaps`, as well as some identifier (`names`, or possibly `seqids` if we decide to support this).
This represents an additional level of complexity compared to the way we have previously defined a coordinate system. Given that we know the exact sequence collection used to generate each file of ChIP-seq peaks, if these are all different patches of the same version of the reference genome (say: hg38), and even if we ignore the actual sequence contents, the assembly gap information would still differ (the patches are indeed filling gaps) for the same sequence across the seqcols defined by the data files. However, if we were to also include the `sequence` array, the variation in seqcol content would be greater for each sequence across seqcols. To conclude: including only the `assembly_gaps` represents a middle ground between a full sequence collection (which includes the `sequence` array) and a coordinate system as we have defined it (without the `sequence` array).
Furthermore, the coordinate system assumed by the null model should really be defined as a "largest common denominator" of the coordinate systems implied by each of the files. One might, for instance, base the null model assumptions on a coordinate system with assembly gaps that represents a mean of the assembly gap information derived across all the files. Thus, the resulting coordinate system might be a "synthetic" one defined as a middle ground for analysis. In the same vein of thought, if the sequences differ in order across the per-file sequence collections, a "middle ground" coordinate system should have no order. Furthermore, if the names differ, the "middle ground" coordinate system should really not have any names, only name-invariant sequence identifiers.
Granted, there are ways of avoiding this complexity by e.g. specifying a canonical order or naming based on a specific sequence collection, but in that case the mathematically most correct solution will be unavailable due to limitations in the standard. Or, of course, one could decide to not follow the sequence collections standard at all. This will however limit adoption of the standard, in my case by one less developer.
In addition, specifying a particular sequence collection that defines canonical ordering or naming might have non-optimal consequences. For instance, in this domain, most BED-like files follow the non-standard UCSC-type names ('chr1' not '1'), even though the files themselves might have been generated using FASTA files based on standard naming. In such cases, one would probably, as a data processing step, carry out a string replacement operation to comply with the UCSC de facto standard. As a consequence, if a sequence collection identifier is attached to the file, it will end up out of sync with the actual names used in the file. Another alternative would be to define and use ad hoc "crossover" seqcols (where everything but the names stays the same), but these would probably not be registered in any seqcol repos. Lastly, if one wants to follow the sequence collection standard to the letter, one can opt not to carry out such naming conversions in the data processing steps. However, this will reintroduce the need for the analysis tools to convert the names to follow a more or less arbitrarily selected canonical naming scheme (e.g. UCSC style). Avoiding such complexities was, I believe, one of the reasons why the UCSC-style names instead became the de facto standard at the level of data generation. As a conclusion, I believe it will greatly increase adoption of the standard in this domain if we provide a simple canonical way to automatically carry out such conversions (either between naming schemes or towards machine-operable identifiers, or both). Similar arguments can be made (and have been made by me previously) against canonical ordering; however, I think the issues related to canonical naming are better at illustrating my concerns.
To summarize: I believe what my example mainly illustrates is that there will be many possible definitions of what is meant by a sequence in a sequence collection that are not covered by using the refget identifier to refer to a sequence. The advantage of the array-based solution is that it opens up for specifying whichever arrays are relevant in each context. However, placing limitations on ordering or naming will limit the landscape of potential applications that is opened up by the array solution. In my mind (and I'm sure you agree) we should weigh the benefits (as we perceive them) against the cost in functionality, complexity and/or time to overcome such limits in the standard itself. If the costs are very low, I believe we should by default move towards the most general standard.
Based on the feedback from the meeting, I believe the consensus is that the costs are still too high, and I agree, but they are in my mind only slightly too high. So, I have an idea for how to further lower the costs and provide even more benefits, but that will have to wait until my next post.
So here's the latest version of my idea:

First, here is a table-based representation of our current solution. In this example, I have added an `assembly_gaps` array with digests encoded according to some algorithm, as needed in the use case I described in my last post:
| | names | lengths | sequences | assembly_gaps |
|---|---|---|---|---|
| seqcol:74dd74 | d217f2 | 955d23 | 28b8b8 | ffdd43 |
| | chr1 | 150000000 | refget:a39a48 | gaps:d29f28 |
| | chr2 | 120000000 | refget:001f16 | gaps:99f651 |
| | alt123 | 100000 | refget:a8f104 | gaps:c5620c |
| | alt234 | 100000 | refget:36bf12 | gaps:c5620c |
So one thing that can easily be done is to convert the array digests into array identifiers; let's call them `arrayids`. The only thing that needs to be done is to include the array names when calculating the array digests, so that two arrays with the same contents but different array names will not get the same id:
| seqcol:f7a8c2 | arrayid:5d10c1 | arrayid:10f664 | arrayid:fd047f | arrayid:771cd3 |
|---|---|---|---|---|
| | names | lengths | sequences | assembly_gaps |
| | chr1 | 150000000 | refget:a39a48 | gaps:d29f28 |
| | chr2 | 120000000 | refget:001f16 | gaps:99f651 |
| | alt123 | 100000 | refget:a8f104 | gaps:c5620c |
| | alt234 | 100000 | refget:36bf12 | gaps:c5620c |
The `arrayids` will be globally unique, meaning that the same thing (array contents with a name) will always have the same id independently of which seqcol it is used in. In addition, the fact that the array names are unique within a seqcol (which is a technical requirement of our current solution) will make the `arrayids` also locally unique. If not, `arrayids` would have been globally, but not locally, unique, and as such could not be used to identify a particular array within a seqcol (there could have been duplicate `arrayids` in a seqcol).
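A minimal Python sketch of this `arrayid` scheme, using a hypothetical truncated SHA-256 digest (the concrete seqcol algorithm may differ):

```python
import hashlib

def arrayid(array_name: str, contents) -> str:
    # Digest of the array *name* plus its contents, so identical contents
    # under different names yield different ids (illustrative scheme).
    raw = "\x1f".join([array_name, *contents])
    return "arrayid:" + hashlib.sha256(raw.encode()).hexdigest()[:6]

lengths = ["100000", "100000"]
same_name_id  = arrayid("lengths", lengths)
other_name_id = arrayid("priorities", lengths)  # hypothetical other array name
```

Because the array name is folded in, `same_name_id` and `other_name_id` differ even though the contents are identical, while the same named array always gets the same id wherever it is used.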
So similarly, one can easily also define row-wise globally unique identifiers as a digest of all the values in a row, including array names. Let's call these identifiers `seqids`, as above, but this time these identifiers are defined at the same meta-level as the `arrayids`, thus outside the table itself.
| seqcol:f7a8c2 | arrayid:5d10c1 | arrayid:10f664 | arrayid:fd047f | arrayid:771cd3 |
|---|---|---|---|---|
| | names | lengths | sequences | assembly_gaps |
| seqid:64fa09 | chr1 | 150000000 | refget:a39a48 | gaps:d29f28 |
| seqid:fa0148 | chr2 | 120000000 | refget:001f16 | gaps:99f651 |
| seqid:3100f3 | alt123 | 100000 | refget:a8f104 | gaps:c5620c |
| seqid:590f6a | alt234 | 100000 | refget:36bf12 | gaps:c5620c |
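A sketch of the row-wise `seqid` digest, again with a hypothetical digest function; the property illustrated is global uniqueness, i.e. the same row content gets the same id in any collection:

```python
import hashlib

def seqid(row: dict) -> str:
    # Row-wise digest over (array name, value) pairs (illustrative scheme).
    raw = "\x1f".join(f"{k}={v}" for k, v in sorted(row.items()))
    return "seqid:" + hashlib.sha256(raw.encode()).hexdigest()[:6]

row_in_seqcol_a = {"names": "chr1", "lengths": "150000000",
                   "sequences": "refget:a39a48", "assembly_gaps": "gaps:d29f28"}
row_in_seqcol_b = dict(row_in_seqcol_a)  # the same row appearing in another seqcol
```

The same concept of a sequence gets the same `seqid` regardless of which collection, or which position within it, the row appears in.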
Since these identifiers are globally unique, they can be deterministically sorted and thus give rise to order-invariant sequence collections (with unique identifiers).
In order for the `seqids` to also function as locally unique identifiers, one would need some additional constraint to make sure that totally duplicated rows are not allowed. Here is the constraint I suggest we require:
So, in contrast to my last suggestion, where I suggested we require that there should be no duplications excluding the `names` array, this new requirement will not pose any constraint on the majority of use cases, because:

- the `names` array must be unique, if present.

Thus, adding a `names` array will provide the uniqueness that is required by the first rule. As several people have argued: most sequence collections will have unique names. In most cases these two rules would represent the same level of constraints as a rule that a unique `names` array is mandatory.

However, formulating the constraints with these two rules in addition allows for name-invariant sequence collections by just removing the `names` array, given that this will not introduce any totally duplicated rows. This is the case in the current example, where removing the `names` array will still not create any totally duplicated rows (due to the `sequences` array):
| seqcol:dd8a17 | arrayid:10f664 | arrayid:fd047f | arrayid:771cd3 |
|---|---|---|---|
| | lengths | sequences | assembly_gaps |
| seqid:3f76aa | 150000000 | refget:a39a48 | gaps:d29f28 |
| seqid:880c11 | 120000000 | refget:001f16 | gaps:99f651 |
| seqid:caa71d | 100000 | refget:a8f104 | gaps:c5620c |
| seqid:633382 | 100000 | refget:36bf12 | gaps:c5620c |
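The effect of dropping arrays can be sketched as follows (hypothetical digest scheme; the row values are taken from the tables above):

```python
import hashlib

def seqid(row: dict) -> str:
    # Row-wise digest over (array name, value) pairs (illustrative scheme).
    raw = "\x1f".join(f"{k}={v}" for k, v in sorted(row.items()))
    return "seqid:" + hashlib.sha256(raw.encode()).hexdigest()[:6]

rows = [
    {"names": "alt123", "lengths": "100000",
     "sequences": "refget:a8f104", "assembly_gaps": "gaps:c5620c"},
    {"names": "alt234", "lengths": "100000",
     "sequences": "refget:36bf12", "assembly_gaps": "gaps:c5620c"},
]

# Drop the names array: rows stay distinct thanks to the sequences array,
# so the name-invariant seqids remain locally unique.
nameless = [{k: v for k, v in row.items() if k != "names"} for row in rows]
nameless_ids = [seqid(row) for row in nameless]

# Dropping the sequences array as well would make the rows total duplicates.
coord_only = [{k: v for k, v in row.items() if k not in ("names", "sequences")}
              for row in rows]
```

This is exactly the situation discussed in Sidenote 1 below: once the rows are total duplicates, their seqids collide and local uniqueness is lost.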
So the take-home message is this: this idea allows for globally and locally unique as well as fully configurable sequence identifiers (where the configuration is based on which arrays are included in the seqcol) without any practical negative consequences for the large majority of use cases. It also allows for both `names`-invariant and order-invariant sequence collection identifiers, based on the rule that there should be no totally duplicated rows in a seqcol. The idea also allows for globally and locally unique array identifiers, which is probably also useful.
Sidenote 1: on total duplications for edge case scenarios (added here for logical completeness, can be skipped):
The way I see it, for a `seqid` to be globally unique means that the same concept of a sequence will have the same identifier, independent of the seqcol within which it is found, or its placement within that seqcol.
If we want to remove also the `sequences` array from the last table, the last two rows will become total duplicates. In this case, the user will need to figure out a way to add uniqueness which is not dependent on the placement of the row within the seqcol. Adding an array with values that point to an external global reference point would solve the issue, as would removing all but one of the duplicated rows.
Hence, there exists a possible third requirement:
Here, dependency is meant in the statistical sense. If there are external variables that explain the values, which is the case for most sequence names (e.g. 'chr1' or '1' is the largest chromosome in the human genome), then the requirement still holds.
The main consequence of this rule is that about the only thing one should not do is this: given a set of unnamed sequences with no apparent order, one should not use the arbitrary index of the sequences as the default name, as was suggested above. This will ruin the global uniqueness and hence also the possibility of order-invariant seqcol ids (sorting of `seqids` will no longer be deterministic).
The natural solution, given that one has a set of unordered sequences, is either to: 1) remove any duplicates and create a seqcol that is invariant of both `names` and order, or 2) create an external registry of names (or a FASTA file) and use those to define the `names` array. If this is done before creating the seqcol, then the duplicates are no longer "the same concept of a sequence" in the view of the seqcol standard.
(Edit: removed the possibility of canonical sorting, as this will not solve the problem. Total duplicates are per definition "the same concept of a sequence" and should always have the same identifier.)
However, as has been argued by others:
This is a very edge case scenario, and I don't believe it needs to enter into the standard itself. It is a problem for advanced users with use cases that are out of scope of the standard, and they will probably figure something out anyway.
Sidenotes on alternative `seqcol` digest algorithms that I do not propose (can definitely be skipped):
Sidenote 2a:
The suggested redefinition of `arrayids` to include array names has as a consequence that one could define the `seqcol` digest based only on the `arrayids` and not the array names. This would have the same degree of information content as the current definition. However, for convenience, I believe the seqcol digests should in any case also include the array names (as we are currently doing), even if these are already included as basis for the `arrayid` digests. The array names are in a sense redundant, but this redundancy also does not cost much and might be beneficial.
Sidenote 2b:
Given that `seqids` digests are now auto-calculated alongside the `arrayids`, one could possibly define a `seqcol` digest based only on the `seqids` instead of the `arrayids`. However, this would introduce two different digests/identifiers for the same entity (one based on arrays/columns, the other based on rows). I see no need for this; the current array-based digests should fit the bill.
I understand the use case above about statistics -- but, to me, this is solved by the `gaps` array, not by redefining a row-wise identifier.
I'm trying to condense all this down into what you're actually recommending, and I think it's this: You advocate for:
If we do it like that, this satisfies you, correct? Because it seems to me you can then build all the things you are talking about here on top of the seqcol implementation. So, the follow-up question is: are you sure there's no way for you to build what you describe under the constraint that names are mandatory?
Where you specified a need for nameless collections was here:
> Furthermore, if the names differ, the "middle ground" coordinate system should really not have any names, only name-invariant sequence identifiers. ... Granted, there are ways of avoiding this complexity by e.g. specifying a canonical order or naming based on a specific sequence collection, but in that case the mathematically most correct solution will be unavailable due to limitations in the standard
I believe in this case you could simply name the sequences according to the way you've defined `seqid`. Create your row-wise sequence representations, including the length and the gaps identifier, and digest those to create the row-wise identifier `seqid`. Now, just set the names of the sequences to that. If I understand correctly, this solves the problem and also gives you the "mathematically most correct solution". It is also much simpler than the approach of defining `seqid` rownames or `arrayid` colnames, etc.
So, correct me if I'm wrong, but I believe I have understood your use case and your argument, and I still think it is possible to solve it while maintaining the constraint that names be both unique and mandatory. I think that given the computational benefits and simplifications we gain downstream from assuming names to be unique and present, it is worth enforcing this constraint.
It strikes me that my proposal above is kind of a mix between what @daviesrob suggested and what @sveinugu suggested. Rob's suggestion was default names based on the refget digest, but this fails if you lack a sequences array, and also doesn't satisfy the uniqueness constraints that Sveinung needs. Sveinung's proposal generates unique row identifiers that satisfy his needs, but puts them into a new concept of `seqid` instead of into the `names` array.
So, just combine the two -- generate the `seqid`, but don't put it in a separate column; just put it in the `names` array. After all, we've defined them as a "unique, mandatory identifier for a sequence", and isn't that exactly what the `seqid` you propose is?
It strikes me that my proposal above is kind of a mix between what @daviesrob suggested and what @sveinugu suggested. Rob's suggestion was default names based on the refget digest, but this fails if you lack a sequences array, and also doesn't satisfy the uniqueness constraints that Sveinung needs. Sveinung's proposal generates unique row identifiers that satisfy his needs, but puts them into a new concept of `seqid` instead of into the `names` array. So, just combine the two -- generate the `seqid`, but don't put it in a separate column; just put it in the `names` array. After all, we've defined them as a "unique, mandatory identifier for a sequence", and isn't that exactly what the `seqid` you propose is?
First of all, my idea is no longer to put the `seqid` into a separate column, but to provide these in a higher-level view at the same level as the current array digests (just row-wise instead of column-wise). So the columns/arrays would stay the same as you envision them. However, instead of adding a restriction on the names column so that names can be used as unique local identifiers, I suggest making the standard even more general by placing the identifiers at the machine-interpretable level, which is where I believe they belong. Then we can keep the `names` array semantically clean as only providing human-readable content. So basically we will be saying that, in view of the standard, we don't care whether you name a sequence "chr1", "1", or don't provide a name at all. Instead we provide this mechanism for uniquely identifying a sequence in a machine-operable way, together with a mechanism for dynamically defining the scope of what you would like your identifiers to refer to (by adding/removing arrays). And if you really want to use the `names` as local identifiers, we will guarantee that they are unique if present.
Also, I might not have conveyed the full argument for my `seqid` suggestion: it is not just for my use case, but I think that, in general, having a way to uniquely identify a row in a sequence collection would be highly useful in many other scenarios (not least within the implementations themselves). One such possibility is to use the `seqid` for naming conversion. Say that you want to convert from a "chr1..."-based sequence collection to a "1..."-based one. Given that the arrays are otherwise the same, you could easily create name-invariant sequence collections for both (by removing the `names` array). Then the `seqids` of these two "middle ground" sequence collections would match and could be used to map sequences from one collection to the other, independent of order and naming.
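That naming-conversion idea can be sketched like this, assuming a hypothetical `seqid` function computed over the name-independent attributes (here just lengths and sequence digests, with toy placeholder values):

```python
import hashlib

def seqid(length, sequence_digest):
    # Name-invariant row digest; the scheme is illustrative only.
    return hashlib.sha256(f"{length}|{sequence_digest}".encode()).hexdigest()[:24]

# Two toy collections: identical sequences, different naming conventions.
ucsc = {"chr1": (1000, "d41d8..."), "chr2": (2000, "a3cca...")}
ensembl = {"1": (1000, "d41d8..."), "2": (2000, "a3cca...")}

# Index one collection by its name-invariant seqids, then translate
# names from the other collection through the shared seqids.
by_seqid = {seqid(*attrs): name for name, attrs in ensembl.items()}
name_map = {name: by_seqid[seqid(*attrs)] for name, attrs in ucsc.items()}
# name_map maps "chr1" -> "1", "chr2" -> "2"
```

The mapping works regardless of the order in which either collection lists its sequences, which is the point of making the identifier name- and order-invariant.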
So, following your arguments, you could say that in that case one could just define a new `names` column for these middle-ground sequence collections and it would technically work. However, I feel we would then pass up an opportunity to make the sequence collections generally useful to a range of use cases that we do not yet have a full overview of (but which I am sure we can brainstorm some examples of). What we would leave out is then:
- a globally unique, machine-operable way to identify a sequence with its annotations (the `names` identifiers will only be locally unique).
- name-invariant identification of sequences (by leaving the `names` array out).

I believe what is needed for the implementations to support this is simply to add an additional endpoint variant that provides the top-level view in a row-wise fashion with `seqids`, as an alternative to the current column-wise one. API-wise, this can be done by simply adding a parameter `row_wise` (or similar) to the top-level endpoint. The rest would be the same.
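A minimal sketch of what such a `row_wise` parameter could do, deriving both views from the same underlying arrays. The `digest` stand-in, the delimiter choices, and the attribute ordering are assumptions of this sketch, not the spec's algorithm:

```python
import hashlib

def digest(s: str) -> str:
    # Placeholder digest; the actual spec algorithm is not reproduced here.
    return hashlib.sha256(s.encode()).hexdigest()[:24]

# Underlying level 2 arrays (toy values).
arrays = {
    "names": ["chr1", "chr2"],
    "lengths": ["1000", "2000"],
    "sequences": ["seqdigest1", "seqdigest2"],
}

def top_level(row_wise: bool = False) -> dict:
    """Hypothetical top-level endpoint honoring a `row_wise` parameter:
    column-wise array digests by default, row-wise seqids otherwise."""
    if not row_wise:
        return {name: digest(",".join(vals)) for name, vals in arrays.items()}
    # Transpose the arrays into rows and digest each row.
    rows = zip(*(arrays[k] for k in sorted(arrays)))
    return {"seqids": [digest("|".join(row)) for row in rows]}
```

Everything below the endpoint layer stays the same; only the shape of the returned view changes.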
I also might be misunderstanding the proposal here, but I feel to a point there is a circular argument developing here, since we come back to names and lengths being attributes of sequences which could be used as an identifier. In addition we now have gaps. As a genome provider, I don't believe we actually know where those gaps are. This may be back in the original AGPs, but as we move more towards T2T-based assemblies, these gaps are not going to be a unique identifier for elements within the collection.
Perhaps the bit I cannot quite fathom is:
As a user of a genome/sequence collection I can't control any of the above three sources of differences. Also, I don't see how any system eliminates this when it is partly based on those major sources of differences. I also admit this might be me just not quite getting the thrust of the case for `seqids` or how they're being generated.
there is a circular argument developing here since we come back to names and lengths being attributes of sequences which could be used as an identifier.
Right... correct me if I'm wrong, but the `seqid` concept is in fact identical to the original idea that I called "Annotated Sequence Digests", which is actually how I originally implemented sequence collections. The term "annotated sequence digest" is meant to convey the difference from an original refget sequence digest, which digests the sequence only, while an Annotated Sequence Digest (ASD) digests the sequence with its annotations (other metadata columns).
Here's the ADR for where and why we decided not to do it that way: https://github.com/ga4gh/seqcol-spec/pull/14/files
I think @sveinugu is proposing to include both types of digests in the sequence collection, which is a bit different from the original discussion, where we were deciding on one or the other. It would basically combine the old and new ways of computing the digests. So, a level 1 seqcol in total would look something like this (`ASDs` == `seqids`):
```
{
  "lengths": "4925cdbd780a71e332d13145141863c1",
  "names": "ce04be1226e56f48da55b6c130d45b94",
  "sequences": "3b379221b4d6ea26da26cec571e5911c",
  "ASDs": ["ca1cd99...", "22c090c...", "426d38c..."]
}
```
As Sveinung proposed, these could be served via different endpoints. Some thoughts:

- The ASDs included like this will have the backwards compatibility problems that we were trying to avoid -- but only for the ASD entries; the array digests stay clean. Thus, this will only cause problems for analyses that require the ASDs. What this means is: if you add another array, all your ASDs change, but the array digests don't.
- Including the ASDs like this adds no value to the uniqueness of the level 0 digest. In fact, including ASDs at this level will break the current algorithm because they are not singletons, but arrays, which is currently not allowed at this level. This brings up the point that these digests could be computed and stored, but not used to determine the level 0 digest. I would suggest that, if we include ASDs at level 1, we not use them in the calculation of the level 0 digest. So, `ASDs`/`seqids` will need to become a special name at this level that is treated differently from the others.
- Including ASDs will break the structure that each level 1 element is a singleton, representing an array of the same length. This will break the nice property that the level 1 object size is independent of the size of the collection. The 2-endpoints idea solves this: either making them mutually exclusive, as Sveinung suggested, or having an endpoint that returns the level 1 collection with ASDs and another that returns it without ASDs.
- I believe most applications will not use these ASDs. But it may not be too difficult to include them as an option for use cases that would use them.
- I don't believe there would be any problem adding these ASDs on top of a basic sequence collections implementation for a use case that required them, provided we somehow allow for non-digest-affecting collection elements. But this could all be handled outside the seqcol spec as an implementation detail, I think.

I think the big-picture point here is that we have to distinguish between these two things:

- what elements and structure are used to compute the digests.
- what elements and structure are available for retrieval via an endpoint.
These are not the same thing. For example, we once discussed removing the length from the digest calculation, because it doesn't confer any additional identifiability, given the sequences. We decided it makes more sense to keep it in, for consistency, because the sequence array is not always present. Even if we had decided to remove it from the digest calculation, the server could still store and return those values. In this case, we are facing a similar situation. From the perspective of the digest calculation, the ASDs add no uniqueness. The only value they could add is as a return value from the server, a reshaping of existing data. There are probably lots of other reshapings of existing data that could be useful for various specific downstream analyses. Should we build all of them into the spec? Or should we just make it possible to enact all of these things easily on top of the spec, for the use cases that would require them?
What would we need to do to enable these ASDs within the current spec?
Allow a server to return additional layer 1 elements that are not included in the digest algorithm. Actually, you could already do this.
The cost of making this a part of the specification would be:

- required implementation of the ASD endpoint (or changing the seqcol endpoint to include ASDs)
- required duplicate digest computation, since we have to digest them in both ways
- required new tooling to differentiate non-digested vs. digested elements stored with a seqcol
- required specification for what property names are reserved for non-digested elements, or some way to specify that
I do think this is useful. I do not know if it's worth making this a universal requirement. I also do not think requiring the names array will break the ability to do the ASDs. I think the implementation of ASDs can simply generate numeric indexes to satisfy the requirement for the underlying sequence collection, and then simply ignore the names array for ASD digests.
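The numeric-index idea from the last paragraph can be sketched as follows; the `digest` stand-in and the row serialization are illustrative assumptions only:

```python
import hashlib

def digest(s: str) -> str:
    # Placeholder digest for illustration.
    return hashlib.sha256(s.encode()).hexdigest()[:24]

# A collection with no natural names: rows of (length, sequence digest).
rows = [("1000", "seqdigest1"), ("2000", "seqdigest2")]

# Generate numeric indexes to satisfy a mandatory-names constraint...
names = [str(i) for i in range(len(rows))]

# ...but compute the ASDs from the name-independent attributes only,
# so they stay comparable across collections regardless of naming.
asds = [digest("|".join(row)) for row in rows]
```

The `names` array satisfies the underlying collection's uniqueness requirement, while the ASD digests ignore it entirely.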
I also might be misunderstanding the proposal here, but I feel to a point there is a circular argument developing here, since we come back to names and lengths being attributes of sequences which could be used as an identifier. In addition we now have gaps. As a genome provider, I don't believe we actually know where those gaps are. This may be back in the original AGPs, but as we move more towards T2T-based assemblies, these gaps are not going to be a unique identifier for elements within the collection.
So I am not really competent when it comes to the technology, processes, and file/data formats related to assemblies. My probably naïve view of assembly gaps is, in its simplest form, the `N` characters of the sequence. So for simplicity, in my use case I would say that two coordinate systems (and sequences) are the same if they have the same `N`s, without regard to the other characters.
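Under that simple view, gap equality could be checked by digesting only the coordinates of the `N` runs. This is a minimal sketch of that idea; the run encoding and hash are assumptions, not anything the thread or spec defines:

```python
import hashlib
import re

def gap_runs(seq: str):
    """Return (start, length) of each run of 'N' characters: the
    simplest possible view of assembly gaps."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"N+", seq.upper())]

def gaps_digest(seq: str) -> str:
    # Digest only the N-run coordinates, so two sequences compare equal
    # iff their gaps coincide, regardless of the other characters.
    return hashlib.sha256(repr(gap_runs(seq)).encode()).hexdigest()[:24]

a = "ACGTNNNNACGTACNNGT"
b = "TTTTNNNNGGGGCCNNAA"  # different bases, same N positions
```

Here `a` and `b` get the same gaps digest even though every non-N base differs.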
Right... correct me if I'm wrong, but the `seqid` concept is in fact identical to the original idea that I called "Annotated Sequence Digests", which is actually how I originally implemented sequence collections. The term "annotated sequence digest" is meant to convey the difference from an original refget sequence digest, which digests the sequence only, while an Annotated Sequence Digest (ASD) digests the sequence with its annotations (other metadata columns).
Yes, I believe you are right. I was not part of those discussions, but I came to the same idea through a bunch of trial and error trying to provide solutions for some limitations of the array-based variant, namely the ability to provide order
-invariant and now also name
-invariant sequence collections. I also like the name “Annotated Sequence Digests” (which was new to me now).
Here's the ADR for where and why we decided not to do it that way: https://github.com/ga4gh/seqcol-spec/pull/14/files
I think @sveinugu is proposing to include both types of digests in the sequence collection, which is a bit different from the original discussion, where we were deciding on one or the other. It would basically combine the old and new ways of computing the digests. So, a level 1 seqcol in total would look something like this (`ASDs` == `seqids`):

```
{
  "lengths": "4925cdbd780a71e332d13145141863c1",
  "names": "ce04be1226e56f48da55b6c130d45b94",
  "sequences": "3b379221b4d6ea26da26cec571e5911c",
  "ASDs": ["ca1cd99...", "22c090c...", "426d38c..."]
}
```
So my proposal would be to add an additional, independent object which is just the array of the ASDs, and which takes no part in the existing structure.
As Sveinung proposed, these could be served via different endpoints. Some thoughts:
- The ASDs included like this will have the backwards compatibility problems that we were trying to avoid -- but only for the ASD entries, the array digests stay clean. Thus, this will only cause problems for analyses that require the ASDs. What this means is, if you add another array, all your ASDs change, but the array digests don't.
I agree, but the way I think of it is that this is really an advantage. Adding an array really redefines the conceptual model of what you consider to be an Annotated Sequence, and thus its digest should change correspondingly. Thinking of the digest as a globally unique identifier: if the concept of the sequence that you refer to changes, the identifier should also change to reflect that. For instance, removing the `names` array will generate a concept of the sequence which is dependent only on its functional characteristics, and the ASDs of such a collection will be machine-operable identifiers that can function as a replacement for the human-readable, but error-prone, "chr1, chr2, ..." or "1, 2, ..." names.
- Including the ASDs like this adds no value to the uniqueness of the level 0 digest. In fact, including ASDs at this level will break the current algorithm because they are not singletons, but arrays, which is currently not allowed at this level. This brings up the point that these digests could be computed and stored, but not used to determine the level 0 digest. I would suggest that, if we include ASDs at level 1, we not use them in the calculation of the level 0 digest. So, `ASDs`/`seqids` will need to become a special name at this level that is treated differently from the others.
Or not include them in this structure at all, but instead provide an alternative and independent object/endpoint.
- Including ASDs will break the structure that each level 1 element is a singleton, representing an array of the same length. This will break the nice property that the level 1 object size is independent of the size of the collection. The 2-endpoints idea solves this: either making them mutually exclusive, as Sveinung suggested, or having an endpoint that returns the level 1 collection with ASDs and another that returns it without ASDs.
- I believe most applications will not use these ASDs. But, it may not be too difficult to include them as an option for use cases that would use them.
I agree.
- I don't believe there would be any problem adding these ASDs on top of a basic sequence collections implementation for a use case that required them, provided we somehow allow for non-digest-affecting collection elements. But, this could all be handled outside the seqcol spec as an implementation detail, I think.
I think the big-picture point here is that we have to distinguish between these two things:
- what elements and structure is used to compute the digests.
- what elements and structure is available for retrieval via an endpoint.
I agree. My idea is to add the ASDs as an endpoint output only (2). But a mandatory one. The reason for this requirement is that, if we assume most reference genome providers will adopt the sequence collection standard, then having the ASDs also available from all of them will open up use cases to researchers and developers that will not be available otherwise, even if one designs one's own functionality on top of the sequence collections standard (unless one manages to convince all of them to implement the same additional functionality). So for me, this is a pragmatic issue more than a theoretical one. One can easily implement search/indexing tools externally, but only if these ASD identifiers are available from the databases. Otherwise, one would need to replicate the full database contents in order to calculate these digests.
(snip) There are probably lots of other reshapings of existing data that could be useful for various specific downstream analyses. Should we build all of them into the spec? Or should we just make it possible to enact all of these things easily on top of the spec, for the use cases that would require them?
I believe the row-wise digests are in a special category and not comparable to “lots of other reshapings”. They are the main logical alternative to the array-based digests.
What would we need to do to enable these ASDs within the current spec?
Allow a server to return additional layer 1 elements that are not included in the digest algorithm. Actually, you could already do this.
The cost of making this a part of the specification would be:
- required implementation of the ASD endpoint (or changing the seqcol endpoint to include ASDs)
- required duplicate digest computation, since we have to digest them in both ways
I agree.
- required new tooling to differentiate non-digested vs digested elements stored with a seqcol
- required specification for what property names are reserved for non-digested elements, or some way to specify that.
I still don’t get this part.
I do think this is useful. I do not know if it's worth making this a universal requirement. I also do not think requiring the names array will break the ability to do the ASDs. I think the implementation of ASDs can simply generate numeric indexes to satisfy the requirement for the underlying sequence collection, and then simply ignore the names array for ASD digests.
Using numeric indexes as ‘names’ makes the ASD dependent on the order of the sequence within the collection, which breaks the possibility of using this to identify equality of Annotated Sequences between collections. Instead, I propose that `names` are not mandatory, but that the ASDs are, so one could always use ASDs to uniquely identify a row within a sequence collection.
EDIT: I misread you. I see now that you explicitly suggest ignoring the `names` array when generating the ASDs. I believe there is value in both including and excluding the names array when generating ASDs, which is why I suggest that ASDs are always created from all arrays, but that the sequence collection standard also allows the removal of the `names` array altogether.
Defined now in: https://github.com/ga4gh/seqcol-spec/issues/50
Now that we have the `inherent` attribute, we have defined, on a per-schema basis, which attributes should be included in the string to digest.
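One way to read the `inherent` mechanism is as a filter applied before digesting. A minimal sketch, assuming the GA4GH sha512t24u digest and compact, sorted-key JSON as the canonical form (the exact canonicalization, and which attributes carry which flags, are defined by the spec, not here; the schema below is a toy):

```python
import base64
import hashlib
import json

def sha512t24u(data: bytes) -> str:
    """GA4GH sha512t24u digest: base64url of the first 24 bytes of SHA-512."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode()

# Toy per-attribute `inherent` flags marking what enters the digest.
schema_inherent = {"names": True, "lengths": True, "sequences": True,
                   "sorted_name_length_pairs": False}

def string_to_digest(seqcol: dict) -> str:
    """Keep only inherent attributes, canonicalize as compact JSON with
    sorted keys, and digest the result."""
    inherent = {k: v for k, v in seqcol.items() if schema_inherent.get(k)}
    canonical = json.dumps(inherent, separators=(",", ":"), sort_keys=True)
    return sha512t24u(canonical.encode())
```

Under this scheme, adding a non-inherent attribute to a collection leaves its digest unchanged, which is the property the discussion above wants for non-digested, retrieval-only elements.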
We need to confirm the final contents of the string to digest. Right now, I think we had settled on these elements:

contents:

- names (e.g. `chr1`)

constraints:
Related to this, I initially described this with a JSON Schema that I use in my implementation, but this is a bit more complicated because it encodes the recursion (the fact that the sequence digest element is itself a top-level refgettable element). But, for reference, if anyone's interested, it's here:
https://github.com/refgenie/seqcol/blob/master/seqcol/schemas/AnnotatedSequenceList.yaml
Open for discussion.