ga4gh / refget

GA4GH Refget specifications docs
https://ga4gh.github.io/refget

Sequence collection, ordered? or unordered? #5

Closed yfarjoun closed 9 months ago

yfarjoun commented 4 years ago

According to the current definition, a "Sequence Collection" is "a set of reference sequences". However, I believe that in order to be useful, this should be changed to an "ordered set of reference sequences", since a change in the order of the collection has implications downstream (for example, a sorted BAM file will not be considered valid if one changes the order of its sequence dictionary). More generally, without a specified order, it will be up to the implementations to decide how to serialize any data that uses a collection.

yfarjoun commented 4 years ago

Reading the algorithm, it's clear that the intention is that the set is in fact ordered, according to step 2 in the algorithm:

Step 2. Concatenate the computed annotated sequence digests, in the order provided in the collection, delimited by ITEM_DELIMITER. (my emphasis)

So it should not be controversial to add the word "ordered" to the definition of "Sequence collection":

Sequence collection: A set of sequences.

Should perhaps become:

Sequence collection: An ordered set of sequences, i.e., a non-repeating list of sequences, where sequence identity is defined by their digest and length, not name.
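For illustration, here is a minimal sketch (not the normative algorithm) of why the Step 2 digest above is order-sensitive; the hash function, truncation, and delimiter used here are placeholders:

import hashlib

ITEM_DELIMITER = ","  # placeholder; the spec defines the actual delimiter

def collection_digest(seq_digests):
    # Concatenate per-sequence digests in the given order, then hash
    joined = ITEM_DELIMITER.join(seq_digests)
    return hashlib.sha512(joined.encode()).hexdigest()[:12]

a = collection_digest(["acb341", "c23ba4", "2314af"])
b = collection_digest(["c23ba4", "acb341", "2314af"])  # same set, different order
assert a != b  # reordering the collection changes the digest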

nsheff commented 4 years ago

Yes, you are right. Thanks, I've made the first two changes.

However, I'm not sure about your last point -- in fact can a sequence collection have a repeated sequence with a different name? So perhaps we should be using the word 'list' instead of 'set' altogether.

Also, I am not sure what you mean by sequence identity is defined by their digest and length -- I would have said that sequence identity is defined by the digest alone.

yfarjoun commented 4 years ago

hmmm. I guess there are two ways to look at this:

I have an ordered set of named sequences, and you have a second ordered set of named sequences; the sequences are the same, but we are using different names (think hg19 versus b37).

1) I would like to know that we are using the same reference collection, but with a different naming scheme.
2) I would like to be able to get the actual collection that you use, including the names, since I want to agree with you on the naming convention.

In case 1), the refcol digest should only include the ordered digests of the sequences themselves, and in case 2) it should also include the names of the sequences.

I agree that the length of the sequence seems to be irrelevant, but I kept it since the original algorithm has seq-digest, length & name, and I just wanted to drop the name part.

nsheff commented 4 years ago

I see -- this is going in the direction of compatibility tests.

But case 1) can also be solved by a digest that includes the names of the sequences, and then just ignoring the names.

I just posted another issue for discussion about how to actually encode these compatibility tests, and you can see how I was envisioning it: https://github.com/ga4gh/seqcol-spec/issues/7

There would only be 1 digest, which would be looked up, and then there's a function that operates on that digest to yield the answers to the questions you pose (and more).

So, I think there's nothing in the system preventing you from creating a collection that includes 2 of the same sequence with different names. It would all work.

nsheff commented 4 years ago

I think I see what you're saying -- you're suggesting that there actually be stored multiple digests for a sequence collection in order to answer different compatibility questions. I'd like to hear your feedback on #7, where I proposed a way to be able to answer all the same questions, but without having to rely on multiple digests.

sveinugu commented 3 years ago

Just realized that my posts about ordering in issue https://github.com/ga4gh/seqcol-spec/issues/7#issuecomment-844228761 should have been posted in the current issue.

sveinugu commented 3 years ago

Here is an attempt to write up the different solutions to the ordering issue that have been discussed in the latest meetings. First, I would like to try to define exactly what are the core issues we are discussing.

1a. Should a sequence collection digest refer to an ordered list of sequences or not?

As changing the order of the sequences matters in some scenarios (which? read mapping?), it seems we agree that we need to support digests for ordered lists of sequences. Also, it seems that such a digest would form the canonical digest for e.g. a particular reference genome. However, if we have not already done so, I believe we should document the arguments for this clearly.

1b. The question could then be reformulated: Should we support digests for unordered lists of sequences in addition to the ordered ones? And if so, at which layer should we support such digests? (referring here to the layers suggested in https://github.com/ga4gh/seqcol-spec/issues/10)

Should digests of unordered arrays be supported on layer 0 only? A use case for this is to easily compare two seqcols that have equal content except for the order of the sequences, as exemplified in e.g. https://github.com/ga4gh/seqcol-spec/issues/7#issuecomment-844228761.

Furthermore, should we in addition support a top-level digest for unordered arrays? This would e.g. allow the calculation of a digest that represents the concept of a coordinate system (names and lengths) where the sequences are conceptually unordered. I believe this would be the mathematical concept that most precisely captures the "genome assembly" as used in e.g. the UCSC Genome Browser. As discussed in relation to the new GA4GH specification of the BED format, the chrom (chromosome name) field "may be sorted using any scheme (such as lexicographic or numeric order), but all data lines with the same chrom value should occur consecutively". In particular, the UCSC Genome Browser team recommends lexicographic sorting of chromosomes for generating bigBed files. This has several consequences:

i. A UCSC-ordered "chromlength" seqcol digest will only match ordered "official" reference genome digests if the latter are lexicographically sorted.

ii. As a BED file can follow any ordering of chromosomes and since the UCSC Genome Browser accepts BED files, an identifier (seqcol digest) to be used instead of the current "genome assembly" names should really conceptually represent an unordered digest (as mentioned above).

iii. Even if one has a chromsizes file containing the lengths and names of the sequences, there is in practice no guarantee that this file is ordered in a canonical way, even if the "canon" is the lexicographic sorting of UCSC.

All the above are arguments for including unordered digests at both layer 0 and the top level.

nsheff commented 3 years ago

it seems we agree that we need to support digests for ordered lists of sequences. Also, it seems that such a digest would form the canonical digest for e.g. a particular reference genome. However, if we have not already done so, I believe we should document the arguments for this clearly.

See #17

nsheff commented 3 years ago

How to encode sequence order: ordered inputs vs order arrays

We previously established that the sequence collection digest will reflect sequence order (See ADR in #17).

There are 3 proposals for how to accommodate this:

  1. digest arrays in given order
  2. define a canonical order for input in general, and add a separate array for order, which indexes into the arrays
  3. order each input array individually, and add a separate order array for each input array

Option 1 amounts to no support for unordered collections, as mentioned by @sveinugu above. Options 2 and 3 would provide support at layer 1 (if I've understood the layers correctly).

So, the advantages of an order array are:

  1. It provides support for digests for unordered collections.
  2. it may make it easier to ask the inverse question: 'given a set of sequences, find all collections that include this' (but only for identity) -- because you don't need to go down so many layers.

But a disadvantage of the option 2 canonical order array is that it adds some complexity, specifically in determining the canonical order. This requires setting new restrictions on attributes (like names must be unique), or a complicated series of tiebreaker tests (if the name is the same, then go to topology, and so on), which becomes problematic with the possibility of custom arrays. Well summarized by RD: "each column we can order easily, but combinations of columns make it more difficult."

This leads to option 3: we simply order each included array lexicographically, and then include an order_ATTRIBUTE array for each array in the collection, which indexes into that attribute. This provides the benefits of the order array without the drawbacks of having to define a canonical order. A disadvantage is that you have to include an order index array for every other array, which seems to add complexity and space.
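A rough sketch of option 3, assuming lexicographic sorting of each attribute array and one possible indexing convention (order_ATTRIBUTE[j] gives the original collection position of the j-th element of the sorted array); the function and variable names here are illustrative only:

def with_order_array(values):
    # Canonical per-array sort, plus an index array pointing back to the
    # original collection order
    order = sorted(range(len(values)), key=lambda i: values[i])
    sorted_vals = [values[i] for i in order]
    return sorted_vals, order

names = ['chr2', 'chr10', 'chr1']                  # collection order
names_sorted, names_order = with_order_array(names)
# names_sorted == ['chr1', 'chr10', 'chr2'], names_order == [2, 1, 0]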

sveinugu commented 3 years ago

2. How should we represent the order in the digest algorithm?

Three answers have been suggested:

A. We only support ordered digests at all layers, i.e. all arrays are ordered. Queries on unordered arrays can instead be implemented in the comparison API.

B. All arrays are unordered in themselves, and in addition we add the order array.

In order for the digest algorithm to be able to uniquely determine the digests of the unordered arrays, they need to be sorted according to a canonical sorting scheme. In addition, an order array is added to translate from the canonical ordering to the order represented by the sequence collection. For the order array to make sense, there is the additional requirement that the canonical sorting scheme fully defines the global index across all arrays, also supporting the addition of any custom arrays.

Arguments for this solution were previously provided here: https://github.com/ga4gh/seqcol-spec/issues/7#issuecomment-844228761 and https://github.com/ga4gh/seqcol-spec/issues/7#issuecomment-853109393. This suggestion has been debated at length, with the main difficulty related to the details of how to specify the canonical sorting. It has been agreed that sorting on descending length seems to be the natural choice, but problems arise in the edge cases where several sequences are of the same length. One would then need to use either names or sequence digests (or both) as tie-breakers, but this would only guarantee a fully specified order if either duplicate sequences are not allowed, or we require that the names array is mandatory and contains unique values (or, logically, that at least one of these holds). Having to add such restrictions in order to solve 1b seems inelegant and might cause issues downstream. The alternative is to use all arrays for tie-breaking in some fashion. However, adding an extra array would then potentially cause the global indexing to change, and consequently cause the order array and digest to change, which seems impractical. Indeed, the main argument against B can be distilled to the following:

The definition of canonical order would introduce a dependency between the unordered array digests, even if this is limited to only the four main ones: names, lengths, sequences, and order.
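To make that coupling concrete, here is a hedged sketch of option B's canonical ordering, assuming descending length with name as tie-breaker (so names must be unique for the order to be fully determined); all values and names are illustrative:

def canonical_order(lengths, names):
    # Global canonical index order: by length (descending), then by name
    return sorted(range(len(lengths)), key=lambda i: (-lengths[i], names[i]))

lengths = [23456, 12345, 456, 456]
names = ['1', '2', 'alt_12345', 'unplaced_23456']
idx = canonical_order(lengths, names)   # [0, 1, 2, 3]; the length tie is broken by name
# Renaming a sequence that is involved in a length tie reorders every array,
# which is the dependency described above.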

C. All arrays are maintained as both ordered and unordered lists/digests independently of each other

This could be done by maintaining two versions of all arrays, e.g.:

names_unordered = ['chr1', 'chr10', 'chr11', ..., 'chr2', 'chr21', 'chr22', 'chr3', ..., 'chr9', 'chrM', 'chrX', 'chrY']
names_ordered = ['chr1', 'chr2', 'chr3', ... 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']

Or equally by adding an order array for each array:

names = ['chr1', 'chr10', 'chr11', ..., 'chr2', 'chr21', 'chr22', 'chr3', ..., 'chr9', 'chrM', 'chrX', 'chrY']
names_order = [0, 9, 10, ..., 1, 20, 21, 2, ...,8, 24, 22, 23]

In either case, the unordered arrays could then be canonically sorted lexicographically (or numerically in the case of `lengths`) independently of the other arrays. Also, one could calculate digests for each of them in order to more easily find whether two seqcols differ in order and/or content, now also on a per-array basis, e.g.:

Given the current solution, we define two seqcols, X and Y.

X:
https//seqcolserver.org/digest/3bacd6/1:
{
    'lengths': [34567, 23456, 12345, 456],
    'names': ['chr1', 'chr2', 'chr3', 'alt_12345'],
    'sequences': ['acb341', 'c23ba4', '2314af', '32d2bd'],
    'topology': ['linear', 'linear', 'linear', 'linear']
}

Y:
https//seqcolserver.org/digest/746ac2/1:
{
    'lengths': [456, 34567, 23456, 12345],
    'names': ['alt_12345', 'chr1', 'chr2', 'chr3'],
    'sequences': ['32d2bd', 'acb341', 'c23ba4', '2314af'],
    'topology': ['linear', 'linear', 'linear', 'linear']
}

Here, the contents of X and Y are the same, but the order of the sequences differs. Currently, the 0-layer digests will not match (except for topology):

X:
https//seqcolserver.org/digest/3bacd6/0:
{
    'lengths': '36cd24',
    'names': 'fd1567',
    'sequences': '53f7cb',
    'topology': 'd8a349'
}

Y:
https//seqcolserver.org/digest/746ac2/0:
{
    'lengths': '68bdd5',
    'names': '170a01',
    'sequences': '34f9db',
    'topology': 'd8a349'
}

With solution C, the seqcols could be implemented as follows:

X:
https//seqcolserver.org/digest/3bacd6/1:
{
    'lengths_ordered': [34567, 23456, 12345, 456],
    'lengths_unordered': [34567, 23456, 12345, 456],
    'names_ordered': ['chr1', 'chr2', 'chr3', 'alt_12345'],
    'names_unordered': ['alt_12345', 'chr1', 'chr2', 'chr3'],
    'sequences_ordered': ['acb341', 'c23ba4', '2314af', '32d2bd'],
    'sequences_unordered': ['2314af', '32d2bd', 'acb341', 'c23ba4'],
    'topology_ordered': ['linear', 'linear', 'linear', 'linear'],
    'topology_unordered': ['linear', 'linear', 'linear', 'linear']
}

Y:
https//seqcolserver.org/digest/746ac2/1:
{
    'lengths_ordered': [456, 34567, 23456, 12345],
    'lengths_unordered': [34567, 23456, 12345, 456],
    'names_ordered': ['alt_12345', 'chr1', 'chr2', 'chr3'],
    'names_unordered': ['alt_12345', 'chr1', 'chr2', 'chr3'],
    'sequences_ordered': ['32d2bd', 'acb341', 'c23ba4', '2314af'],
    'sequences_unordered': ['2314af', '32d2bd', 'acb341', 'c23ba4'],
    'topology_ordered': ['linear', 'linear', 'linear', 'linear'],
    'topology_unordered': ['linear', 'linear', 'linear', 'linear']
}

Which will provide a much more informative layer 0:

X:
https//seqcolserver.org/digest/3bacd6/0:
{
    'lengths_ordered': '36cd24',
    'lengths_unordered': '93f21d',
    'names_ordered': 'fd1567',
    'names_unordered': 'dd139a',
    'sequences_ordered': '53f7cb',
    'sequences_unordered': '82df11',
    'topology_ordered': 'd8a349',
    'topology_unordered': 'd8a349'
}

Y:
https//seqcolserver.org/digest/746ac2/0:
{
    'lengths_ordered': '68bdd5',
    'lengths_unordered': '93f21d',
    'names_ordered': '170a01',
    'names_unordered': 'dd139a',
    'sequences_ordered': '34f9db',
    'sequences_unordered': '82df11',
    'topology_ordered': 'd8a349',
    'topology_unordered': 'd8a349'
}

A main argument against C is that doubling the number of arrays makes the solution more complex. Also, it does not directly solve the problem of providing unordered top-level digests (in 1b), as it only provides array digests (layer 0). However, one can easily calculate such a top-level digest by digesting only the unordered arrays.
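As a hedged sketch of that last point, an order-invariant top-level digest under option C could be derived from the *_unordered array digests alone; the hash, truncation, and joining scheme below are placeholders, not the spec's:

import hashlib

def digest(s):
    return hashlib.sha512(s.encode()).hexdigest()[:6]

def top_level_unordered(level0):
    # Combine only the *_unordered array digests, with keys sorted
    parts = [k + ',' + v for k, v in sorted(level0.items()) if k.endswith('_unordered')]
    return digest(';'.join(parts))

x = {'lengths_ordered': '36cd24', 'lengths_unordered': '93f21d',
     'names_ordered': 'fd1567', 'names_unordered': 'dd139a',
     'sequences_ordered': '53f7cb', 'sequences_unordered': '82df11'}
y = {'lengths_ordered': '68bdd5', 'lengths_unordered': '93f21d',
     'names_ordered': '170a01', 'names_unordered': 'dd139a',
     'sequences_ordered': '34f9db', 'sequences_unordered': '82df11'}
assert top_level_unordered(x) == top_level_unordered(y)  # X and Y match once order is ignored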

sveinugu commented 3 years ago

I mentioned in the last meeting that I had an idea for a variant of C which seems more elegant and useful. The idea is to add an intermediate level of organisation to denote, respectively, the ordered and the unordered versions of the same seqcol. Here, the ordered arrays maintain indexing of elements across the arrays, while the unordered arrays are sorted individually (lexicographically, or numerically [descending]). We can probably make this difference more directly understandable somehow, e.g. by renaming the sections. In the context of this discussion, I think ordered and unordered work well enough.

X:
https//seqcolserver.org/digest/3bacd6/1: {
    'ordered': {
        'lengths': [34567, 23456, 12345, 456],
        'names': ['chr1', 'chr2', 'chr3', 'alt_12345'],
        'sequences': ['acb341', 'c23ba4', '2314af', '32d2bd'],
        'topologies': ['linear', 'linear', 'linear', 'linear']
    },
    'unordered': {
        'lengths': [34567, 23456, 12345, 456],
        'names': ['alt_12345', 'chr1', 'chr2', 'chr3'],
        'sequences': ['2314af', '32d2bd', 'acb341', 'c23ba4'],
        'topologies': ['linear', 'linear', 'linear', 'linear']
    }
}

Y:
https//seqcolserver.org/digest/746ac2/1: {
    'ordered': {
        'lengths': [456, 34567, 23456, 12345],
        'names': ['alt_12345', 'chr1', 'chr2', 'chr3'],
        'sequences': ['32d2bd', 'acb341', 'c23ba4', '2314af'],
        'topologies': ['linear', 'linear', 'linear', 'linear']
    },
    'unordered': {
        'lengths': [34567, 23456, 12345, 456],
        'names': ['alt_12345', 'chr1', 'chr2', 'chr3'],
        'sequences': ['2314af', '32d2bd', 'acb341', 'c23ba4'],
        'topologies': ['linear', 'linear', 'linear', 'linear']
    }
}

We will then need to add another layer above the current layer 0, which will then become the new level 0. Here, we can directly observe that the two seqcols X and Y are differently ordered versions of the same collection:

X:
https//seqcolserver.org/digest/3bacd6/0: {
    'ordered': 'ga4b88',
    'unordered': '11968f'
}

Y:
https//seqcolserver.org/digest/746ac2/0: {
    'ordered': '931dad',
    'unordered': '11968f'
}

With such a structure, we will directly compute digests that capture the concept of an unordered seqcol, as argued for in 1b above. At layer 1, we will get more detailed information about which arrays differ in terms of order, compared to their equal unordered counterparts:

X:
https//seqcolserver.org/digest/3bacd6/1: {
    'ordered': {
        'lengths': '36cd24',
        'names': 'fd1567',
        'sequences': '53f7cb',
        'topologies': 'd8a349'
    },
    'unordered': {
        'lengths': '93f21d',
        'names': 'dd139a',
        'sequences': '82df11',
        'topologies': 'd8a349'
    }
}

Y:
https//seqcolserver.org/digest/746ac2/1: {
    'ordered': {
        'lengths': '68bdd5',
        'names': '170a01',
        'sequences': '34f9db',
        'topologies': 'd8a349'
    },
    'unordered': {
        'lengths': '93f21d',
        'names': 'dd139a',
        'sequences': '82df11',
        'topologies': 'd8a349'
    }
}

Here, we see that e.g. the topologies array is equal for all combinations of X and Y, ordered and unordered.

With such a solution I would argue for the unordered tree to be mandatory: for the use cases that rely on unordered seqcols, it will be very useful that the corresponding digests are always generated by all seqcol-compliant servers. Even in contexts where order is irrelevant, the ordered arrays must still be present, as these are the only ones that maintain indexing across all arrays (since the unordered arrays are sorted independently of each other).

The usefulness of such an approach should become clear in the use case of comparing variants of the human genome of the same core version (e.g. hg38/GRCh38), with different names ('chr1' vs '1') and order. Here the unordered.lengths digests should still always be the same, while the unordered.sequences digests (if present) might differ if they are part of different patch versions of the same core version.

tcezard commented 3 years ago

I would advocate for the specification to allow for both ordered and unordered sequence collections. An implementation might want to support only unordered sequence collections; in this case the original order of the collection does not matter (from the point of view of the server), so all the collections only offer the canonical ordering. And we've already established that in many other cases the order is to be included in the collections.

I would argue that the best solution (out of the ones described above) to support both these use cases is the order array (option B in @sveinugu's comment above). This provides the simplest way to allow an implementation to support, or not, the recording of the original order.

The issue of uniqueness of the field used to compute the canonical ordering is very common in bioinformatics in general, and I really can't imagine anyone being surprised if we impose that the name is unique within a collection. My vote is for option B 😄

sveinugu commented 3 years ago

The issue of uniqueness of the field used to compute the canonical ordering is very common in bioinformatics in general, and I really can't imagine anyone being surprised if we impose that the name is unique within a collection. My vote is for option B 😄

Unfortunately, neither sorting variant of B is really robust to changes.

Consider a series of four seqcols W, X, Y and Z that all consist of sequences with the same lengths, but where two of the sequences have the same length (456). At each step of this series, one aspect of the seqcol changes:

Option A:

W:
https//seqcolserver.org/digest/401d99/1:
{
    'lengths': [23456, 12345, 456, 456],
    'names': ['1', '2', 'alt_12345', 'unplaced_23456'],
    'sequences': ['acb341', 'c23ba4', '2314af', '34e838']
}

W:
https//seqcolserver.org/digest/401d99/0:
{
    'lengths': '450fd9',
    'names': '8nba65',
    'sequences': '7dd615'
}

X:
https//seqcolserver.org/digest/adf36d/1:
{
    'lengths': [23456, 12345, 456, 456],
    'names': ['1', '2', 'variant_12345', 'contig_23456'],  # Changed
    'sequences': ['acb341', 'c23ba4', '2314af', '34e838']
}

X:
https//seqcolserver.org/digest/adf36d/0:
{
    'lengths': '450fd9',
    'names': '4nebn2',  # Changed
    'sequences': '7dd615'
}

Y:
https//seqcolserver.org/digest/af93d8/1:
{
    'lengths': [23456, 12345, 456, 456],
    'names': ['1', '2', 'variant_12345', 'contig_23456'],
    'sequences': ['acb341', 'c23ba4', '2314af', '12fed2']  # Changed
}

Y:
https//seqcolserver.org/digest/af93d8/0:
{
    'lengths': '450fd9',
    'names': '4nebn2',
    'sequences': 'b0ae26'  # Changed
}

Z:
https//seqcolserver.org/digest/9912af/1:
{
    'lengths': [23456, 12345, 456, 456],
    'names': ['1', '2', 'contig_23456', 'variant_12345'],  # Order changed
    'sequences': ['acb341', 'c23ba4', '12fed2', '2314af']  # Order changed
}

Z:
https//seqcolserver.org/digest/9912af/0:
{
    'lengths': '450fd9',
    'names': 'ae9dnd',  # Changed
    'sequences': '2f9d13'  # Changed
}

Option B, canonical sorting = length (descending), names:

W:
https//seqcolserver.org/digest/401d99/1:
{
    'lengths': [23456, 12345, 456, 456],  # Primary sort key, descending
    'names': ['1', '2', 'alt_12345', 'unplaced_23456'],  # Secondary sort key (in case of ties)
    'sequences': ['acb341', 'c23ba4', '2314af', '34e838'],
    'order': [0, 1, 2, 3]
}

W:
https//seqcolserver.org/digest/401d99/0:
{
    'lengths': '450fd9',
    'names': '8nba65',
    'sequences': '7dd615',
    'order': 'b41814',
}

X:
https//seqcolserver.org/digest/adf36d/1:
{
    'lengths': [23456, 12345, 456, 456],
    'names': ['1', '2', 'contig_23456', 'variant_12345'],  # Names changed, causing change in sorting
    'sequences': ['acb341', 'c23ba4', '34e838', '2314af'],  # Changed due to sort change
    'order': [0, 1, 3, 2]  # Changed due to sort change
}

X:
https//seqcolserver.org/digest/adf36d/0:
{
    'lengths': '450fd9',
    'names': 'ae9dnd',  # Changed
    'sequences': '4dd441',  # Changed
    'order': 'da76f5'  # Changed
}

Y:
https//seqcolserver.org/digest/af93d8/1:
{
    'lengths': [23456, 12345, 456, 456],
    'names': ['1', '2', 'contig_23456', 'variant_12345'],
    'sequences': ['acb341', 'c23ba4', '12fed2', '2314af'],  # Changed
    'order': [0, 1, 3, 2]
}

Y:
https//seqcolserver.org/digest/af93d8/0:
{
    'lengths': '450fd9',
    'names': 'ae9dnd',
    'sequences': '2f9d13',  # Changed
    'order': 'da76f5'
}

Z:
https//seqcolserver.org/digest/9912af/1:
{
    'lengths': [23456, 12345, 456, 456],
    'names': ['1', '2', 'contig_23456', 'variant_12345'],
    'sequences': ['acb341', 'c23ba4',  '12fed2', '2314af'],
    'order': [0, 1, 2, 3]  # Changed
}

Z:
https//seqcolserver.org/digest/9912af/0:
{
    'lengths': '450fd9',
    'names': 'ae9dnd',
    'sequences': '2f9d13',
    'order': 'b41814'  # Changed
}

Option B, canonical sorting = length (descending), sequences:

W:
https//seqcolserver.org/digest/401d99/1:
{
    'lengths': [23456, 12345, 456, 456],  # Primary sort key, descending
    'names': ['1', '2', 'alt_12345', 'unplaced_23456'],
    'sequences': ['acb341', 'c23ba4', '2314af', '34e838'],  # Secondary sort key (in case of ties)
    'order': [0, 1, 2, 3]
}

W:
https//seqcolserver.org/digest/401d99/0:
{
    'lengths': '450fd9',
    'names': '8nba65',
    'sequences': '7dd615',
    'order': 'b41814',
}

X:
https//seqcolserver.org/digest/adf36d/1:
{
    'lengths': [23456, 12345, 456, 456],
    'names': ['1', '2', 'variant_12345', 'contig_23456'],  # Changed
    'sequences': ['acb341', 'c23ba4', '2314af', '34e838'],
    'order': [0, 1, 2, 3]
}

X:
https//seqcolserver.org/digest/adf36d/0:
{
    'lengths': '450fd9',
    'names': '4nebn2',  # Changed
    'sequences': '7dd615',
    'order': 'b41814',
}

Y:
https//seqcolserver.org/digest/af93d8/1:
{
    'lengths': [23456, 12345, 456, 456],
    'names': ['1', '2', 'contig_23456', 'variant_12345'],  # Changed due to sort change
    'sequences': ['acb341', 'c23ba4', '12fed2', '2314af'],  # Sequences changed, causing change in sorting
    'order': [0, 1, 3, 2]  # Changed due to sort change
}

Y:
https//seqcolserver.org/digest/af93d8/0:
{
    'lengths': '450fd9',
    'names': 'ae9dnd',  # Changed
    'sequences': '2f9d13',  # Changed
    'order': 'da76f5'  # Changed
}

Z:
https//seqcolserver.org/digest/9912af/1:
{
    'lengths': [23456, 12345, 456, 456],
    'names': ['1', '2', 'contig_23456', 'variant_12345'],
    'sequences': ['acb341', 'c23ba4', '12fed2', '2314af'],
    'order': [0, 1, 2, 3]  # Changed
}

Z:
https//seqcolserver.org/digest/9912af/0:
{
    'lengths': '450fd9',
    'names': 'ae9dnd',
    'sequences': '2f9d13',
    'order': 'b41814'  # Changed
}

In conclusion, any change in the arrays used as secondary keys in the canonical sorting algorithm will trigger an order change in the other columns (except in the length column). Even though these are edge cases, this makes it impossible to make use of digest changes in comparison logic, which was the main argument for option B in the first place.

sveinugu commented 3 years ago

For completeness, this is how option C would handle the case above:

W:
https//seqcolserver.org/digest/401d99/1:
{
    'ordered': {  # Maintains indexing across arrays
        'lengths': [23456, 12345, 456, 456],
        'names': ['1', '2', 'alt_12345', 'unplaced_23456'],
        'sequences': ['acb341', 'c23ba4', '2314af', '34e838']
    },
    'unordered': {  # Breaks indexing across arrays
        'lengths': [23456, 12345, 456, 456],  #  Sorted individually. Coincidentally equal to the ordered version
        'names': ['1', '2', 'alt_12345', 'unplaced_23456'],  #  Sorted individually. Coincidentally equal to the ordered version
        'sequences': ['2314af', '34e838', 'acb341', 'c23ba4']  # Sorted individually. Differs from the ordered version
    }
}

W:
https//seqcolserver.org/digest/401d99/0:
{
    'ordered': {
        'lengths': '450fd9',
        'names': '8nba65',
        'sequences': '7dd615'
    },
    'unordered': {
        'lengths': '450fd9',  # Equal to ordered version -> seqcol is sorted by lengths
        'names': '8nba65',  # Equal to ordered version -> seqcol is sorted by names
        'sequences': 'e2c653'  # Differs from ordered version
    }
}

X:
https//seqcolserver.org/digest/adf36d/1:
{
    'ordered': {
        'lengths': [23456, 12345, 456, 456],
        'names': ['1', '2', 'variant_12345', 'contig_23456'],  # Content changed
        'sequences': ['acb341', 'c23ba4', '2314af', '34e838']
    },
    'unordered': {
        'lengths': [23456, 12345, 456, 456],
        'names': ['1', '2', 'contig_23456', 'variant_12345'],  # Content changed and array sorted
        'sequences': ['2314af', '34e838', 'acb341', 'c23ba4']
    }
}

X:
https//seqcolserver.org/digest/adf36d/0:
{
    'ordered': {
        'lengths': '450fd9',
        'names': '4nebn2',  # Changed
        'sequences': '7dd615'
    },
    'unordered': {
        'lengths': '450fd9',  # Equal to ordered version -> seqcol is sorted by lengths
        'names': 'ae9dnd',  # Changed and different from ordered version
        'sequences': 'e2c653'
    }
}

Y:
https//seqcolserver.org/digest/af93d8/1:
{
    'ordered': {
        'lengths': [23456, 12345, 456, 456],
        'names': ['1', '2', 'variant_12345', 'contig_23456'],
        'sequences': ['acb341', 'c23ba4', '2314af', '12fed2']  # Changed
    },
    'unordered': {
        'lengths': [23456, 12345, 456, 456],
        'names': ['1', '2', 'contig_23456', 'variant_12345'],
        'sequences': ['12fed2', '2314af', 'acb341', 'c23ba4']  # Changed
    }
}

Y:
https//seqcolserver.org/digest/af93d8/0:
{
    'ordered': {
        'lengths': '450fd9',
        'names': '4nebn2',
        'sequences': 'b0ae26'  # Changed
    },
    'unordered': {
        'lengths': '450fd9',  # Equal to ordered version -> seqcol is sorted by lengths
        'names': 'ae9dnd',
        'sequences': 'c5a494'  # Changed and different from ordered version
    }
}

Z:
https//seqcolserver.org/digest/9912af/1:
{
    'ordered': {
        'lengths': [23456, 12345, 456, 456],
        'names': ['1', '2', 'contig_23456', 'variant_12345'],  # Order changed
        'sequences': ['acb341', 'c23ba4', '12fed2', '2314af']  # Order changed
    },
    'unordered': {
        'lengths': [23456, 12345, 456, 456],
        'names': ['1', '2', 'contig_23456', 'variant_12345'],  # No change
        'sequences': ['12fed2', '2314af', 'acb341', 'c23ba4']  # No change
    }
}

Z:
https//seqcolserver.org/digest/9912af/0:
{
    'ordered': {
        'lengths': '450fd9',
        'names': 'ae9dnd',  # Changed. Coincidentally equal to unordered version
        'sequences': '2f9d13'  # Changed
    },
    'unordered': {
        'lengths': '450fd9',  # Equal to ordered version -> seqcol is sorted by lengths
        'names': 'ae9dnd',  # No change.  Equal to ordered version -> seqcol is sorted by names
        'sequences': 'c5a494'  # No change
    }
}

With option C, logic is restored at the cost of complexity.

sveinugu commented 3 years ago

In the meeting yesterday, it became clear that option B was not logically coherent and elegant enough as a way to relate ordered and unordered arrays (as argued in the previous comment). Also, the consensus seems to gravitate towards option A for simplicity. It would thus seem that not much progress has been made through the current discussion (but perhaps we have gained more clarity?).

Not fully eager to let go of the logical elegance of option C, but aware of the complexity, I thought a bit more into the situation and an idea came to me: What if we support "all of the above"? Clearly, option B is not preferable, so we should not recommend this, but I believe we could in principle still support it.

(Note: I accidentally submitted this comment. I am writing on a longer comment providing a detailed proposal that will come later...)

sveinugu commented 3 years ago

In the meeting yesterday, it became clear that option B was not logically coherent and elegant enough as a way to relate ordered and unordered arrays (as argued in the previous comment). Also, the consensus seems to gravitate towards option A for simplicity. It would thus seem that not much progress has been made through the current discussion (but perhaps we have gained more clarity?).

Not fully eager to let go of the logical elegance of option C, but aware of the complexity, I thought a bit more into the situation and an idea came to me: What if we support "all of the above"? Clearly, option B is not preferable, so we should not recommend this, but I believe we could in principle still support it.

(continuing from accidentally submitted comment...)

How?

By realising that the options discussed above (and plenty of others) can be reformulated as a combination of a specific structure together with a specific variant of the digest algorithm (for each level in the data structure).

Given the following seqcol:

https://myserver.org/253d1b/1:  # Adding 1 level of recursion

{
    "lengths": [23456, 12345, 456, 456],
    "names": ["1", "2", "alt_12345", "unplaced_23456"],
    "sequences": ["acb341", "c23ba4", "2314af", "34e838"]
}

We have already (more or less) defined three local variants of the digest algorithm, depending on whether we are on the first, the second or the third level, e.g.:

ga4gh_seqcol_array:

  1. Concatenate array in order with delimiter (comma), e.g.: ["acb341", "c23ba4", "2314af", "34e838"] -> "acb341,c23ba4,2314af,34e838"
  2. Hash string with GA4GH hashing algorithm, e.g.: "acb341,c23ba4,2314af,34e838" -> "7dd615" (btw: I am just making up small digests for simplicity)

ga4gh_seqcol_object:

  1. Sort key/value elements of object by key: {"sequences": "7dd615", "lengths": "450fd9", "names": "8nba65"} -> {"lengths": "450fd9", "names": "8nba65", "sequences": "7dd615"}
  2. Concatenate keys and values with comma, alternating between them: {"lengths": "450fd9", "names": "8nba65", "sequences": "7dd615"} -> "lengths,450fd9,names,8nba65,sequences,7dd615"
  3. Hash string with GA4GH hashing algorithm, e.g.: "lengths,450fd9,names,8nba65,sequences,7dd615" -> "401d99"

ga4gh_refget_sequences:

  1. Use the Refget digest algorithm to digest sequences: "actgatcctacacagacagcttt" -> "acb341"
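A hedged sketch of the two seqcol-specific variants above; the hash function and truncation are placeholders standing in for the GA4GH algorithm, so the outputs will not reproduce the made-up digests used in these examples:

import hashlib

def _hash(s):
    return hashlib.sha512(s.encode()).hexdigest()[:6]

def ga4gh_seqcol_array(values):
    # 1. concatenate array elements in order with a comma; 2. hash
    return _hash(",".join(str(v) for v in values))

def ga4gh_seqcol_object(obj):
    # 1. sort key/value pairs by key; 2. concatenate keys and values,
    #    alternating, with commas; 3. hash
    return _hash(",".join(k + "," + str(v) for k, v in sorted(obj.items())))

seqcol = {
    "lengths": [23456, 12345, 456, 456],
    "names": ["1", "2", "alt_12345", "unplaced_23456"],
    "sequences": ["acb341", "c23ba4", "2314af", "34e838"],
}
level1 = {k: ga4gh_seqcol_array(v) for k, v in seqcol.items()}
top_level_digest = ga4gh_seqcol_object(level1)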

So my first idea is to formalise the choice of digest algorithm as part of the data structure itself, for each level, e.g.:

https://myserver.org/253d1b:  # No recursion, only top-level digests

{
    "_digest_algorithm": "ga4gh_seqcol_object",
    "lengths": "450fd9",
    "names": "8nba65",
    "sequences": "7dd615"
}

The main advantage of this is that the data structure would then be self-describing in how its digest was created, allowing for a whole new level of automation (which I will return to later). Also, I think the _digest_algorithm string itself must be concatenated together with the other elements when generating the digest. This would make it impossible for two different algorithms to generate the same digest (assuming there will be no hash conflicts), which means that if you have a digest and you are able to fetch the related data from some API you should also be able to uniquely conclude which algorithm was used to create it.

As there are digests at all levels in our design, the _digest_algorithm must be present at all levels. Assuming the API publicly supports queries on any included digests, the next level might look something like this:

https://myserver.org/450fd9:

{
    "_digest_algorithm": "ga4gh_seqcol_array",
    "lengths": [23456, 12345, 456, 456]
}

https://myserver.org/8nba65:

{
    "_digest_algorithm": "ga4gh_seqcol_array",
    "names": ["1", "2", "alt_12345", "unplaced_23456"]
}

https://myserver.org/7dd615:

{
    "_digest_algorithm": "ga4gh_seqcol_array",
    "sequences": ["acb341", "c23ba4", "2314af", "34e838"]
}

This structure to represent arrays is obviously just a suggestion. An advantage of it is, however, that the name of the array is included in the structure itself, making it self-describing. If the name of the array is also included in the digest algorithm, one would make sure that different types of arrays have unique digests (which seems smart).

Going back to the use case of providing digests for unordered sequence collections: suppose we have three seqcols, A, B, and C, which have the same content, but where the sequences are ordered differently. They will then get different digests according to the algorithms described above. However, if one defines an alternative digest algorithm that disregards the order of the sequences, such an "unordered" algorithm would provide the same digest (say D) for A, B, and C. In a tool setting (e.g. a genome browser), digest D could then be used as an identifier for a "bucket" that collects items as makes sense in that setting (e.g. representing the unordered coordinate system "hg38"), where datasets generated with all ordered seqcols A, B, and C can be visualised together on the same page. It makes sense to use the unordered digest D to represent the genome browser page "bucket", instead of one of the ordered digests A, B, or C, as the visualisation is in any case sorted first by the canonical sequence names and then by genome position.

Should we provide such an unordered digest algorithm? Perhaps? Perhaps not? But if not, I would argue that we should make it easy for someone else to define it while still following the standard. Formalising the digest algorithm within the data structure would provide this.

sveinugu commented 3 years ago

Formalising the digest algorithm within the data structure would provide this.

Note on the format of the digest algorithm formalisation

I have until now assumed the value to be based on some sort of vocabulary of digest algorithms. However, this is not very useful, as one would then need to keep a register of algorithms with said identifiers. A better solution would be to provide URLs to documents describing the algorithm. But URLs are prone to change, which would either result in dead links or the need to change the URL in the data, which in turn would change the digest. The solution is, I believe, CURIEs, and specifically a DOI identifier for a published document describing the algorithm (e.g. deployed on https://Zenodo.org)!

nsheff commented 3 years ago

I'm not sold on this. I think I get the point, but personally, I don't see what's the big problem with just determining the order by comparing the order of the constituent elements. Sure, you have to go down one layer to get the elements and compare them, so it's one step more work than if you had a digest to compare... but how often do you really need to ask that question? And, even if that's a question that needs to be asked repeatedly, then why wouldn't you just pre-compute it and store it in a database?

What you're suggesting here doesn't seem to be accomplishing any more than that, but is doing so by adding much more complexity to the data structure and algorithms. Now you'll have to have a controlled vocabulary for the digest algorithm identifiers like "ga4gh_seqcol_array", and ideally a server somewhere that maintains a mapping of what those terms represent and how to implement them, and different servers would be incompatible if they implement different digest algorithms, etc. These really aren't just 'digest algorithms', which I think of as the checksum function (e.g. md5sum vs sha, vs ga4gh vs trunc512), but are actually defining an entirely different 'seqcol algorithm'... Essentially you'd need to introduce an entire new specification, or maybe a meta-specification, to govern these separate specifications, right?

To me this is just way too much overhead to warrant what I believe is actually not even really a benefit, since you can solve the question of order by just pre-computing a compatibility comparison that looks at the elements.

sveinugu commented 3 years ago

I'm not sold on this.

Unfortunately I have not had time to complete this; there is much more to the suggestion than just this use case. More will follow...

sveinugu commented 3 years ago

I'm not sold on this. I think I get the point, but personally, I don't see what's the big problem with just determining the order by comparing the order of the constituent elements. (…)

(EDIT Oct 20: added some example content to get the point better across, as I hardly understood myself when re-reading my post some weeks later...)

What makes it difficult for me to let this go is not that our solution makes it a bit more cumbersome to compare two digests, but that our solution does not really provide any digests that can be used as identifiers for a coordinate system to replace what e.g. “hg38” is used for today. Which means that some genome browsers (cough UCSC) probably will not embrace the standard.

So what I believe the current solution entails is that some canonical ordering of hg38 will be registered as the golden standard for a browser (possibly lexicographically by sequence names, e.g. "chr1, chr10, chr11,…", as this is easy to carry out automatically). For tools that manage or analyze track files (or scripts/command line usage), all track files (e.g. BED) must carry along a reference genome identifier (put there by e.g. the mapper tool).

Let's say that we have a BED file "hotspots_6afd637fc190.bed" that we know is made from seqcol "6afd637fc190", helpfully provided in an unstructured way in the file name (as there is no metadata field for the reference assembly identifier in the BED format itself). Or, somewhat better, the tool in question has a "reference_genome" parameter or similar that stores the seqcol digest. So in order to make use of this info in a service that does not care about the ordering of the sequences (say the UCSC Genome Browser), one needs to run the digest through a particular function (available in a library which wraps a seqcol comparison function). This function loops through all available canonically ordered coordinate systems (stored in a list or fetched from the genome browser), and for each of them, it checks whether the canonical coordinate system is a subset of the seqcol referred to by the seqcol digest recorded with the BED file, with the same lengths and names, but possibly in a different order. If so, one knows that the coordinate system can be used for the analysis or visualisation.

Once this is carried out and a match is found, however, the coordinate system digest that was found cannot really logically replace the currently stored seqcol digest, as it may represent a reordering. So the BED file in our example cannot be renamed to e.g. "hotspots_35fab9c10384.bed", as the coordinate system "35fab9c10384" explicitly refers to an ordered seqcol that does not match the order in the BED file, which again may cause issues in downstream tools. Currently this "un-FAIR" BED file would have been called e.g. "hotspots_hg38.bed", and this would be ok, as tools that require coordinate systems would not care about the order of sequences. Hence, "hg38" in practice means an unordered coordinate system in this context. One could, of course, continue to disregard the order of the sequences in downstream tools, but this would IMO mean that we have defined another standard that doesn't really match the landscape of the field of track file analysis/visualisation.

So in practice, one loses the current main purpose of the «reference genome» field, which is for the end user (as well as tools) to easily check directly (by reading the string) whether the file can be uploaded to the genome browser or otherwise be compared with other files of the same coordinate system.

Being able to replace the main ordered reference genome with a digest representing an unordered coordinate system is a loss of information (one loses the sequences and the order), but at least it logically represents the same concept as the current string ("hg38") and does not add incorrect information (another ordering).

With my suggestion, an unordered coordinate system digest would in principle contain information about the canonical ordering as part of the documentation of the digest algorithm (in the part that defines a canonical serialisation of the seqcol object), which I believe is the logically correct place for that piece of information. Also, one can carry out this information reduction offline (removing the sequence array and the order) and it will be independent of particular lists of supported digests for particular genome browsers.

sveinugu commented 3 years ago

Unfortunately, the issue of providing digests for unordered coordinate systems reveals a problem with the current "columnwise" solution with all information stored in arrays, e.g.:

https://myserver.org/253d1b:  # Recursion layer 0 => Level 1
{
    "lengths": "450fd9",
    "names": "8nba65",
    "sequences": "7dd615"
}

https://myserver.org/253d1b/1:  # Recursion layer 1 => Level 2
{
    "lengths": [23456, 12345, 456, 456],
    "names": ["1", "2", "alt_12345", "unplaced_23456"],
    "sequences": ["acb341", "c23ba4", "2314af", "34e838"]
}

Note: there are some inconsistencies in the nomenclature between issues https://github.com/ga4gh/seqcol-spec/issues/10, which describes "recursion layers", and https://github.com/ga4gh/seqcol-spec/issues/21, which describes "levels". Recursion layer 0 results in a digest at level 1, it seems. I believe some cleanup of the nomenclature is in order... In the above, I have used layer and level interchangeably. From here on I will use the word "level" and talk about level 1 and 2 instead of layer 0 and 1.

One problem with the current column-wise solution is that it is not possible to generate a level 0 digest for an unordered sequence collection from a level 1 object with per-array digests. This is true even if such level 1 per-array digests referred to unordered arrays, as this would break the row-wise relationships between the "cells" of information at level 2, which are crucial for generating an unordered digest. Hence, for the possible extension to unordered sequence collections, level 1 as defined today (option A above) is useless. This is also true for both variants of option C above, the first variant which doubles the number of arrays (see https://github.com/ga4gh/seqcol-spec/issues/5#issuecomment-905494394) and the second variant which contains 'ordered' and 'unordered' sub-objects that each follow option A (see https://github.com/ga4gh/seqcol-spec/issues/5#issuecomment-910295030), for the same reason.

The underlying logical fallacy in the above discussion, and the real reason behind all the confusion, is that unordered sequence collections really rely on a "row-wise" structure. The same seqcol in a "row-wise" form would look something like this:

https://myserver.org/253d1b:  # Level 1
[
    "f5aa84",
    "399b73",
    "aa937b",
    "4298af"
]

https://myserver.org/253d1b/1:  # Level 2
[
    {
        "length": 23456,
        "name": "1",
        "sequence": "acb341"
    },
    {
        "length": 12345,
        "name": "2",
        "sequence": "c23ba4"
    },
    {
        "length": 456,
        "name": "alt_12345",
        "sequence": "2314af"
    },
    {
        "length": 456,
        "name": "unplaced_23456",
        "sequence": "34e838"
    }
]

With such a structure, it would be trivial to create an unordered digest from level 1. In the serialisation algorithm, one would just order the array items (which are all digests) alphanumerically before applying the digest algorithm. However, one would lose all the advantages that the column-wise solution gives, mainly that appending new fields will not affect the relationships that the other arrays are part of.
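A minimal sketch of that serialisation step, with a placeholder hash and delimiter: since the level 1 value of a row-wise seqcol is just a list of per-sequence (row) digests, the order-ignoring variant only needs to sort that list before digesting.

import hashlib

def _hash(s):
    return hashlib.sha512(s.encode()).hexdigest()[:6]

row_digests = ["f5aa84", "399b73", "aa937b", "4298af"]     # collection order

ordered_digest = _hash(",".join(row_digests))              # order-sensitive
unordered_digest = _hash(",".join(sorted(row_digests)))    # order-ignoring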

The main reason that I am still stressing the issue of unordered sequence digests is that it represents a significant argument against a pure column-wise data structure, with a use case that is of some importance (and of major concern to me in my main scientific domain).

nsheff commented 3 years ago

To me, this is another manifestation of what I'm concluding from our order discussions: that there is no straightforward way to add order-ignoring digests. I don't quite follow your argument for why the above options we've discussed fail in this case -- but anyway I agree that switching to your "row-wise" structure is another way to address the order issue.

Either way, all of these solutions we've come up with are very complex, but in my opinion, provide very little practical benefit. To me, there has still not been a satisfactory argument against simply using the compare function to ask questions of order-equivalence.

Our effort trying to get to order-ignoring digests has shown that there's no easy way to do that. In contrast, the compare function is easy to write, easy to use, and provides exactly the same information (actually, provides much more information). Therefore, I have returned to my original position: the digests must be ordered to accommodate the ordered use cases, and use cases that require knowing if there is order-ignoring equivalence should simply use the comparison function to identify that.

The main reason that I am still stressing the issue of unordered sequence digests is that it represents a significant argument against a pure column-wise data structure, with a use case that is of some importance (and of major concern to me in my main scientific domain).

I disagree here. In fact, my use case is I believe in line with yours, that seqcol needs to be useful for coordinate systems, and even unordered coordinate systems. You and I are both working with BED files and interested in identifying compatibility between them. Your proposals all address the issue with order-ignoring digests -- but the compare function also solves this issue! I'm not sure if you've considered if you can address your need just using the compare function.

I suggest taking a bit of a different tack here. I do not think re-opening the question of 'column-wise' vs 'row-wise' is the right answer. Our decision had good rationale, such as the independent component digests, which are desirable for the ordered use case, which must be the primary function of seqcol. We paused the order/unorder question to work on the compare function in the meetings, to see if the compare function could be a workable solution to the use cases we were trying to answer with the order-ignoring digests. So, my suggestion is this: can you make use of the compare function to answer the questions you need to answer about unordered seqcols? The answer is obviously yes, you can, though it might be slightly less convenient, right? But I think if we do the compare function right, it doesn't even have to be much less convenient.

So, the question is: what does it take for the compare function to satisfy your use case? Once that is answered, we can put the complicated and very time consuming order question to rest.

sveinugu commented 3 years ago

@nsheff Thanks for your thorough answer. However, I have tried to clearly point out the main issue in my previous comment above (https://github.com/ga4gh/seqcol-spec/issues/5#issuecomment-934039875): For seqcol to be useful in a FAIR manner for track files, one would need a globally unique identifier that precisely defines a coordinate system, which is, in practical use as well as logically, an unordered sequence collection of lengths and names. A comparison function will not provide an identifier. To put it directly, I would like an identifier that I can put in the assembly_id field in the recently published FAIRtracks draft metadata standard for genomic tracks!

As a side-note, we should probably use assembly_id for the full ordered sequence collection from which the track file was made (e.g. mapped towards), and in addition provide another field named coordinate_system_id or similar to refer to the unordered coordinate system this entails.

In any case, a user with a track file and the related metadata could then simply use the coordinate_system_id to fetch the chromSizes file to use as input to tools that need it. Also, this is the identifier to use in genome browsers or track analysis frameworks, such as LOLA, the Genomic HyperBrowser, or Galaxy, to place related track files in the same bucket, for files that can be compared within the same coordinate system.

Unfortunately, I believe the coordinate system needs to be an unordered one. And the comparison function will not help with providing an accurate metadata identifier for the coordinate system (the creation of which was the main reason I joined this project!).

sveinugu commented 3 years ago

So here's my suggestion:

We support both column-wise and row-wise structures (and any other type of structure really), as well as ordered and unordered serialisation algorithms (and possibly more), and any number of hash algorithms we like. However, we can leave it to others to define some of these if we do not want to.

The only thing we need to add to the above suggestion is self-describing information about the data structure, something which is already supported by JSON Schema. Collecting all of the above, this may look as follows:

https://myserver.org/253d1b:

{
    "_metadata": {
        "serialisation_alg_name": "ga4gh_seqcol_object",
        "serialisation_alg_id": "doi:12.3456/7890.1",
        "digest_hash_alg_name": "trunc512",
        "digest_hash_alg_id": "doi:12.3456/arewemissingadoiforthis?",
        "schema_name": "ga4gh_minimal",
        "schema_id": "doi:98.7654/3210.1"
    },
    "lengths": "450fd9",
    "names": "8nba65",
    "sequences": "7dd615"
}

And similarly at all levels and substructures. All identifiers should follow CURIE form and preferably be actionable through https://identifiers.org or https://n2t.net (name-to-thing). As to the digest_hash_alg_name, there is already a registry for this: https://github.com/ga4gh-discovery/ga4gh-checksum/blob/master/hash-alg.csv (but it seems a DOI or similar identifier for the refget algorithm is missing).

With this, it will be up to the server to select the algorithms and schemas to support. We will provide required lists and optional lists, which can be easily extended. Also, as all records are self-described, one can potentially implement clients that will work regardless of the contents (however, this will require that the algorithm identifiers are actually actionable, which the JSON Schema id already is. I don't know if there are any standards for actionable algorithms with runnable implementations, but if not, there should be).

With this solution, we don't need to implement a solution for unordered coordinate systems now, but someone else will be able to, and it will be easy to do so later.

EDIT: I suppose the _metadata object (or whatever one would like to call this) can also be digested, but in that case, the serialisation and hash algorithms have to be fixed. If not, we would require an infinite recursion of child _metadata objects, where each _metadata object specifies the algorithms needed to digest its parent! 😄

nsheff commented 3 years ago

For seqcol to be useful in a FAIR manner for track files, one would need a globally unique identifier that precisely defines a coordinate system, which is in practical use, as well as logically, an unordered sequence collection of lengths and names.

I think I disagree here. I think coordinate systems should be ordered. In other words, I'm not convinced that they should not be allowed to have order.

sveinugu commented 3 years ago

I think I disagree here. I think coordinate systems should be ordered. In other words, I'm not convinced that they should not be allowed to have order.

I agree that this is the core question and that it has potentially large implications (detailed in the current issue). I don't think, however, that the answer to this question should be made based on ease of implementation (for us).

I know that both the UCSC Genome Browser and the Genomic HyperBrowser don't care in which order the sequences appear in the input file. Given that we want the possibility of uniquely specifying a coordinate system for each file, there are really three possibilities:

  1. The metadata for each file contains a digest for an ordered coordinate system in the order it appears in that file, which means that two files with logically the exact same contents, differing only in how they are sorted, will have different coordinate systems attached to them. If you sort a BED file, you would then need to recalculate the attached coordinate system digest, which makes our solution a nuisance, as you would need to add this as a step to all the workflows in use out there.
  2. The metadata for each file contains a digest for an ordered coordinate system, ordered in a canonical way, which would probably be in the same order as the reference genome from which it came. The order of the contents of the file might differ from the order of the attached coordinate system, but there is no need to recalculate the coordinate system when sorting a BED file, as in 1. However, this solution requires a coordinate system authority (probably as a seqcol server) to specify the canonical order. Most implementations either follow alphanumeric ordering, natural number ordering (to make sure it's chr1, chr2, chr3,... and not chr1, chr10, chr11,...), the order as specified in an input chromSizes file, or some custom hack. Only tools following the third option (follow the order of the chromSizes file) would work with solution 2; all others would need to reimplement their internal order logic. I believe for instance the UCSC Genome Browser is in the first category, and follows alphanumeric ordering, and would need to update the implementation (but that should be double-checked). In the Genomic HyperBrowser, we use alphanumeric ordering based on a set of FASTA files for the underlying data structure, while we have implemented chromSizes-based ordering for visualisation purposes, but where the chromSizes file is generated from the FASTA files and the order is defined manually when importing a genome (this is ooold code).
  3. The metadata for each file contains the digest for an unordered coordinate system. Sorting a BED file will not result in any need for recalculating the digest. One can easily calculate the digest from the chromosome length information (independently of its order), so there will be no need to contact an authority for this. BED files made from chromSizes files that differ in their order will automatically have the same identifier for the coordinate systems. (Note that the canonical sorting of a BED file is alphanumeric, not following a chromSizes file made by some authority). As the coordinate system is unordered, existing tools do not need to reimplement their order logic. As a chromSizes file IS ordered, a seqcol server might have some issues with locating the correct ordered seqcol if there are several seqcols with different orders that map to the same unordered digest. However, in practice the seqcol server will know of the canonical reference genomes and can use this to define the correct ordered coordinate system. This is in any case within their domain and possible for us to include in our standard. The last consequence for solution 3 is that the GA4GH sequence collection standard becomes more complicated. But our map will match the landscape and it would be the defensive solution, making it more probable that our standard will be accepted.

EDIT: Added a few arguments here and there. EDIT2: some more details on the UCSC Genome Browser and the Genomic HyperBrowser

sveinugu commented 3 years ago

Here is one idea I have been thinking about that would solve the order issue, as well as add the possibility of seqcol subsets. It would also work independently of my metadata ideas as explained above. What if we allow the following:

X:
https://seqcolserver.org/digest/d18f4a/1:
{
    'lengths': [34567, 23456, 12345],
    'names': ['chr1', 'chr2', 'chr3'],
    'sequences': ['acb341', 'c23ba4', '2314af'],
    'topology': ['linear', 'linear', 'linear']
}

Y:
https://seqcolserver.org/digest/86d0a0/1:
{
    'lengths': [456, 567],
    'names': ['alt_12345', 'alt_23456'],
    'sequences': ['32d2bd', 'a793b1'],
    'topology': ['linear', 'linear']
}

Z_ordered:
https://seqcolserver.org/digest/82bd83/1:
{
    'subsets': ['d18f4a', '86d0a0']
}

Unordered coordinate systems could then be encoded by defining each sequence as a subset, as follows:

X1:
https://seqcolserver.org/digest/1a2d53/1:
{
    'lengths': [34567],
    'names': ['chr1']
}

X2:
https://seqcolserver.org/digest/28df89/1:
{
    'lengths': [23456],
    'names': ['chr2']
}

X3:
https://seqcolserver.org/digest/9d2339/1:
{
    'lengths': [12345],
    'names': ['chr3']
}

Y1:
https://seqcolserver.org/digest/13a98d/1:
{
    'lengths': [456],
    'names': ['alt_12345']
}

Y2:
https://seqcolserver.org/digest/dd87a0/1:
{
    'lengths': [567],
    'names': ['alt_23456']
}

Z_unordered:
https://seqcolserver.org/digest/39d93a/1:
{
    'subsets': ['13a98d', '1a2d53', '28df89', '9d2339', 'dd87a0']
}

For Z_unordered, the subset digests are sorted before the final digest is computed. I believe we would also need to add some metadata info to the structure encoding the fact that this is an unordered seqcol.

EDIT Dec 1: Removed the sequences and topologies from the coordinate system example, as those arrays wouldn't typically be a part of defining a coordinate system, at least in the horizontal genome browser world.
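To make this concrete, here is a minimal sketch (in Python, not part of any spec) of how such an order-invariant digest could be computed from the subset digests. The sha512t24u helper and the comma delimiter below are placeholder assumptions standing in for whatever digest algorithm and serialisation the standard ends up defining:

import base64
import hashlib

def sha512t24u(data: bytes) -> str:
    # Placeholder digest: truncated SHA-512, base64url-encoded.
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")

def unordered_digest(subset_digests):
    # Sorting first makes the result independent of the order the subsets were given in.
    return sha512t24u(",".join(sorted(subset_digests)).encode("utf-8"))

# Z_unordered gets the same digest no matter how its single-sequence subsets are listed:
assert unordered_digest(['dd87a0', '13a98d', '1a2d53', '9d2339', '28df89']) == \
       unordered_digest(['13a98d', '1a2d53', '28df89', '9d2339', 'dd87a0'])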

sveinugu commented 2 years ago

As the above issue is lengthy and full of details, I will try a shorter writeup of the arguments for why digests for unordered coordinate systems are needed, together with my suggested solution for providing them:

So the main feature I have been advocating here is that the seqcol specification should provide globally unique and persistent identifiers for the coordinate system that a track data file (e.g. a BED file) relates to (FAIR principle F1). I have a vested interest in this issue due to being the lead developer of the FAIRtracks draft standard for genomic track metadata (see e.g. a recent blog post). It has been argued that "genome build information is an essential part of genomic track files" (which is incidentally the title of this paper), and since few track data formats support embedding genome build identifiers, at least any metadata standards should. In the context of track data, the exact sequence contents are often of lesser concern than the coordinate system, and tracks made from different patches of the same version of a reference genome are usually analysed together. Hence the need to uniquely identify coordinate systems. Of specific importance in this regard is the need to somehow denote whether a track follows the UCSC 'chr' naming scheme, something which we can manage in our solution with the names column.

First of all, @nsheff is correct (https://github.com/ga4gh/seqcol-spec/issues/5#issuecomment-949659744) in that the core issue to be determined here is whether this identifier should refer to an ordered or an unordered coordinate system. I argue that, for most practical purposes, the current vaguely defined identifiers, such as "hg38", are in practice used to refer to a coordinate system where the exact order does not count.

If we were to require that a coordinate system needs to be ordered, the natural order to pick would be the lexicographic order, for several reasons, one being that this is the order required by the BigBED file format. I believe there are many problems with this approach:

  1. There is little standardisation across the different tools that are implemented for analysis and visualization of genomic tracks on how to manage the order of sequences. There is a practical difference between the order needed for implementation of analysis functionality (usually no order is needed, as the chromosomes in most cases can be analysed independently of each other) and for visualisation (order needs to look good for human eyes). As lexicographic order does not look very good to human readers (i.e. chr1, chr11, chr12...), some tools make use of natural sort ordering schemes, some make use of the order provided by users in chromSizes files, and some might look up the order from official repos (the ENSEMBL browser probably does this).
  2. Let us look at the issue from the perspective of a tool developer that wants to adopt a better scheme for identifying reference genomes, for example in the "select reference genome" selection box of a genome browser (typically the first choice that appears to the user). Given that we only provide ordered coordinate systems, the tool developer would then have to choose an identifier that represents the order required for data upload (most likely lexicographical, if any), the order used in visualization (typically ad hoc solutions), or the order provided by a canonical reference genome, which might not match the two other choices. All these possible identifiers would be confusing to the users. Given the differences in implementation of order, as mentioned above, the field of track analysis would end up having to manage a larger selection of different identifiers than now, something that would make the current situation worse, not better, as I see it.

  3. Given that we only provide ordered coordinate systems, then any time a track data file is uploaded to a genome browser or other service, one would need to use the compare endpoint to compare the coordinate system provided in the track metadata and the coordinate system required by the service. This might cause issues, for instance if the server does not contain both identifiers. Also, for automatic use, such comparison would often need to be carried out in a for loop where the coordinate system digest associated with a track data file is compared with the digests of all the reference genomes supported by the service. Due to this cumbersomeness, I would presume that the seqcol standard would see little adoption in this context, as it would be easier to just look at the file contents instead (which is often what is being done today anyway). For workflow systems and other solutions that need to match data files with supporting tools based on metadata information only, looking at the data content is by definition not a solution.

  4. Conversely, if we provide solutions to create identifiers for unordered coordinate systems, I strongly believe the field would converge on only two such identifiers for each reference genome, one with the 'chr'-names and one without. For visualization, one would simply select a fitting canonical ordered seqcol as a blueprint. In our example, the tool developer would need to decide whether to present to the user the list of identifiers supported for data import or for visualisation (or both, which is possible since there will be a 1-1 relation between the two lists. This is again due to the small likelihood that the tool will include support for two different orderings of the same genome).

  5. An unordered identifier can be calculated directly from an ordered seqcol or other reference genome representation without the need for a service lookup, which would be highly practical and would increase adoption. For the opposite direction, we or someone else could implement functionality for querying with an unordered coordinate system identifier and receiving the corresponding list of canonical ordered seqcols. However, I am not sure whether such reverse lookup would be much needed in practice.

  6. In addition to the need for identifiers for unordered coordinate systems, there is a practical need for a solution to automatically extract the smaller list of the "primary" sequences from the full list of references, as most track analysis/visualisation tools ignore many of the extra sequences (such as alternative sequences, the mitochondrial sequence, or similar). Having integrated support for subsets in the seqcol specification would be a useful way to solve this and other subset-related issues that the current idea does not solve.

This comment is a previous attempt at arguing for the need for a schema for unordered coordinate system identifiers.

I will follow up with a writeup of the current suggestion for supporting identifiers for unordered coordinate systems as well as seqcol subsets.

nsheff commented 2 years ago

Can you clarify what you mean by "track file" ? In my experience this is not a very commonly used term and I'm not exactly sure what you mean. Either defining it or using a more frequent term would help me understand.

sveinugu commented 2 years ago

How I suggest we solve the issue:

First, there don't seem to be any solutions for generating a digest for an unordered seqcol from our current level 1 array digests. What is needed is to somehow add support for row-wise data storage at level 2 in addition to the current column-wise solution. Here is an updated suggestion on how to achieve this:

Given the following ordered seqcols X and Y.

X:
https://seqcolserver.org/collection/d18f4a/1:
{
    'lengths': [34567, 23456, 12345],
    'names': ['chr1', 'chr2', 'chr3'],
    'sequences': ['acb341', 'c23ba4', '2314af'],
    'topology': ['linear', 'linear', 'linear']
}

Y:
https://seqcolserver.org/collection/86d0a0/1:
{
    'lengths': [456, 567],
    'names': ['alt_12345', 'alt_23456'],
    'sequences': ['32d2bd', 'a793b1'],
    'topology': ['linear', 'linear']
}

Say we want to specify a seqcol Z that has X and Y as subsets, in that order. This could be done as follows:

Z:
https://seqcolserver.org/collection/82bd83/1:
{
    'subset_names': ['Y', 'X'],
    'subset_order': [1, 0],
    'sorted_subsets': ['86d0a0', 'd18f4a']
}

subset_names would be needed in order to allow tools to easily extract particular subsets, say the primary chromosomes. sorted_subsets are the digests of the sequence collections representing the subsets, in lexicographically sorted order. subset_order contains the indexes of the subsets in the order represented by seqcol Z.
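Reading the Z example above, I take subset_names and sorted_subsets to be aligned with each other (both in the sorted order), while subset_order maps back to the collection order. A small hypothetical helper along these lines (the function name and serialisation are mine, not spec-defined) could be:

def make_subset_record(named_subsets):
    # named_subsets: [(subset_name, subset_digest), ...] in the order of the collection.
    by_digest = sorted(named_subsets, key=lambda pair: pair[1])  # lexicographic by digest
    sorted_digests = [digest for _, digest in by_digest]
    return {
        'subset_names': [name for name, _ in by_digest],
        'subset_order': [sorted_digests.index(digest) for _, digest in named_subsets],
        'sorted_subsets': sorted_digests,
    }

# Z: X ('d18f4a') followed by Y ('86d0a0'), in that order:
make_subset_record([('X', 'd18f4a'), ('Y', '86d0a0')])
# -> {'subset_names': ['Y', 'X'], 'subset_order': [1, 0], 'sorted_subsets': ['86d0a0', 'd18f4a']}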

So my idea is that this solution to the subset problem also represents a solution to the problem of representing unordered coordinate systems. First, one would need a way to represent the case where the subsets are unordered (which we in our example will call Z*). The solution to this is trivial; it is just to remove the subset_order column:

Z*:
https://seqcolserver.org/collection/82bd83/1:
{
    'subset_names': ['Y', 'X'],
    'sorted_subsets': ['86d0a0', 'd18f4a']
}

And this is really also the only thing needed in terms of data representation to solve the issue of providing digests for unordered sequence collections. The rest of the solution can be implemented on the data side by adding sequence collections where each subset only contains a single sequence, e.g. as illustrated with the unordered seqcol W:

X1:
https://seqcolserver.org/collection/1a2d53/1:
{
    'lengths': [34567],
    'names': ['chr1']
}

X2:
https://seqcolserver.org/collection/dd87a0/1:
{
    'lengths': [23456],
    'names': ['chr2']
}

X3:
https://seqcolserver.org/collection/28df89/1:
{
    'lengths': [12345],
    'names': ['chr3']
}

Y1:
https://seqcolserver.org/collection/13a98d/1:
{
    'lengths': [456],
    'names': ['alt_12345']
}

Y2:
https://seqcolserver.org/collection/9d2339/1:
{
    'lengths': [567],
    'names': ['alt_23456']
}

W:
https://seqcolserver.org/collection/39d93a/1:
{
    'subset_names': ['alt_12345', 'chr1', 'chr3', 'alt_23456', 'chr2'],
    'sorted_subsets': ['13a98d', '1a2d53', '28df89', '9d2339', 'dd87a0']
}

What is lost in this solution is the possibility of supporting subsets also for unordered coordinate systems. One could possibly allow two levels of subsets, but I think that would complicate everything. In practice, one would just make use of the subsets of the corresponding ordered seqcols instead, if defined.

Of course, supporting this suggestion would have consequences for the other endpoints, but I am quite sure it would not be very difficult to sort that out. As argued in the comment above, I believe it would be worth it. Also, one can add other unrelated arguments for supporting subsets, for instance that this would make it possible to see the differences between sequence collections at the subset level, which I believe will be enough for many cases. It will probably also reduce the total storage size needed for a server, due to the current redundancy at the subset level.

sveinugu commented 2 years ago

Can you clarify what you mean by "track file" ? In my experience this is not a very commonly used term and I'm not exactly sure what you mean. Either defining it or using a more frequent term would help me understand.

E.g. a BED/WIG/GFF/BigBED/BigWIG/VCF/... file

sveinugu commented 2 years ago

Can you clarify what you mean by "track file" ? In my experience this is not a very commonly used term and I'm not exactly sure what you mean. Either defining it or using a more frequent term would help me understand.

E.g. a BED/WIG/GFF/BigBED/BigWIG/VCF/... file

I have tried to clarify this in the post by using the term "track data file" instead and exemplifying this with the BED file format. I don't know what else to call such files. Do you have any other suggestions?

nsheff commented 2 years ago

the main feature I have been advocating here is that the seqcol specification should provide globally unique and persistent identifiers for the coordinate system

I agree with you 100%, and I'm fully on board with that. So, we definitely share the vision for the main point of this... but I just fundamentally disagree on a few of the points, notably:

the core issue to be determined here is whether this identifier should refer to an ordered or an unordered coordinate system. I argue that, for most practical purposes, the current vaguely defined identifiers, such as "hg38", are in practice used to refer to a coordinate system where the exact order does not count.

Here, my opinion diverges. In my opinion:

  1. the identifier should refer to an ordered coordinate system.
  2. The current vaguely defined identifiers, like "hg38", are not "in practice used to refer to a coordinate system where the exact order does not count". In fact, this depends on the use case, and some use cases do require order, so the final digest must reflect order.

But set that aside for a moment. If I go along with the idea that coordinate systems should be defined as unordered... even if I do that, to me there is one major point that has not gotten enough attention: even if we do provide an unordered digest, I do not believe this will serve the primary purpose of this, which is to "provide globally unique and persistent identifiers for the coordinate system"

The reason is this: in my experience, the most frequent incompatibilities between coordinate systems are NOT due to order mismatches, but instead, to differing sequence names or collection elements contained. For example, all the commonly used "hg38" references in the wild today differ in terms of collection elements and sequence names. None of them is simply an order-differing variant of another. Therefore, an order-invariant digest will be of no use in this situation, since their order-invariant identifiers will not match anyway.

To me, the way to solve this problem is not to mint new identifiers that lack order. Instead, the problem can be addressed through the comparison function, which has the potential to address not only the order question, but all those other compatibility questions as well. I acknowledge that it's less convenient to have to execute a function instead of just looking for a perfect digest match -- however, the function better reflects the complexity of the problem we're dealing with. Those incompatibilities go beyond order and are simply too numerous to be solved by unique identifiers. They require something more powerful. That's the compatibility function.

So instead of having people rely on the identifier to convey everything they need, I'm arguing that we need to move people to using the compatibility function. If it's super easy and lots of tools are built around the return value of /comparison, then this is almost as convenient as a single identifier that you can look at and just know. The relationships among coordinate systems will not suddenly become simplified with order-invariant digests. Order-invariant digests will not solve many problems in practical terms. We will still need the compatibility function to do the majority of the comparisons.

So, my conclusion is we should not bother adding the complexity of multiple types of digests, and instead, make the comparison function as fast and useful as possible, since that's what people are going to have to be using anyway.

A few more thoughts about why the unordered identifiers won't be used:

nsheff commented 2 years ago

Can you clarify what you mean by "track file" ? In my experience this is not a very commonly used term and I'm not exactly sure what you mean. Either defining it or using a more frequent term would help me understand.

E.g. a BED/WIG/GFF/BigBED/BigWIG/VCF/... file

I have tried to clarify this in the post by using the term "track data file" instead and exemplifying this with the BED file format. I don't know what else to call such files. Do you have any other suggestions?

To me, "track" refers to a row in a genome browser. I think a BED file can be represented as a track in a genome browser, but it can also just be a BED file, which I use in an analysis independent of a genome browser. As soon as you say "track data file" you're talking from the perspective of a genome browser. But since 99% of my use of these data files is not in conjunction with a genome browser, I find the terminology confusing. Those files are obviously also very useful outside of genome browsers. So, I wouldn't refer to them as tracks unless you're specifically talking about genome browsers. I usually refer to the files by data type, like "genomic interval data", but that's specific to BED-type and you're looking for a more generic term that also includes wiggle-type tracks... I'd probably say something like "genome signal/region annotation" files, or something like that.

sveinugu commented 2 years ago

The reason is this: in my experience, the most frequent incompatibilities between coordinate systems are NOT due to order mismatches, but instead, to differing sequence names or collection elements contained. For example, all the commonly used "hg38" references in the wild today differ in terms of collection elements and sequence names. None of them is simply an order-differing variant of another. Therefore, an order-invariant digest will be of no use in this situation, since their order-invariant identifiers will not match anyway.

So I definitely agree with this experience, which is why I believe our current idea is a great attempt at providing a solution to the situation. So when "hg38" is attached to e.g. a BED file, it can typically mean anything in terms of collection elements and sequence names, but typically not order. And the reason for this is that in this context (file metadata), you are primarily interested in preserving the provenance (which reference genome was used to create this file).

My argument is that the use of the term "hg38" changes slightly in meaning in the context of analysis and visualisation tools. Here, you would no longer be so interested in provenance, other than wanting to maintain the existing metadata all the way to the results (and also presumably adding info about the analysis parameters, etc.). But in the analysis or visualisation itself, you are interested in the mathematical model of the space within which your data is defined, and whether the space inherent in your data files is compatible with that of other data files or with the model inherent in the analysis/visualisation.

So what I now see more clearly is that we are really talking about two different things:

A) One is an identifier representing a sequence collection that refers to a record in a repository. This is the main scope of the seqcol standard. The ordered coordinate system as represented in a chromSizes file is an extract of this, and I agree that one would always want to carry this identifier along with a data file.

B) The other thing is an identifier for the data model that is needed to make mathematical/statistical sense of the track data. Basically, the coordinate space in which the data file is defined. This model needs to be unordered, and should really also be extended to include information such as gaps in the assembly and other gaps where data is missing. This is an analysis-oriented model in contrast to a storage-oriented model.

Coordinate systems represent chromSizes files. These have an order. Even if we think of the abstract "coordinate systems" as lacking order... we still want to refer to chromSizes files with these identifiers, and they do have an order. To me, these are the real representation of coordinate systems.

So to me, a "coordinate system" is a word I would use for B and for B only. It is a mathematical term. A chromSizes file is a representation of a coordinate system, not the other way round. I would also say it is really not the best representation, precisely because it contains an order which is in practice often discarded anyway for analysis (but retained for visualization/results). In addition, a mathematical model of the space in which the data file is defined should really also have information about missing data, the most important of which are the centromere regions, but also other repeating regions, etc. We have been discussing some of these aspects before, but have tended to define them out of scope. I do understand that we cannot solve everything with this, but having the possibility of order-invariant sequence collections would allow for the seqcol standard to be used to also represent a coordinate system in the mathematical sense, at least in a very basic form.

sveinugu commented 2 years ago

To me, "track" refers to a row in a genome browser. I think a BED file can be represented as a track in a genome browser, but it can also just be a BED file, which I use in an analysis independent of a genome browser. As soon as you say "track data file" you're talking from the perspective of a genome browser. But since 99% of my use of these data files is not in conjunction with a genome browser, I find the terminology confusing. Those files are obviously also very useful outside of genome browsers. So, I wouldn't refer to them as tracks unless you're specifically talking about genome browsers. I usually refer to the files by data type, like "genomic interval data", but that's specific to BED-type and you're looking for a more generic term that also includes wiggle-type tracks... I'd probably say something like "genome signal/region annotation" files, or something like that.

This is getting a bit out of scope, but IMHO I think you just proved my point that there aren't really any better terms to use than tracks, or possibly "track (data) files", also for non-visual analysis. Here is a bit of text I just wrote for the new FAIRtracks.net website (under construction), containing my core argument:

The term "track" implies a mental visual model of elements located along some line. In the domain of genomics this typically relates to a coordinate system defined by a reference genome sequence collection. This is a crucial model to keep in mind also for non-visual analysis scenarios. There exists, to our knowledge, no other term in broad use which is both general and precise enough to cover the same heterogeneity of data content, data representation and analysis scenarios as the term "track" (or a variation thereof). Hence, the FAIRtracks team advocate the use of the term "track" also in analysis scenarios outside the realm of genome visualization.

Forgetting to think of a BED file as a track that follows the genome can have large consequences for statistical analyses, for instance by disregarding the importance of the clustering aspect of the elements along the coordinate system.

Not sure if I will be able to win you over here, but at least I'm trying, one bioinformatician at a time...

nsheff commented 2 years ago

Yes I agree there's not a great existing term for that. But to me you're trying to create a new use for an old term. If you can do it, great! I am not arguing against that, just telling you how I interpreted the term. I interpreted it differently from how you used it, and as a result, I am confused when I read some of what you write. For terms like this not in common use, if you want to use them in a way different from what others do, then... you have to make that clear and define them when you use them, and it's going to cause some confusion, until you succeed in training the whole community to use the term in that way.

Personally I think this particular one could be a losing battle -- 'track' is a very old term that is already embedded, at least in my mind, from decades of genome browsers. Even now that I understand what you mean, I can't think of it naturally that way. I've used that term way too much to mean something else. Maybe too late but I'd suggest it may be easier to use a new term that conveys the idea you're trying to convey, rather than using one that is already embedded in the community with a different meaning. But if you can convince everyone to broaden the meaning of track, then so be it!

sveinugu commented 2 years ago

I have now realized that supporting unordered sequence collection digests in the way I argue for in the comment above (https://github.com/ga4gh/seqcol-spec/issues/5#issuecomment-994821465) could be implemented as a convention in a specialized server using the current seqcol standard, only with the addition of a custom array. Let's name this array sorted_element_digests. The values of this column could then contain the digest of each "row" of the collection, as if there were a seqcol containing only that one row as a sequence (or any other unique digest for a row would do, really). Then it's only a matter of making sure that all seqcols are sorted by the sorted_element_digests column. Using the above example, an unordered coordinate system sequence collection would then look like this:

https://seqcolserver.org/collection/1da87b/1:
{
    'lengths': [456, 34567, 12345, 567, 23456],
    'names': ['alt_12345', 'chr1', 'chr3', 'alt_23456', 'chr2'],
    'sorted_element_digests': ['13a98d', '1a2d53', '28df89', '9d2339', 'dd87a0']
}

This specialized server could then crawl other seqcol servers and generate and register new unordered coordinate system seqcols, probably by only looking at the lengths and names arrays. Any seqcol with the same content in the names and lengths arrays as in the above example would give rise to the same unordered seqcol with digest 1da87b, independent of the original order of the sequences.
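As a rough illustration of this convention, such a server could derive the record above from the names and lengths arrays of any seqcol along the following lines (a sketch only; sha512t24u and the row serialisation are placeholder assumptions, not the per-row digest the spec might eventually define):

import base64
import hashlib

def sha512t24u(data: bytes) -> str:
    # Placeholder digest: truncated SHA-512, base64url-encoded.
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")

def unordered_coordinate_system(names, lengths):
    # One digest per "row" (here just a name/length pair), then sort all rows by that digest.
    row_digests = [sha512t24u(f"{name},{length}".encode()) for name, length in zip(names, lengths)]
    order = sorted(range(len(names)), key=lambda i: row_digests[i])
    return {
        'lengths': [lengths[i] for i in order],
        'names': [names[i] for i in order],
        'sorted_element_digests': [row_digests[i] for i in order],
    }

# Two seqcols that differ only in sequence order produce the identical record,
# and would therefore be registered under the same unordered digest:
a = unordered_coordinate_system(['chr1', 'chr2', 'chr3'], [34567, 23456, 12345])
b = unordered_coordinate_system(['chr3', 'chr1', 'chr2'], [12345, 34567, 23456])
assert a == b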

It would, however, be useful if the seqcol standard included an array with subset information, priority, or similar, so that one could extract only the core sequences if one wanted.

With this, I think we could finally close the issue of unordered seqcols, as I have here sketched an unofficial way to support unordered sequence collection digests in a seqcol-compliant server by just adding one additional array with additional constraints and logic.

sveinugu commented 2 years ago

Yes I agree there's not a great existing term for that. But to me you're trying to create a new use for an old term. If you can do it, great! I am not arguing against that, just telling you how I interpreted the term. I interpreted it differently from how you used it, and as a result, I am confused when I read some of what you write. For terms like this not in common use, if you want to use them in a way different from what others do, then... you have to make that clear and define them when you use them, and it's going to cause some confusion, until you succeed in training the whole community to use the term in that way.

Personally I think this particular one could be a losing battle -- 'track' is a very old term that is already embedded, at least in my mind, from decades of genome browsers. Even now that I understand what you mean, I can't think of it naturally that way. I've used that term way too much to mean something else. Maybe too late but I'd suggest it may be easier to use a new term that conveys the idea you're trying to convey, rather than using one that is already embedded in the community with a different meaning. But if you can convince everyone to broaden the meaning of track, then so be it!

Well, we have been publicly arguing for such use of the term "track" since 2010 (https://doi.org/10.1186/gb-2010-11-12-r121) and did use it internally in this way 2-3 years before that. Also the community of scientists working with method development for track data is not THAT big!...

Anyway, let's leave it there.

nsheff commented 2 years ago

Any seqcol with the same content in the names and lengths arrays would give rise to the same unordered seqcol with digest 1da87b, independent of the original order of the sequences.

I'm trying to get my head around your suggestion here. If I understand correctly, this hearkens back to the idea of a 'canonical order' for arrays. Here the canonical order is found by taking each element, digesting across arrays, and then sorting those digests. This contrasts with the earlier proposals, which based the order on lexicographically sorting the actual array values, which led to challenges when some arrays are not present. So I think your suggestion is good as it could get around some of those limitations we faced with specifying a canonical order.

In fact, I believe you wouldn't even need the sorted_element_digests array. If you just define the canonical order in this way, can't you simply order them in that way and by so doing compute unordered digests?

sveinugu commented 2 years ago

In fact, I believe you wouldn't even need the sorted_element_digests array. If you just define the canonical order in this way, can't you simply order them in that way and by so doing compute unordered digests?

Yes, good point. That recommendation for creating unordered digests might even make it into the standard, as it will only be a piece of text and not have any consequences for normal "ordered" scenarios.
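A minimal sketch of that recommendation (again with placeholder names, serialisation and digest function, none of which are spec-defined): no extra array is stored; the arrays are simply put into the canonical order defined by the per-element digests before the ordinary top-level digest is computed.

import base64
import hashlib

def sha512t24u(data: bytes) -> str:
    # Placeholder digest: truncated SHA-512, base64url-encoded.
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")

def order_invariant_digest(seqcol):
    keys = sorted(seqcol)                                   # e.g. ['lengths', 'names']
    rows = list(zip(*(seqcol[key] for key in keys)))
    # Canonical order: sort rows by the digest of each row's (placeholder) serialisation.
    rows.sort(key=lambda row: sha512t24u(",".join(map(str, row)).encode()))
    canonically_ordered = {key: [row[i] for row in rows] for i, key in enumerate(keys)}
    # Digest the reordered arrays as usual (the serialisation here is illustrative only).
    return sha512t24u(repr(canonically_ordered).encode())

assert order_invariant_digest({'names': ['chr1', 'chr2'], 'lengths': [34567, 23456]}) == \
       order_invariant_digest({'names': ['chr2', 'chr1'], 'lengths': [23456, 34567]})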

nsheff commented 2 years ago

Above, you detailed an issue with the canonical ordering idea,

any change in the arrays used as secondary keys in the canonical sorting algorithm will trigger an order change in the other columns (except in the length column). Even though these are edge cases, this makes it impossible to make use of digest changes in comparison logic, which was the main argument for option B in the first place.

Does this new way of digesting across arrays solve this? Or does this problem not occur for some other reason?

sveinugu commented 2 years ago

Above, you detailed an issue with the canonical ordering idea,

any change in the arrays used as secondary keys in the canonical sorting algorithm will trigger an order change in the other columns (except in the length column). Even though these are edge cases, this makes it impossible to make use of digest changes in comparison logic, which was the main argument for option B in the first place.

Does this new way of digesting across arrays solve this? Or does this problem not occur for some other reason?

So the previous issue was really about the idea of adding a separate order array, with the assumption that the other arrays would then follow some form of canonical order. The idea was then that any change in order would be reflected only in the order array, while change in content would be reflected in the particular array with the change. In both cases we focused specifically on the semantics of changes in the per-array digests. However, we found it impossible to create a canonical ordering algorithm that would allow this. The issue you raise is that my current idea would also fail in this regard, which it will. However, I believe the main scenario has now changed, and I don't think that matters anymore.

In this proposal, we do not include an order column and do not need to consider the interpretation of that. One of the original ideas was that one should be able to see whether two sequence collections were in the same order merely by comparing the digests of the order array. However, that did not work out. Instead, the goal now is merely to provide a deterministic way of calculating an order-invariant top-level digest. We were previously also trying to define a canonical ordering through ordering the arrays based on the semantics of the arrays, with the two top contenders being "lengths -> names -> sequences" and "lengths -> sequences -> names". Leaving the semantics out of the scenario means that we really just need to find any kind of canonical sorting algorithm, and since we clearly define a digest algorithm, it is a simple solution to define the order based on this algorithm as defined above.

In conclusion: we have arrived at this relatively simple suggestion via a long detour and have collected some baggage along the way that we do not really need to take into consideration if we simplify the issue, as far as I am able to discern. Please let me know if I am thinking incorrectly now.

nsheff commented 2 years ago

Would this require the names column to exist and be unique? I believe it may... or at least it would simplify things.

sveinugu commented 2 years ago

Would this require the names column to exist and be unique? I believe it may... or at least it would simplify things.

No, I don't think so. Total duplications are not a problem, I believe, as the per-sequence digests will be identical and the resulting order thus still deterministic.

nsheff commented 9 months ago

The discussion of order eventually became #19, #52, and others, which in short concluded: