Open tcezard opened 1 year ago
I like the simplicity that requiring an array at level 2 provides. And also the simplicity that the digest algorithm is always used to go from level 2 to level 1. If for instance the value at level 2 is a string, this could still be a novel-length string, while the digest at level 1 will always be digest-size.
However, one solution to your issue might be to add an array property that specifies the level of expansion, so that you can stop expanding your `sorted-sequences` attribute at level 1.
We can definitely discuss this. I can share how I implemented it in my demo implementation, in case it's useful:
So in your case, I would have used a vanilla string for "single-value-attribute" -- and I would not have digested this, so there would be no recursion to retrieve the value. The way I did this was actually using sveinung's second insight: I have a property in the schema that indicates which attributes are digested. This is basically explained in my henge tutorial.
While I can see some rationale in the simplicity of just saying "everything is an array, and everything gets digested" -- I think this is unnecessary overhead for many single-value attributes.
I have a property in the schema that indicates which attributes are digested. This is basically explained in my henge tutorial.
Perhaps this should also be in the standard?
- I did not require all of the level 2 attributes to be stored as arrays. I allowed attributes to be stored as whatever type makes sense.
- I did not require digesting each element, from level 2 to level 1.
I personally think this is what makes the most sense. But this would raise the issue of how to declare such an attribute in the service info. Right now we can leverage the JSON Schema type to declare what type of data the attribute contains:
lengths:
  type: array
  collated: true
  description: "Number of elements, such as nucleotides or amino acids, in each sequence."
  items:
    type: integer
single-value-attribute:
  type: string
  collated: false
  description: ""
The `collated` flag here is not very informative, but it is mandatory, so we probably want to keep it.
The `arrays` and `elements` sections in the return would not apply. We can rename `arrays` to `attributes` or similar to make it more relevant, but we also need something that tests the equality of the values.
At one point I was using the keyword `digested`, kind of like `collated`, so, e.g.:
lengths:
  type: array
  collated: true
  digested: true  # maybe left off as default?
  description: "Number of elements, such as nucleotides or amino acids, in each sequence."
  items:
    type: integer
single-value-attribute:
  type: string
  collated: false
  digested: false  # could be specified here, when necessary?
  description: ""
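To make the `digested` idea concrete, here is a minimal Python sketch of going from level 2 to level 1 while digesting only the attributes flagged in the schema. It is not taken from any existing implementation: `SCHEMA`, `to_level1`, and `canonical` are hypothetical names, compact key-sorted JSON is only a stand-in for whatever canonicalization the spec settles on, and treating `digested` as defaulting to true is an assumption based on the comment in the YAML above.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (GA4GH-style digest)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical(value) -> bytes:
    # Assumption: compact, key-sorted JSON stands in for a real canonicalization step.
    return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")


# Hypothetical schema fragment mirroring the YAML above.
SCHEMA = {
    "lengths": {"type": "array", "collated": True, "digested": True},
    "single-value-attribute": {"type": "string", "collated": False, "digested": False},
}


def to_level1(level2: dict, schema: dict) -> dict:
    """Digest attributes flagged `digested` (assumed default: True); pass the rest through."""
    level1 = {}
    for name, value in level2.items():
        if schema.get(name, {}).get("digested", True):
            level1[name] = sha512t24u(canonical(value))
        else:
            level1[name] = value  # stored as-is, so no recursion is needed to retrieve it
    return level1


level2 = {"lengths": [100, 200, 300], "single-value-attribute": "example-value"}
level1 = to_level1(level2, SCHEMA)
```

With this layout, `level1["lengths"]` is a fixed-size digest string, while `level1["single-value-attribute"]` is still the plain value.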
For comparison: I propose we make these changes to the comparison function result:
- `arrays` to `attributes`. Then `attributes` will show if the single-valued attribute exists in both collections (it works for both collated and uncollated or single-value elements).
- `elements` to `collated_elements`, since it seems to correspond only to collated elements.
Then, we may want to introduce new functionality to the comparison result, such as `other_elements`, to handle some basic comparison for anything that is not listed as `collated`. This could potentially include: 1. uncollated arrays; 2. single-value attributes. But maybe we're getting too deep here, and the comparison spec should be limited as above.
My notes from Oct 18th say what we came to was:
Changes to be made to comparison function:
- `array_elements` on arrays (all arrays, collated or not).
This makes the comparison function terminology more in line with collections that include single-value attributes.
The only other thing to decide is: how do you know what to digest and what to not digest? I guess there are two possibilities:
`array` or `object` values, but don't digest anything with a primitive type
Would the latter work? It would prevent you from digesting a singleton.
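As a rough illustration of the type-based rule, the Python sketch below decides whether to digest purely from the value's type. `should_digest` is a hypothetical helper, not part of any spec or library, and the sample values are made up.

```python
def should_digest(value) -> bool:
    """Type-based rule: digest arrays and objects, pass primitive values through."""
    return isinstance(value, (list, dict))


# Illustrative level 2 values (made up for this example).
level2 = {
    "lengths": [100, 200, 300],                 # array -> digested
    "single-value-attribute": "example-value",  # primitive -> left as-is
    "wrapped-singleton": ["example-value"],     # one-element array -> digested
}

decisions = {name: should_digest(value) for name, value in level2.items()}
```

Under this rule the only way to get a digest for a single value is to wrap it in an array, which is the limitation raised above.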
Is this just an implementation detail or must this be part of the spec?
Here are 3 ways I implemented to show a new compare result
Our decision was to adopt option 3 with one change: add `_count` to the array elements, so it's `a_count`, `b_count`, and `a_and_b_count`.
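For reference, here is a hedged Python sketch of what a comparison result along these lines might look like. Only the names `attributes`, `array_elements`, `a_count`, `b_count`, and `a_and_b_count` come from the discussion. The `a_only`, `b_only`, and `a_and_b` keys, the nesting, and the function name `compare` are assumptions made for illustration, and the actual option 3 output may differ.

```python
from collections import Counter


def compare(a: dict, b: dict) -> dict:
    """Compare two level 2 collections; only array attributes get element counts."""
    a_keys, b_keys = set(a), set(b)
    array_elements = {"a_count": {}, "b_count": {}, "a_and_b_count": {}}

    for name, value in a.items():
        if isinstance(value, list):
            array_elements["a_count"][name] = len(value)
    for name, value in b.items():
        if isinstance(value, list):
            array_elements["b_count"][name] = len(value)
    for name in a_keys & b_keys:
        if isinstance(a[name], list) and isinstance(b[name], list):
            # Multiset intersection: elements present in both arrays, ignoring order.
            shared = Counter(a[name]) & Counter(b[name])
            array_elements["a_and_b_count"][name] = sum(shared.values())

    return {
        "attributes": {
            "a_only": sorted(a_keys - b_keys),
            "b_only": sorted(b_keys - a_keys),
            "a_and_b": sorted(a_keys & b_keys),
        },
        "array_elements": array_elements,
    }


a = {"lengths": [10, 20, 30], "names": ["chr1", "chr2", "chr3"], "single-value-attribute": "x"}
b = {"lengths": [10, 20], "names": ["chr1", "chr2"]}
result = compare(a, b)
```

Single-value attributes show up under `attributes` only, so their presence in both collections can be checked without needing element counts.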
@tcezard is this issue solved, to the point that this can be closed?
I would like to store metadata attributes that only have single values in a sequence collection. How do we see them being represented in the JSON at level 1 and level 2?
Represent them similarly to other attributes
At level 2, are we storing them in a single-value array anyway?
Or directly as plain text?
Then at level 1 they would be digested similarly to the other attributes?
Represent them as a single value at every level
Alternatively, they could be exposed in plain text directly at level 1 and level 2 with no changes.
Level 2
Then level 1
Comparison
The comparison result seems highly dependent on the representation at level 2. If we choose to represent the value at level 2 as a single-value array, then the comparison can be done in the same way as with the other attributes. Other representations might require different infrastructure.
Use cases
There are many use cases for single-value attributes, like `assembly-accession` or `naming-authority`.
But the one use case I have in mind is to store `sorted-sequences` as a single level 1 digest. Since I won't need the detail of sequences already stored in the `sequences` attribute, I can relatively cheaply have an order-relaxed comparison on any attribute by comparing the level 1 digest, without storing the underlying array.
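The `sorted-sequences` use case could look roughly like the Python sketch below. It assumes the attribute is computed by sorting the per-sequence digests and digesting the sorted array once. The helper names and the JSON-based canonicalization are illustrative assumptions, and the input strings are placeholders rather than real sequence digests.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (GA4GH-style digest)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical(value) -> bytes:
    # Assumption: compact JSON stands in for a real canonicalization step.
    return json.dumps(value, separators=(",", ":")).encode("utf-8")


def sorted_sequences_digest(sequence_digests: list) -> str:
    """Order-relaxed level 1 digest: sort the array first, then digest it once."""
    return sha512t24u(canonical(sorted(sequence_digests)))


# Placeholder values standing in for per-sequence digests.
collection_a = ["seq-digest-1", "seq-digest-2", "seq-digest-3"]
collection_b = ["seq-digest-3", "seq-digest-1", "seq-digest-2"]  # same content, new order
collection_c = ["seq-digest-1", "seq-digest-2", "seq-digest-4"]  # different content

same_content = sorted_sequences_digest(collection_a) == sorted_sequences_digest(collection_b)
different_content = sorted_sequences_digest(collection_a) != sorted_sequences_digest(collection_c)
```

Only the resulting digest string would need to be stored at level 1, since the unsorted array is already available through `sequences`.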