ga4gh / refget

GA4GH Refget specifications docs
https://ga4gh.github.io/refget

How to store and represent and compare non collated single value attributes in a sequence collection #57

Open tcezard opened 1 year ago

tcezard commented 1 year ago

I would like to store metadata attributes that have only a single value in a sequence collection. How do we see them being represented in the JSON at level 1 and level 2?

Represent them similarly to other attributes

At level 2, are we storing them in a single value array anyway?

{
  ...
  "length": [10, 20],
  "single-value-attribute": ["test"]
}

or directly as plain text?

{
  ...
  "length": [10, 20],
  "single-value-attribute": "test"
}

Then at level 1 they would be digested similarly to the other attributes?

{
  ...
  "length": "8djrpzjdbsoeghbadoadq.",
  "single-value-attribute": "psuhfbsjwttzaywhdjsid"
}
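
A minimal sketch of this first option, where every level 2 attribute (single-value ones included) is digested to produce level 1. It assumes a GA4GH-style truncated digest (SHA-512, first 24 bytes, base64url) and uses compact, key-sorted JSON as a stand-in for proper canonicalization; the helper names `sha512t24u`, `canonical` and `to_level1` are illustrative, not part of any spec.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to the first 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical(value) -> bytes:
    # Assumption: compact, key-sorted JSON approximates the canonical form.
    return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")


def to_level1(level2: dict) -> dict:
    """Digest every attribute, whatever its type."""
    return {name: sha512t24u(canonical(value)) for name, value in level2.items()}


level2 = {"length": [10, 20], "single-value-attribute": ["test"]}
level1 = to_level1(level2)
print(level1)
```

Note that `["test"]` and `"test"` serialize differently, so the choice of level 2 representation changes the level 1 digest.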

Represent them as single value in every level

Alternatively, they could be exposed directly as plain text at both level 1 and level 2, with no change between levels.

Level 2

{
  ...
  "length": [10, 20],
  "single-value-attribute": "test"
}

Then level 1

{
  ...
  "length": "8djrpzjdbsoeghbadoadq.",
  "single-value-attribute": "test"
}

Comparison

The comparison result seems highly dependent on the representation at level 2. If we choose the single-value-array representation at level 2, then the comparison can be done in the same way as for the other attributes. Other representations might require different infrastructure.

Use cases

There are many use cases for single-value attributes, such as assembly-accession or naming-authority.

But the one use case I have in mind is to store sorted-sequences as a single level 1 digest. Since I won't need the details of the sequences, which are already stored in the sequences attribute, I can relatively cheaply get an order-relaxed comparison on any attribute by comparing the level 1 digests, without storing the underlying array.
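
The sorted-sequences idea can be sketched as follows: sort the per-sequence digests, digest the sorted array once, and keep only that string. The digest function (SHA-512 truncated to 24 bytes, base64url, over compact JSON) and the name `sorted_sequences_digest` are assumptions made for illustration.

```python
import base64
import hashlib
import json


def digest(value) -> str:
    # Assumption: truncated SHA-512 over compact, key-sorted JSON.
    payload = json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(hashlib.sha512(payload).digest()[:24]).decode("ascii")


def sorted_sequences_digest(sequence_digests: list) -> str:
    """Order-insensitive fingerprint: digest of the sorted array."""
    return digest(sorted(sequence_digests))


# Two collections with the same sequences in a different order (made-up values).
a = ["seq-digest-2", "seq-digest-1", "seq-digest-3"]
b = ["seq-digest-1", "seq-digest-3", "seq-digest-2"]
same_content = sorted_sequences_digest(a) == sorted_sequences_digest(b)
print(same_content)
```

Only the single digest string needs to be stored with the collection; the sorted array itself can be discarded.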

sveinugu commented 1 year ago

I like the simplicity that requiring an array at level 2 provides. And also the simplicity that the digest algorithm is always used to go from level 2 to level 1. If for instance the value at level 2 is a string, this could still be a novel-length string, while the digest at level 1 will always be digest-size.

However, one solution to your issue might be to add an array property that specifies the level of expansion, so that you can stop expanding your sorted-sequences attribute at level 1.

nsheff commented 12 months ago

We can definitely discuss this. I can share how I implemented it in my demo implementation, in case it's useful:

  1. I did not require all of the level 2 attributes to be stored as arrays. I allowed attributes to be stored as whatever type makes sense.
  2. I did not require digesting each element, from level 2 to level 1.

So in your case, I would have used a vanilla string for "single-value-attribute" -- and I would not have digested this, so there would be no recursion to retrieve the value. The way I did this was actually using sveinung's second insight: I have a property in the schema that indicates which attributes are digested. This is basically explained in my henge tutorial.

While I can see some rationale in the simplicity of just saying "everything is an array, and everything gets digested" -- I think this is unnecessary overhead for many single-value attributes.

sveinugu commented 12 months ago

I have a property in the schema that indicates which attributes are digested. This is basically explained in my henge tutorial.

Perhaps this should also be in the standard?

tcezard commented 11 months ago

  1. I did not require all of the level 2 attributes to be stored as arrays. I allowed attributes to be stored as whatever type makes sense.
  2. I did not require digesting each element, from level 2 to level 1.

I personally think this is what makes the most sense. But it raises the question of how to declare such an attribute in the service info. Right now we can leverage the JSON Schema type to declare what kind of data the attribute contains:

lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
single-value-attribute:
    type: string
    collated: false
    description: ""

nsheff commented 11 months ago

At one point I was using the keyword digested, kind of like collated, so, e.g.:

lengths:
    type: array
    collated: true
    digested: true  # maybe left off as default?
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
single-value-attribute:
    type: string
    collated: false
    digested: false  # could be specified here, when necessary?
    description: ""
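
A rough sketch of how a `digested` flag like this could drive the level 2 to level 1 step. The `SCHEMA` dict is a hypothetical in-memory form of the YAML above, the digest function is an assumed truncated SHA-512 over compact JSON, and treating a missing `digested` key as true follows the "maybe left off as default?" remark rather than any settled rule.

```python
import base64
import hashlib
import json

# Hypothetical in-memory form of the schema above; only the flags matter here.
SCHEMA = {
    "lengths": {"type": "array", "collated": True, "digested": True},
    "single-value-attribute": {"type": "string", "collated": False, "digested": False},
}


def digest(value) -> str:
    # Assumption: truncated SHA-512 over compact, key-sorted JSON.
    payload = json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(hashlib.sha512(payload).digest()[:24]).decode("ascii")


def to_level1(level2: dict, schema: dict) -> dict:
    level1 = {}
    for name, value in level2.items():
        # A missing "digested" key is treated as true (an assumed default).
        if schema.get(name, {}).get("digested", True):
            level1[name] = digest(value)
        else:
            level1[name] = value  # passed through unchanged, no recursion needed
    return level1


level1 = to_level1({"lengths": [10, 20], "single-value-attribute": "test"}, SCHEMA)
print(level1)
```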

For comparison: I propose we make these changes to the comparison function result:

Then, we may want to introduce new functionality to the comparison result, such as other_elements, to handle some basic comparison for anything that is not listed as collated. This could potentially include: 1. uncollated arrays; 2. single-value attributes. But, maybe we're getting too deep here, and the comparison spec should be limited as above.

nsheff commented 10 months ago

My notes from Oct 18th say what we came to was:

Changes to be made to comparison function:

This makes the comparison function terminology more in line with collections that include single-value attributes.

The only other thing to decide is: how do you know what to digest and what to not digest? I guess there are two possibilities:

  1. specify it in the schema; or
  2. digest any array or object values, but don't digest anything with a primitive type

Would the latter work? It would prevent you from digesting a singleton.

Is this just an implementation detail or must this be part of the spec?
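
The second possibility (decide by type, with no schema flag) might look like the sketch below. The digest function is an assumed truncated SHA-512 over compact JSON, and `to_level1_by_type` is an illustrative name.

```python
import base64
import hashlib
import json


def digest(value) -> str:
    # Assumption: truncated SHA-512 over compact, key-sorted JSON.
    payload = json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(hashlib.sha512(payload).digest()[:24]).decode("ascii")


def to_level1_by_type(level2: dict) -> dict:
    """Digest arrays and objects; leave primitive values as they are."""
    return {
        name: digest(value) if isinstance(value, (list, dict)) else value
        for name, value in level2.items()
    }


level1 = to_level1_by_type({"lengths": [10, 20], "single-value-attribute": "test"})
print(level1)
```

Under this rule a bare string such as `"test"` can never be digested; the only way to get a digest for a single value is to wrap it in an array, which is the singleton limitation raised above.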

nsheff commented 10 months ago

Here are three ways I implemented of showing a new compare result:

Option 1: attribute first

image

Option 2: collection first

image

Option 3: collection first without 'total' keyword

image

nsheff commented 9 months ago

Our decision was to adopt option 3 with one change: add _count to the array elements, so it's a_count, b_count, and a_and_b_count.
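
For illustration, the three counts could be computed as in the sketch below. Only the names a_count, b_count and a_and_b_count come from the decision; the surrounding result layout, the function name `compare_array_elements`, and counting shared elements as a multiset overlap are assumptions, since the option 3 screenshot is not reproduced here.

```python
from collections import Counter


def compare_array_elements(a: dict, b: dict) -> dict:
    """Per-attribute element counts for array attributes present in both collections."""
    result = {"a_count": {}, "b_count": {}, "a_and_b_count": {}}
    for name in sorted(set(a) & set(b)):
        if not (isinstance(a[name], list) and isinstance(b[name], list)):
            continue  # single-value attributes are skipped in this sketch
        # Multiset intersection: an element repeated in both counts once per shared copy.
        overlap = Counter(a[name]) & Counter(b[name])
        result["a_count"][name] = len(a[name])
        result["b_count"][name] = len(b[name])
        result["a_and_b_count"][name] = sum(overlap.values())
    return result


a = {"lengths": [10, 20, 30], "names": ["chr1", "chr2", "chr3"], "single-value-attribute": "test"}
b = {"lengths": [20, 10], "names": ["chr2", "chr1"], "single-value-attribute": "test"}
counts = compare_array_elements(a, b)
print(counts)
```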

nsheff commented 7 months ago

@tcezard is this issue solved, to the point that this can be closed?