Rich type representation in Attributes

ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24

http://ga4gh.org

Apache License 2.0

214 stars 110 forks source link

Rich type representation in Attributes #700

Closed david4096 closed 7 years ago

david4096 commented 8 years ago

In the DWG call today the question came up about how to perform message typing in a more dynamic manner. You can see one way of performing this task here by using the protobuf oneof field and reflection. This PR attempts to close #607 by adding richer type representation, and also close #445 by adding integers as one of the available types. Big thanks to @calbach and @diekhans for providing the basis for this in Sequence Annotations, this PR extends that work to the rest of the data model.

The info field is used on many GA4GH messages and acts as a way to "tag" a message with data that are have not been represented by the named fields. The field items are lists of internal protobuf types, and their type information is retained with the message itself. This allows code to inspect the info field and perform tasks depending on the type.

With Sequence Annotations we saw the addition of a similar spirited effort that uses GA4GH types in a protobuf oneof field to allow typed information to be stored in a Feature message. This improves data interchange by allowing ExternalIdentifier or OntologyTerm messages to be used as values in the attributes map.

This PR attempts to unify the approach by including the strengths of the ListValue method used by the info field with the extensibility of the Attributes approach from the Features protocol.

To carry this out, the Attributes message is moved to metadata.proto so that it can be used amongst the other protocols. The info field was replaced for messages that had it with the field attributes. I've renamed the vals field to attr, since I found the repetition of vals...values confusing in code.

I've also expanded the available types in an attribute value message (int32, int64, bool, double, null). This list can be extended with other GA4GH messages as needed.

Added attributes to the Variant Annotation Set message and RNA expression messages.

Please view the iPython notebook example that shows how to work with the attributes field. More information on the oneoffield can be viewed here.

I've included an example below of how JSON serialization of the attributes field as it appears for the variant in json_intro.rst. I hope this addresses any concerns that fidelity of the data is lost when using JSON serialization:

      "attributes": {
          "attr": {
              "numberOfPapers": {
                  "values": [
                      {"int32Value": 11}
                  ]
              },
              "variantFacts": {
                  "values": [
                      {"stringValue": "is_long"},
                      {"stringValue": "is_interesting"}
                  ]
              }
          }
      }

andrewjesaitis commented 8 years ago

+1 I really like this idea.

First, since one of the languages of the genetics community (python) is untyped anything that brings some degree of type integrity is welcome.

Second, standardizing this method of creating space for secondary (and unforeseen) attributes that people need to store is a big plus as well.

One shortcoming that was discussed was the storage of arbitrary object types, but this isn't supported any better in the current implementation. In either case, users would need to serialize their object to text and deserialize it later to work on it. I think this is not a primary use case for the info and the vast majority of time users will be storing primitive types.

david4096 commented 8 years ago

Thanks @andrewjesaitis, that's right! You'll still be able to store arbitrary JSON strings in string_value messages as is done now.

mcourtot commented 8 years ago

On the MTT side, we have almost exactly the same structure:

// characteristics object uses lists of BioCharacteristic objects to describe // diseases and phenotypes associated with this Individual. Characteristics characteristics = 5;

// BioCharacteristic is a prototype wrapper object for single instances // of phenotypes, diseases ... which may be described through one or several // ontology terms message BioCharacteristic { string description = 1; repeated OntologyTerm ontologyTerms = 2; }

// Characteristics message Characteristics { repeated BioCharacteristic diseases = 1; repeated BioCharacteristic phenotypes = 2; }

We're happy updating the names of the attributes if this is helpful.

@mbaudis created a new branch, see https://github.com/ga4gh/schemas/blob/metadata-bio-characteristics/src/main/proto/ga4gh/bio_metadata.proto. It would be good to coordinate, use it across and merge the branch.

mcourtot commented 8 years ago

To clarify, we propose to add an object of type biocharacteristic to the values allowed for attributes.

ejacox commented 7 years ago

How often are typed values needed? Is this overkill?

ejacox commented 7 years ago

My question gets at documenting the motivation for this change. The discussion is scattered in several PRs and issues. I will try to summarize: The motivation is to be able to precisely represent the data and to be able to recreate the original state of the data (roundtrip), e.g. if passing variants from a vcf file, then be able to recreate the vcf file from the data in the message. Therefore, there is a need to have lists of typed values that can be scalar or array values. Also, the data can be hierarchical.

The drawback of this approach is that it is complicated and will require more complex processing just to get a simple value. Will this impede acceptance of GA4GH by consumers of the data? Is the goal of roundtrip even attainable? The formats are not even precisely defined (see #505).

My impression of the info field was that it is for metadata, most often a string value. What if we keep info as well as adding attributes where it is necessary?

david4096 commented 7 years ago

Thanks @ejacox, great questions! The basic problem is that folks want to be able to tag with Ontology Terms, when they're available, and define loosely structured hierachical metadata. The other messages made available as AttributeValues (experiment, program, analysis) are aspirational and provided for completeness and don't interfere with the normal use cases.

The attributes message as it as used in sequence annotations provides the highest fidelity of interchange, and this PR brings that feature to the rest of the API. It answers the question, "what do I do with my JSON" in as final of a manner as we can do at this time.

In terms of keeping the info field, the attributes message is the same for the use case of the info field. I chose to rename it because I believe it's semantically more clear. The differences from the current info field are being able to nest keys and use some GA4GH messages as tags. https://github.com/ga4gh/schemas/pull/700/commits/a4a8bcc8c69a18efb8f5624129a0e281a0a512d0#diff-4fd17066615aa04304c49ac00877970bR104

dataset.info['version'].values.add().string_value = "version"
###
dataset.attributes.attr['version'].values.add().string_value = "version"
dataset.attributes.attr['species'].values.add().ontology_term.MergeFrom(myOntologyTerm)

You can compare the proto that backs the info field with the contents of this PR (https://github.com/google/protobuf/blob/master/src/google/protobuf/struct.proto#L93). Hopefully this will be largely code neutral and adds features that are needed to properly model loosely structured metadata. It also provides a path to better define the types represented in the metadata, but like the info field, will likely be used for string: [string] mappings much of the time.

The feature that results in the reference server is being able to provide an interface for interchanging JSON metadata, or any other hierarchically structured message of a mixture of GA4GH and internal protobuf types. Inspirations for nesting keys was developed to interchange structured tags in tagstorm format http://hgwdev.cse.ucsc.edu/~kent/tagStorm/testTagStorm.html .

A common use case request has been modeling loosely structured data, especially metadata, for example when interchanging phenotype data. The sequence annotations API's designers (@calbach @diekhans ) offered this message, this PR hopes to make the API consistent by supplying a single robust tagging method. I hope others will offer their perspective.

At first I found these data structures (info field) extremely overwrought for interchanging simplistic data. I have experience working with tag-bags that map strings to strings, and not being able to do this in just a few characters of code was frustrating and added a steep learning curve.

I was able to finally convince myself that type security and hierarchical mapping was important when working with variant set metadata in the variant annotations API. Often, analyses provide annotations in an unstructured manner, and these annotations were often nested arrays that needed to be split. It was my hope that no one would ever have to guess about whether a field is comma separated, contains booleans or strings, ontology terms or external identifiers, when using this API.

Hopefully, clients will obscure the type reflection that one would like when working with these attributes. Thanks for your considerations @ejacox. @bwalsh had stated something similar about the info field in the past, which I understand. However, I believe that type security, hierarchy, and using our own type system will enable a lot of common interchange use cases.

david4096 commented 7 years ago

Following discussion with @ejacox, we will be modifying this PR to allow other metadata entries to be gathered together. https://github.com/ga4gh/schemas/pull/690#issuecomment-240509046

The new signature will be variant.metadata.attributes, which will allow us to store all shared metadata elements on a single message in the proto.

andrewjesaitis commented 7 years ago

Hmmm. Don't think I really understand the purpose of this and think it may obscure the original intention of the info field.

If I remember correctly, the purpose of the info field was to store data defined in source that didn't explicitly have a place in the schema. Moving it to be under a per record metadata field seems to change or at least obscure this purpose, since it suggests that this is data about the record rather than data "of" the record but that didn't have an explicit field for it.

As an example, I don't really view the data in a VCF's INFO field as metadata, it is data about a variant. I wouldn't expect Sift or Polyphen scores to be metadata, but under this change it is where they would be stored.

ejacox commented 7 years ago

We propose to keep the timestamp info explicit (not in attributes/info). Rather, any info (metadata) that applies to data in general will be put into a metadata message. Explicit fields will still be explicit.

Our motivation is to gather all purpose fields in one place. With timestamps, we were beginning to put them all over the place. When we wanted to change them, we needed to touch many messages. This way, any future changes (type change, description change, etc.) to timestamps will occur in one place.

Does that address your concerns and still align with original intent @andrewjesaitis ?

david4096 commented 7 years ago

An alternative would be to add the metadata message that has named metadata fields like created and updated and then maintain the attributes at the level they are.

Given the Variant Set Metadata message uses the word "metadata", I can't mount much of an argument against the suggestion to use it as an accessor for the attributes.

Ideally only developers have to see this part of the model, and if you look at how applications have to reconcile the meaning of tags in the Variant protocol, it might make more sense to call it metadata as opposed to info, I believe. It would be great to provide a similar self-description protocol in other parts of the API. We need to try to be VCF compliant, but not rely on it.

I would identify one problem as mixing the term "metadata" here, because the variant set metadata serves a number of purposes. It has ID resolution, descriptions, and type definitions (to some extent). So a variant set provides metadata about a variant's attributes, and a server provides metadata about a variant's record.

As pointed out by @ejacox, we need to better specify the use cases around the timestamp. It's not clear if it is meant to be provided by the file. If it is left up to interpretation, it is a server specific message about a record, and maybe should go in a different named field: record_metadata.

mbaudis commented 7 years ago

@ejacox I do not get this. Obviously, in GA4GH we use "metadata" roughly as "everything but the sequence", not just as describing housekeeping data associated with a message.

And this is a view shared broadly outside GA4GH, when talking about data sharing concepts etc.

So, please, try not to mix this into "named attributes and then accompanying metadata" or such.

For time stamps and ids, I personally prefer the use of "housekeeping" attributes. The current info field just is for unassigned data.

So general agreement with @david4096, though IMO created/updated (please read up all the scattered discussions & history, including previous better naming record_created ...) are part of the record (and technically different from server time stamps; e.g you may send a record to another host without changing the values).

andrewjesaitis commented 7 years ago

I think I'm in agreement with @david4096 suggestion of keeping attributes as a typed key-value store to hold data that doesn't explicitly have a field in the model. So I think this PR currently satisfies that.

If we want to move timestamps to a separate metadata message that should be a separate PR. I'll throw my timestamp thoughts on the other issue to keep this PR focused.