ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Individual Phenotype Representation #254

Open kellrott opened 9 years ago

kellrott commented 9 years ago

I would like to question the way phenotypes are currently embedded in the 'Individual' structure ( https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/metadata.avdl#L96 ). The comments there ask 'Is this the right representation?', and I would point out that the Genotype2Phenotype group is currently working on the 'Association' data structure ( https://github.com/kellrott/schemas/blob/g2p/src/main/resources/avro/genotypephenotype.avdl#L97 ).

Under this schema, the phenotype would be linked to the Individual via an 'Association', which would make it possible to attach evidence for the association. The same association data structure can also be used to link samples, phenotypes and genomic features to phenotypes.
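As a rough illustration of the idea above, here is a minimal Python sketch of an association record that links an individual to a phenotype and carries evidence. It is not the actual Avro definition from the g2p branch: the class names, field names, and the `phenotypes_for_individual` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Evidence:
    # Hypothetical evidence record; the real schema may look different.
    description: str
    source: Optional[str] = None


@dataclass
class Association:
    # The phenotype is always present; exactly one connector is expected
    # to be filled in (field names are assumptions for this sketch).
    phenotype_id: str
    individual_id: Optional[str] = None
    sample_id: Optional[str] = None
    variant_id: Optional[str] = None
    evidence: List[Evidence] = field(default_factory=list)


def phenotypes_for_individual(associations: List[Association],
                              individual_id: str) -> List[str]:
    """Return the phenotype ids linked to one individual."""
    return [a.phenotype_id for a in associations
            if a.individual_id == individual_id]


associations = [
    Association("pheno-1", individual_id="ind-1",
                evidence=[Evidence("clinical examination")]),
    Association("pheno-2", sample_id="sample-7"),
    Association("pheno-3", individual_id="ind-1"),
]
print(phenotypes_for_individual(associations, "ind-1"))
```

The point of the indirection is that the Individual record no longer has to embed its phenotypes; they are found by querying associations, and each link can carry its own evidence.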

mbaudis commented 9 years ago

Thanks for pointing this out. We'll have to look into this, but at first glance the AssociationType, for example, is way too rigid/overspecified. Also, we're getting rid of enums where possible ...

enum AssociationType { VARIANT_PHENOTYPE, GENOMICFEATURE_PHENOTYPE, SAMPLE_PHENOTYPE, INDIVIDUAL_PHENOTYPE }

kellrott commented 9 years ago

The enum is an attempt to get a data structure that 'subclasses'. The association represents multiple possible associations, but rather than putting each of them into a different data structure (like a VariantPhenotypeAssociation and a SamplePhenotypeAssociation structure), a single 'super' structure is used. The enumeration is a quick way for the user to determine the association type, rather than having to scan each possible permutation of non-null connectors. At the same time, only particular associations are valid/make sense. An association connecting a Sample and an Individual does not need evidence; a Sample should simply belong to an Individual.
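The 'super structure plus discriminator' pattern described above could be sketched in Python as follows. This assumes each association type corresponds to exactly one connector field; the connector names (`variant`, `sample`, and so on) and the `is_consistent` helper are hypothetical and not part of the proposed schema.

```python
from enum import Enum
from typing import Dict, Optional


class AssociationType(Enum):
    # Each member's value names the one connector it expects to be set
    # (the connector names are assumptions for this sketch).
    VARIANT_PHENOTYPE = "variant"
    GENOMICFEATURE_PHENOTYPE = "genomic_feature"
    SAMPLE_PHENOTYPE = "sample"
    INDIVIDUAL_PHENOTYPE = "individual"


def is_consistent(declared: AssociationType,
                  connectors: Dict[str, Optional[str]]) -> bool:
    """True if the only non-null connector is the one the type declares."""
    filled = {name for name, value in connectors.items() if value is not None}
    return filled == {declared.value}


good = {"variant": None, "sample": "sample-7", "individual": None}
bad = {"variant": "var-3", "sample": "sample-7", "individual": None}
print(is_consistent(AssociationType.SAMPLE_PHENOTYPE, good))
print(is_consistent(AssociationType.SAMPLE_PHENOTYPE, bad))
```

A reader only has to look at the declared type to know which connector to follow, while a validator can still check that the declared type and the filled-in fields agree.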

mbaudis commented 9 years ago

Samples don't have to be children of individuals; they can be pooled, environmental ... There is a larger scope, not only specific to human disease etc.

But my comment is more about the technical aspect: an enum is a very rigid structure, which can only be modified with schema updates. We just got rid of the one for GeneticSex (not solving it, though ...).

I'm actually quite glad that this is now at a stage where we can discuss and modify it here ;-)

kellrott commented 9 years ago

Thank you for your comments. One of the questions about this data structure is how much we should try to 'protect' users from creating bad data. As it stands, they could fill in the wrong optional fields for the declared type. So there are three ways to go:
1) Force the enumeration, because it refers to components of the schema and therefore shouldn't change unless the schema itself is changing.
2) Use a string field that should contain a predefined string, i.e. 'variant_to_phenotype', but that also allows users to fill in complete nonsense.
3) Completely remove the 'type' declaration, and force the end user to infer the association type by scanning which connector fields are non-null.
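To make the trade-off between the three options concrete, here is a small Python sketch. Only the string 'variant_to_phenotype' comes from the comment above; the other value, the connector names, and the three helper functions are assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, Optional


class AssociationType(Enum):
    VARIANT_PHENOTYPE = "variant_to_phenotype"
    SAMPLE_PHENOTYPE = "sample_to_phenotype"  # hypothetical value


def parse_enum(value: str) -> AssociationType:
    """Option 1: a closed set; unknown values raise ValueError."""
    return AssociationType(value)


def parse_string(value: str) -> str:
    """Option 2: free string; anything is accepted, including nonsense."""
    return value


def infer_type(connectors: Dict[str, Optional[str]]) -> str:
    """Option 3: no type field; derive it from the non-null connector."""
    filled = [name for name, value in connectors.items() if value is not None]
    if len(filled) != 1:
        raise ValueError(f"expected exactly one connector, got {filled}")
    return f"{filled[0]}_to_phenotype"


print(parse_enum("variant_to_phenotype"))
print(parse_string("complete nonsense"))
print(infer_type({"variant": "var-3", "sample": None}))
```

Option 1 rejects bad input early but needs a schema change for every new type, option 2 never needs a schema change but validates nothing, and option 3 pushes the work of interpretation onto every consumer.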

I'm curious what the GA4GH community thinks about this.

buske commented 9 years ago

@kellrott As one more example, you can take a look at the Matchmaker API's representation of phenotype terms (called Features): PR #258. They're basically OntologyTerms, but have some additional metadata (ageOfOnset and observed at the moment).
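A minimal sketch of what such a Feature might look like, assuming it is an ontology term extended with the two extra fields named above. The Python class layout, the snake_case field names, and the example terms are illustrative assumptions, not the Matchmaker API definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OntologyTerm:
    id: str
    label: Optional[str] = None


@dataclass
class Feature(OntologyTerm):
    # Extra metadata on top of the plain ontology term (names and types
    # are assumptions for this sketch).
    observed: bool = True
    age_of_onset: Optional[str] = None


def observed_ids(features: List[Feature]) -> List[str]:
    """Ids of the features explicitly recorded as present."""
    return [f.id for f in features if f.observed]


features = [
    Feature("HP:0001250", "Seizures", observed=True,
            age_of_onset="HP:0003577"),
    Feature("HP:0000252", "Microcephaly", observed=False),
]
print(observed_ids(features))
```

Recording `observed=False` explicitly lets a record state that a phenotype was looked for and found absent, which is different from the phenotype simply not being listed.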

pgrosu commented 9 years ago

I am in favor of a controlled vocabulary, since it makes searches more robust. It also makes it possible to find things similar to your phenotype search. The one thing to take into account, which I've seen in the past, is the case where some categories offered such a huge number of choices that people selected the most general one, which made later integrative searches comparing datasets almost impossible. Most folks prefer to keep filtering their results with new searches rather than writing one long, detailed search. So keeping these categories as clean as possible will make them more user- and scientist-friendly.
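One way a controlled vocabulary can make searches more robust is by letting a general query also match records annotated with more specific terms. The sketch below assumes a tiny, made-up term hierarchy; the `PARENT` table, the term names, and both functions are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical controlled vocabulary: each term maps to its parent term.
PARENT: Dict[str, Optional[str]] = {
    "phenotypic abnormality": None,
    "seizure": "phenotypic abnormality",
    "febrile seizure": "seizure",
    "microcephaly": "phenotypic abnormality",
}


def with_ancestors(term: str) -> Set[str]:
    """Return the term itself plus every ancestor up to the root."""
    if term not in PARENT:
        raise ValueError(f"not in controlled vocabulary: {term!r}")
    found: Set[str] = set()
    current: Optional[str] = term
    while current is not None:
        found.add(current)
        current = PARENT[current]
    return found


def search(records: Dict[str, List[str]], query: str) -> List[str]:
    """Ids of records annotated with the query term or a descendant of it."""
    if query not in PARENT:
        raise ValueError(f"not in controlled vocabulary: {query!r}")
    return sorted(rid for rid, terms in records.items()
                  if any(query in with_ancestors(t) for t in terms))


records = {
    "r1": ["febrile seizure"],
    "r2": ["microcephaly"],
    "r3": ["seizure"],
}
print(search(records, "seizure"))
```

Because unknown terms are rejected outright, a typo in a query fails loudly instead of silently returning nothing, and successive narrower queries can be used to filter a broad result set.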