draft of db schema for G2P associations

heckerma commented 10 years ago

Here's a rough draft of a db schema for G2P associations (in normal form). Currently, it only handles univariate associations.

David Heckerman, Christoph Lippert, Chris Widmer

Study
  id (unique global id)
  Date of entry
  PUBMEDID or link to paper
  Organism
  Cohort(s)
  Primary or meta analysis?
  sample size
  number of tests performed
  All tests reported?
  model used (linear regression, logistic regression, linear mixed model, etc.)
  phenotype transformation used
  test statistic used
  list of primary studies (if meta analysis)

Variant
  id (unique global id)
  Variant type (SNP, methylation, CNV, SNP set, etc.)
  Risk Allele(s)
  Genome locus
  Genome position
  Genome position reference
  Platform used to measure

Phenotype
  id (unique global id)
  Phenotype type (disease, drug response, gene expression, etc.)
  Platform used to measure

Association stats
  p-value
  effect size
  effect size std error
  Bayes factor

Association
  Study id
  Variant id(s)
  Phenotype id(s)
  Association stats (one for initial test, then one for each validation)
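
For illustration only, here is a minimal sketch of how the draft could look as normalized tables for the univariate case, using Python's built-in sqlite3 module. The table and column names are assumptions derived from the field list, only a subset of the Study fields is shown, and the association stats are folded into the association row as a simplification.

```python
import sqlite3

# Hypothetical table/column names derived from the draft's field list.
# Univariate case only: one variant and one phenotype per association.
DDL = """
CREATE TABLE study (
    id           TEXT PRIMARY KEY,   -- unique global id
    entry_date   TEXT,
    pubmed_id    TEXT,               -- or link to paper
    organism     TEXT,
    is_meta      INTEGER,            -- 0 = primary, 1 = meta analysis
    sample_size  INTEGER,
    model_used   TEXT
);
CREATE TABLE variant (
    id            TEXT PRIMARY KEY,  -- unique global id
    variant_type  TEXT,              -- SNP, methylation, CNV, ...
    risk_allele   TEXT,
    locus         TEXT,
    position      INTEGER,
    position_ref  TEXT,
    platform      TEXT
);
CREATE TABLE phenotype (
    id              TEXT PRIMARY KEY,  -- unique global id
    phenotype_type  TEXT,              -- disease, drug response, ...
    platform        TEXT
);
CREATE TABLE association (
    study_id      TEXT NOT NULL REFERENCES study(id),
    variant_id    TEXT NOT NULL REFERENCES variant(id),
    phenotype_id  TEXT NOT NULL REFERENCES phenotype(id),
    p_value       REAL,
    effect_size   REAL,
    effect_se     REAL,
    bayes_factor  REAL,
    PRIMARY KEY (study_id, variant_id, phenotype_id)
);
"""

def build_db() -> sqlite3.Connection:
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clauses
    conn.executescript(DDL)
    return conn

conn = build_db()
conn.execute("INSERT INTO study (id, organism, sample_size) VALUES ('study:1', 'human', 5000)")
conn.execute("INSERT INTO variant (id, variant_type) VALUES ('var:1', 'SNP')")
conn.execute("INSERT INTO phenotype (id, phenotype_type) VALUES ('phe:1', 'disease')")
conn.execute(
    "INSERT INTO association VALUES ('study:1', 'var:1', 'phe:1', 1e-8, 0.12, 0.02, NULL)"
)
rows = conn.execute("SELECT variant_id, p_value FROM association").fetchall()
```

The example ids and values are made up. Leaving bayes_factor as NULL shows that a study can report a p-value without a Bayes factor.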

cmungall commented 10 years ago

Is "phenotype type" a free text field?

We need a way to get stats into the current schema draft. Presumably different methods will have different stats so we may need a generic tag-value metadata system
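
A rough sketch of what a generic tag-value container for stats might look like, in Python. The class name, method names, and tag names are all hypothetical and are not part of any existing schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

StatValue = Union[float, int, str]

@dataclass
class AssociationStats:
    """Generic tag-value container; tag names here are illustrative only."""
    values: Dict[str, StatValue] = field(default_factory=dict)

    def put(self, tag: str, value: StatValue) -> None:
        # Normalize tags so "P_Value" and "p_value" refer to the same entry.
        key = tag.strip().lower()
        if not key:
            raise ValueError("tag must be non-empty")
        self.values[key] = value

    def get(self, tag: str) -> Optional[StatValue]:
        # Returns None when the method behind the record did not report this stat.
        return self.values.get(tag.strip().lower())

# A frequentist method and a Bayesian method report different stats.
frequentist = AssociationStats()
frequentist.put("p_value", 3e-9)
frequentist.put("effect_size", 0.21)
frequentist.put("effect_size_std_error", 0.03)

bayesian = AssociationStats()
bayesian.put("bayes_factor", 150.0)
```

The trade-off is that tags are unconstrained strings, so some agreed vocabulary of tag names would still be needed for records to be comparable.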

heckerma commented 10 years ago

Is "phenotype type" a free text field?

It would be useful to have at least some predefined types, but “other” (with free text) will likely always be useful, as it will be tough to keep up with new types coming online.
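
One possible shape for "predefined types plus other with free text", sketched in Python. The enum members come from the examples in the draft; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class PhenotypeType(Enum):
    DISEASE = "disease"
    DRUG_RESPONSE = "drug response"
    GENE_EXPRESSION = "gene expression"
    OTHER = "other"

@dataclass(frozen=True)
class PhenotypeKind:
    type: PhenotypeType
    other_text: Optional[str] = None  # only meaningful when type is OTHER

    def __post_init__(self) -> None:
        if self.type is PhenotypeType.OTHER and not self.other_text:
            raise ValueError("OTHER requires a free-text description")
        if self.type is not PhenotypeType.OTHER and self.other_text:
            raise ValueError("free text is only allowed with OTHER")

def parse_phenotype_type(label: str) -> PhenotypeKind:
    """Map a label to a predefined type, falling back to OTHER plus free text."""
    text = label.strip()
    known = {t.value: t for t in PhenotypeType if t is not PhenotypeType.OTHER}
    match = known.get(text.lower())
    if match is not None:
        return PhenotypeKind(match)
    return PhenotypeKind(PhenotypeType.OTHER, other_text=text)

known_kind = parse_phenotype_type("Drug response")
new_kind = parse_phenotype_type("microbiome composition")
```

Here "microbiome composition" is a made-up label standing in for a type that has not yet been added to the predefined list.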

Presumably different methods will have different stats

Yes, for example, we allow for both p-value and bayes factor.

From: Chris Mungall [mailto:notifications@github.com] Sent: Thursday, November 20, 2014 5:10 PM To: cmungall/schemas Cc: David Heckerman Subject: Re: [schemas] draft of db schema for G2P associations (#2)


heckerma commented 10 years ago

After a bit more thought, we don't think "Association stats" should include validations. Instead, validations should be recorded in separate studies. Whether an association is validated can be assessed via query.

David, Chris, and Christoph
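
To make "assessed via query" concrete, here is a hedged Python sketch. It assumes a variant-phenotype pair counts as validated when it is significant in at least two distinct studies; the threshold `alpha`, the `min_studies` rule, and all names are assumptions for illustration, not something proposed in this thread.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple

class Association(NamedTuple):
    study_id: str
    variant_id: str
    phenotype_id: str
    p_value: float

def validated_pairs(
    assocs: Iterable[Association],
    alpha: float = 0.05,      # assumed significance threshold
    min_studies: int = 2,     # assumed rule: initial study plus one validation
) -> Set[Tuple[str, str]]:
    """Return (variant, phenotype) pairs significant in enough distinct studies."""
    supporting: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for a in assocs:
        if a.p_value <= alpha:
            supporting[(a.variant_id, a.phenotype_id)].add(a.study_id)
    return {pair for pair, studies in supporting.items() if len(studies) >= min_studies}

records = [
    Association("study:1", "var:1", "phe:1", 1e-9),  # initial finding
    Association("study:2", "var:1", "phe:1", 2e-4),  # separate validation study
    Association("study:1", "var:2", "phe:1", 1e-8),  # never replicated
    Association("study:3", "var:2", "phe:1", 0.40),  # failed validation
]
result = validated_pairs(records)
```

With the made-up records above, only the var:1 / phe:1 pair has two supporting studies.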

kellrott commented 10 years ago

The 'Variant id(s)' and 'Phenotype id(s)' (plural) raise the question of whether we want the topology to represent a regular graph or a multigraph. It's probably better to have plural concepts on both sides and connect them as a regular graph. There is already precedent on the Variant side (the VariantSet structure). Do we need a similar concept on the Phenotype side? Something like a PhenotypeSet, that lets you composite multiple phenotypes together into a single concept (drug resistance AND proliferative)?
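
A small Python sketch of the "plural concepts on both sides" idea: each association becomes a single edge between one set node on the variant side and one on the phenotype side. The classes below are hypothetical stand-ins and are not the actual GA4GH VariantSet definition.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

@dataclass(frozen=True)
class VariantSet:
    variant_ids: FrozenSet[str]

@dataclass(frozen=True)
class PhenotypeSet:
    phenotype_ids: FrozenSet[str]

@dataclass(frozen=True)
class Association:
    study_id: str
    variants: VariantSet
    phenotypes: PhenotypeSet

def edges(assocs: List[Association]) -> List[Tuple[VariantSet, PhenotypeSet]]:
    """One edge per association, connecting a variant set to a phenotype set."""
    return [(a.variants, a.phenotypes) for a in assocs]

# A singleton is just a set of size one, so both cases share one shape.
single = Association(
    "study:1",
    VariantSet(frozenset({"var:1"})),
    PhenotypeSet(frozenset({"phe:drug_resistance"})),
)
plural = Association(
    "study:2",
    VariantSet(frozenset({"var:1", "var:2"})),
    PhenotypeSet(frozenset({"phe:drug_resistance", "phe:proliferative"})),
)
graph = edges([single, plural])
```

The alternative would be to expand the plural association into one edge per variant-phenotype combination, which loses the information that the ids were tested together.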

heckerma commented 10 years ago

yes, there's lots of interest in PhenotypeSet work, e.g., http://biorxiv.org/content/early/2014/05/22/003905 and http://www.nature.com/nmeth/journal/v11/n4/full/nmeth.2848.html

D, C, and C

kellrott commented 10 years ago

I've posted notes about our schema discussions in the main GA4GH issue board (https://github.com/ga4gh/schemas/issues/196). We should move our conversations over there, so the larger group can see what we're working on.

heckerma commented 10 years ago

Thanks Kyle!

From: Kyle Ellrott [mailto:notifications@github.com] Sent: Tuesday, December 02, 2014 11:00 PM To: cmungall/schemas Cc: David Heckerman Subject: Re: [schemas] draft of db schema for G2P associations (#2)


cmungall commented 10 years ago

PhenotypeSets:

For the 'proliferative drug resistance' scenario, the way to model this with the currently proposed schema would be {qualifier: proliferative, phenotype: drug resistance}.
We may end up with multiple ways of saying the same thing: (1) multiple associations to singleton phenotypes, (2) single associations to phenotype sets, (3) single associations to singleton phenotype ontology terms, where the singleton phenotype ontology term is a pre-composed concept.
The semantics have to be made very clear. E.g. presumably a PhenotypeSet would be interpreted as a conjunction of concepts. However, later on, people may want to express disjunctions. We may end up re-inventing OWL inside Avro which would probably not be a good idea. It would be good to get a better idea of scope and use cases before proceeding too far here.
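
To show why the semantics need to be explicit, here is a minimal Python sketch in which a phenotype set carries its own interpretation. The names `SetSemantics`, `PhenotypeSet`, and `matches` are hypothetical, and this is not a proposal for the Avro schema itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Set

class SetSemantics(Enum):
    ALL_OF = "conjunction"   # every member phenotype must be present
    ANY_OF = "disjunction"   # at least one member phenotype must be present

@dataclass(frozen=True)
class PhenotypeSet:
    phenotype_ids: FrozenSet[str]
    semantics: SetSemantics

    def matches(self, observed: Set[str]) -> bool:
        """Check a sample's observed phenotypes against this set."""
        if self.semantics is SetSemantics.ALL_OF:
            return self.phenotype_ids <= observed
        return bool(self.phenotype_ids & observed)

members = frozenset({"drug resistance", "proliferative"})
conj = PhenotypeSet(members, SetSemantics.ALL_OF)
disj = PhenotypeSet(members, SetSemantics.ANY_OF)

observed = {"drug resistance"}
conj_hit = conj.matches(observed)  # False: "proliferative" is missing
disj_hit = disj.matches(observed)  # True: one member is enough
```

The same two member ids give different answers for the same sample, so a set without a stated interpretation is ambiguous. Anything beyond a flat AND/OR flag (nesting, negation) starts to look like the OWL re-invention mentioned above.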

heckerma commented 10 years ago

Actually, in practice, phenotypes are mainly considered in disjunction (to increase power).
