Endpoint/Models for "samples/examples"

cgreene commented 8 years ago

Researchers will select which samples they want to include in the analysis. From a machine learning point of view, these are the examples that are relevant to the researcher. These samples will have various metadata. The GDC Data Portal [ https://gdc-portal.nci.nih.gov/search/s ] has a very nice interface for these metadata. Essentially, the facets on the left for "cases" are the same ones that we would expect to be relevant here.

gwaybio commented 8 years ago

The GDC portal is a good example of a friendly user interface and a good starting point to describe what a sample selector for this type of data should be able to do. For our purposes, however, I think we will need our interface to communicate with the gene selector. We don't necessarily want a user to have the option to select a tissue that is likely to drive poor classifier performance because it does not have enough mutations to contribute. For example, if I choose to classify RAS mutations, I don't necessarily want breast tumors in my classifier, because they will add over 1,000 tumors with few RAS mutations and could saturate the negative samples in the classifier.
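A rough sketch of how the sample selector could consult mutation counts before offering a tissue. The function name `eligible_tissues`, the input shapes, and both thresholds are hypothetical placeholders, not anything agreed on in this thread:

```python
from collections import Counter


def eligible_tissues(samples, mutated_ids, min_positives=20, min_positive_rate=0.05):
    """Return tissues with enough mutated samples to be useful for a classifier.

    samples: iterable of (sample_id, tissue) pairs.
    mutated_ids: set of sample ids carrying a mutation in the selected gene(s).
    The two thresholds are illustrative assumptions and would need tuning.
    """
    totals = Counter()
    positives = Counter()
    for sample_id, tissue in samples:
        totals[tissue] += 1
        if sample_id in mutated_ids:
            positives[tissue] += 1

    # Keep a tissue only if it has enough positives in absolute terms
    # and as a fraction of all samples from that tissue.
    return sorted(
        tissue
        for tissue, total in totals.items()
        if positives[tissue] >= min_positives
        and positives[tissue] / total >= min_positive_rate
    )
```

With something like this, a tissue with over 1,000 tumors and only a handful of mutated samples would be left out of (or flagged in) the tissue choices for that gene selection.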

cgreene commented 8 years ago

I created a quick class diagram of the items that I think have been sufficiently specified, based on discussions thus far, to start implementation of the models [Samples, Genes, Mutations, Mutation Types]. I went ahead and assumed we'd install django-genes and django-organisms in this project, as that lets us use what is there. At least django-genes will need a REST API, but it already provides an Elasticsearch index that will be useful to find the right gene when a user types an identifier.

[Image: django-cognoma class diagram]

I'll create a pull request with the XML form from draw.io that we can edit.

awm33 commented 8 years ago

@cgreene This is great! Would you be able to indicate data type for each? If a field is an enumeration, what the potential values could be?

I assume an auto-incrementing integer `id` primary key on each model. I would also recommend `created_at` and `updated_at` fields on each. We may also want to consider either no deletes or soft deletes using `deleted_at`.
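A minimal sketch of what those common fields and a soft delete could look like. It uses plain Python dataclasses rather than Django models so it runs without any project setup; the class name `TimestampedRecord` and the helper `active` are made up for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now():
    return datetime.now(timezone.utc)


@dataclass
class TimestampedRecord:
    """Hypothetical base record carrying the fields suggested above."""
    id: Optional[int] = None  # would be the auto-incrementing PK in the database
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)
    deleted_at: Optional[datetime] = None  # None means the row is live

    def soft_delete(self):
        # Mark the row as deleted instead of removing it.
        self.deleted_at = _now()
        self.updated_at = self.deleted_at

    @property
    def is_deleted(self):
        return self.deleted_at is not None


def active(records):
    """Filter out soft-deleted records, as default queries would need to do."""
    return [r for r in records if not r.is_deleted]
```

The trade-off with soft deletes is that every default query has to exclude rows where `deleted_at` is set, which is what `active` stands in for here.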

cgreene commented 8 years ago

Will fill in what I can, but we probably need the cognoma/cancer-data team to chime in. This is generally using text, unless I'm absolutely convinced that an enum or more complex approach makes sense. The cancer-data team needs to fill some of these in (like age at diagnosis: I made it an integer, but I'm not sure it actually is one in the data).

Sample:

Site: Short string
Project: Short string
Disease type: Short string
age_at_diagnosis: int
Gender: enum (male, female, unknown)
Vital: enum (alive, deceased, unknown)
days_to_death: int
Race: string? [data team?]
Ethnicity: string? [data team?]
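To make the list above concrete, here is a plain-Python sketch of the Sample fields (dataclasses and enums rather than actual Django models, so it runs standalone). The snake_case names for the capitalized fields, the `from_row` helper, and the fallback to `unknown` for unrecognized values are assumptions; race and ethnicity stay free text until the data team weighs in:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gender(Enum):
    MALE = "male"
    FEMALE = "female"
    UNKNOWN = "unknown"


class Vital(Enum):
    ALIVE = "alive"
    DECEASED = "deceased"
    UNKNOWN = "unknown"


def _optional_int(value):
    # Treat missing or blank cells as None rather than failing.
    if value is None or str(value).strip() == "":
        return None
    return int(value)


def _parse_enum(enum_cls, value):
    # Assumption: unrecognized or missing values map to UNKNOWN.
    try:
        return enum_cls(str(value).strip().lower())
    except ValueError:
        return enum_cls.UNKNOWN


@dataclass
class Sample:
    site: str
    project: str
    disease_type: str
    age_at_diagnosis: Optional[int] = None
    gender: Gender = Gender.UNKNOWN
    vital: Vital = Vital.UNKNOWN
    days_to_death: Optional[int] = None
    race: Optional[str] = None
    ethnicity: Optional[str] = None

    @classmethod
    def from_row(cls, row):
        """Build a Sample from a dict of raw strings (hypothetical input shape)."""
        return cls(
            site=row["site"],
            project=row["project"],
            disease_type=row["disease_type"],
            age_at_diagnosis=_optional_int(row.get("age_at_diagnosis")),
            gender=_parse_enum(Gender, row.get("gender")),
            vital=_parse_enum(Vital, row.get("vital")),
            days_to_death=_optional_int(row.get("days_to_death")),
            race=row.get("race") or None,
            ethnicity=row.get("ethnicity") or None,
        )
```

For example, `Sample.from_row({"site": "example site", "project": "example project", "disease_type": "example disease", "gender": "FEMALE"})` would yield a sample with `gender` set to `Gender.FEMALE` and every other optional field left at its default.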

cgreene commented 8 years ago

Here is the ultra-stripped-down version requested by @aelkner at the meetup last night.

[Image: img_20160810_144027]

ypar commented 8 years ago

Also @awm33, we can start setting things up using subsets of the input data.

The sample table is downloadable from this link here

one example of a mutation table is here

awm33 commented 7 years ago

https://github.com/cognoma/core-service/pull/25

cognoma / core-service

Endpoint/Models for "samples/examples" #2