cgreene closed 7 years ago
The GDC portal is a good example of a friendly user interface and a good starting point for describing what a sample selector for this type of data should be able to do. For our purposes, however, I think our interface will need to communicate with the gene selector. We don't necessarily want to let a user select a tissue that is likely to drive poor classifier performance because it does not have enough mutations to contribute. For example, if I choose to classify RAS mutations, I don't necessarily want breast tumors in my classifier, because they will add over 1,000 tumors with few RAS mutations and could saturate the negative samples in the classifier.
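As a rough sketch of the kind of check the sample selector could run once a gene is chosen: count mutated samples per tissue and only offer tissues that clear a minimum. Everything here is an assumption for illustration, including the `eligible_tissues` name, the `tissue` and `sample_id` fields, both thresholds, and the made-up counts; none of it comes from the actual data.

```python
from collections import Counter


def eligible_tissues(samples, mutated_ids, min_positives=15, min_rate=0.05):
    """Return tissues with enough mutated samples to be worth offering.

    samples: list of dicts with hypothetical "sample_id" and "tissue" keys.
    mutated_ids: set of sample ids carrying a mutation in the chosen gene.
    The thresholds are placeholders, not tuned values.
    """
    totals = Counter(s["tissue"] for s in samples)
    positives = Counter(
        s["tissue"] for s in samples if s["sample_id"] in mutated_ids
    )
    keep = []
    for tissue, total in totals.items():
        pos = positives[tissue]
        # Require both an absolute count and a minimum fraction of positives.
        if pos >= min_positives and pos / total >= min_rate:
            keep.append(tissue)
    return sorted(keep)


# Synthetic example: many breast samples with few mutations,
# fewer pancreas samples with many mutations.
samples = (
    [{"sample_id": f"BRCA-{i}", "tissue": "breast"} for i in range(1000)]
    + [{"sample_id": f"PAAD-{i}", "tissue": "pancreas"} for i in range(200)]
)
mutated_ids = {f"BRCA-{i}" for i in range(5)} | {f"PAAD-{i}" for i in range(150)}

print(eligible_tissues(samples, mutated_ids))
```

With these invented numbers, breast would be hidden from the selector and pancreas would remain. Whether to hide such tissues or just warn about them is a UI decision this sketch does not settle.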
I created a quick class diagram of the items that I think have been sufficiently specified, based on discussions thus far, to start implementation of the models [Samples, Genes, Mutations, Mutation Types]. I went ahead and assumed we'd install django-genes and django-organisms in this project, as that lets us use what is there. At least django-genes will need a REST API, but it already provides an Elasticsearch index that will be useful for finding the right gene when a user types an identifier.
I'll create a pull request with the XML form from draw.io that we can edit.
@cgreene This is great! Would you be able to indicate the data type for each field and, if a field is an enumeration, what the potential values could be?
I assume an auto-incrementing integer id PK on each model. I would also recommend created_at and updated_at fields on each. We may also want to consider either no deletes or soft deletes using deleted_at.
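To make the soft-delete idea concrete, here is a minimal sketch. In Django this would probably live in an abstract base model shared by all the models; the version below uses only the standard-library sqlite3 module so that it runs on its own. The `sample` table, its `disease` column, and the helper names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sample (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        disease TEXT NOT NULL,
        created_at TEXT NOT NULL,
        updated_at TEXT NOT NULL,
        deleted_at TEXT  -- NULL means the row is still active
    )
    """
)


def create_sample(disease):
    """Insert a row and return its auto-assigned id."""
    ts = now()
    cur = conn.execute(
        "INSERT INTO sample (disease, created_at, updated_at) VALUES (?, ?, ?)",
        (disease, ts, ts),
    )
    return cur.lastrowid


def soft_delete(sample_id):
    """Mark a row as deleted without removing it."""
    ts = now()
    conn.execute(
        "UPDATE sample SET deleted_at = ?, updated_at = ? WHERE id = ?",
        (ts, ts, sample_id),
    )


def active_samples():
    """Return (id, disease) for rows that have not been soft-deleted."""
    rows = conn.execute(
        "SELECT id, disease FROM sample WHERE deleted_at IS NULL ORDER BY id"
    )
    return rows.fetchall()


first = create_sample("breast")
second = create_sample("pancreas")
soft_delete(first)
print(active_samples())
```

The deleted row stays in the table, so anything that already references it keeps working; the cost is that every query for live rows has to filter on `deleted_at IS NULL`.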
Will fill in what I can, but we probably need the cognoma/cancer-data team to chime in. This generally uses text, unless I'm absolutely convinced that an enum or a more complex approach makes sense. The cancer-data team needs to fill some of these in (like age at diagnosis: I made it an integer, but I'm not sure it actually is one in the data).
Sample:
Here is the ultra stripped down version requested by @aelkner at the meetup last night.
Researchers will select which samples they want to include in the analysis. From a machine learning point of view, we mean which examples are relevant to the researcher. These samples will have various metadata. The GDC Data Portal (https://gdc-portal.nci.nih.gov/search/s) has a very nice interface for these metadata. Essentially, the facets on the left for "cases" are the same ones that we would expect to be relevant here.
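A facet-style selection like this can be sketched in a few lines. This is only an illustration of the filtering logic, not a proposal for the API: the `disease` and `gender` fields, the sample ids, and the function names are all made up. The assumed semantics are that values chosen within one facet are OR-ed together, while different facets are AND-ed.

```python
from collections import Counter


def filter_samples(samples, selections):
    """Keep samples matching every facet in selections.

    selections maps a field name to the set of allowed values.
    An empty selections dict keeps every sample.
    """
    return [
        s
        for s in samples
        if all(s.get(field) in allowed for field, allowed in selections.items())
    ]


def facet_counts(samples, field):
    """Count samples per value of one facet, for display next to each option."""
    return dict(Counter(s.get(field) for s in samples))


# Hypothetical metadata records.
samples = [
    {"sample_id": "S1", "disease": "breast", "gender": "female"},
    {"sample_id": "S2", "disease": "pancreas", "gender": "male"},
    {"sample_id": "S3", "disease": "pancreas", "gender": "female"},
    {"sample_id": "S4", "disease": "lung", "gender": "male"},
]

selected = filter_samples(
    samples, {"disease": {"pancreas", "lung"}, "gender": {"male"}}
)
print([s["sample_id"] for s in selected])
print(facet_counts(selected, "disease"))
```

In a real backend the same logic would more likely be a database query than an in-memory loop, but the AND-across-facets, OR-within-facet behavior is the part the interface depends on.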