Open gwaybio opened 8 years ago
To follow up - today, the data group should be focusing on extracting a table of the format:
sample_id | gene_mutation | DNA_change | protein_change |
---|---|---|---|
TCGA___ | TP53 | c.___ | p.Arg175Pro |
Note that the `gene_mutation` column should be Entrez GeneIDs.
In speaking with a cancer biologist and collaborator about cognoma, we realized that a huge win we could deliver relatively easily is classification performance (or classification scores) across different mutation types for an input gene. This would be extremely useful for a researcher interested in determining the pathogenicity of a particular mutation.
I believe that cognoma is an ideal way of approaching this problem. Typically, when genes mutate, the severity of the downstream impact varies from one mutation to another. For a particularly virulent mutation, a classifier trained to detect an inactivation signature may output a higher score for samples carrying it than for samples with a less virulent mutation.
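As a rough illustration of what "scores across mutation types" could look like, here is a minimal sketch that averages classifier scores per protein change. The function name `mean_score_by_mutation` and the input shape (pairs of protein change and score) are assumptions for illustration, not anything cognoma defines, and the scores in the example are made up.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_mutation(rows):
    """Average classifier scores per protein change.

    rows: iterable of (protein_change, classifier_score) pairs.
    This input shape is hypothetical; cognoma may store scores differently.
    """
    grouped = defaultdict(list)
    for protein_change, score in rows:
        grouped[protein_change].append(score)
    return {change: mean(scores) for change, scores in grouped.items()}


# Made-up scores, for illustration only.
example_rows = [
    ("p.Arg175Pro", 0.75),
    ("p.Arg175Pro", 0.25),
    ("p.Arg175His", 0.5),
]
summary = mean_score_by_mutation(example_rows)
```

A per-mutation summary like this is the kind of thing the frontend could plot for an input gene, assuming each sample can be linked to its mutation.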
In my eyes, this particular issue bypasses the machine learning group - they will still work with the previously defined `Y` matrices. However, for the backend to serve the frontend information from the database about each sample's mutation, so that the frontend can visualize the results, we need to know how to parse this information.
I looked briefly at the information embedded in the PANCAN mutation data - particularly the columns labeled `HGVSc` and `HGVSp`. These columns hold specific mutation calls in a standard notation. More information about these standards is provided by the HGVS website.
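To make the parsing question concrete, here is a minimal sketch that splits an `HGVSp`-style string such as `p.Arg175Pro` into reference residue, position, and alternate residue. It is an assumption-laden starting point, not a full HGVS parser: the name `parse_hgvsp` is hypothetical, only simple three-letter-code substitutions are handled, and anything else (frameshifts, deletions, synonymous calls, and so on) returns `None`. It also assumes an optional transcript or protein prefix may precede a colon.

```python
import re

# Three-letter amino acid codes used in HGVS protein notation, plus Ter for stop.
AA3 = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
    "Phe|Pro|Ser|Thr|Trp|Tyr|Val|Ter"
)

# Matches simple substitutions only, e.g. "p.Arg175Pro".
SUBSTITUTION = re.compile(rf"^p\.(?P<ref>{AA3})(?P<pos>\d+)(?P<alt>{AA3})$")


def parse_hgvsp(hgvsp):
    """Return (ref, position, alt) for a simple substitution, else None.

    Hypothetical helper: it deliberately ignores every other HGVS
    protein-level variant type.
    """
    # Drop an optional "identifier:" prefix (an assumption about the data).
    text = (hgvsp or "").split(":")[-1]
    match = SUBSTITUTION.match(text)
    if match is None:
        return None
    return match["ref"], int(match["pos"]), match["alt"]


parsed = parse_hgvsp("p.Arg175Pro")  # ("Arg", 175, "Pro")
```

Whether to store the parsed pieces or the raw string in the database is an open design choice; keeping the raw `HGVSp` value alongside any parsed fields avoids losing information for variant types the pattern does not cover.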