Extract detailed mutation information for TCGA samples

gwaybio commented 8 years ago

In speaking with a cancer biologist and collaborator about cognoma, we realized that a big win we could deliver relatively easily is classification performance (or classification scores) broken down by mutation type for an input gene. This would be extremely useful for a researcher interested in determining the pathogenicity of a particular mutation.

I believe that cognoma is well suited to this problem. When a gene is mutated, the severity of the downstream impact varies with the particular mutation. For a particularly virulent mutation, a classifier trained to detect an inactivation signature may output higher scores for the samples carrying it than for samples with a less virulent mutation.

In my eyes, this issue bypasses the machine learning group: they will still work with the previously defined Y matrices. However, for the backend to serve each sample's mutation information from the database so that the frontend can visualize the results, we need to know how to parse this information.

I looked briefly at the information embedded in the PANCAN mutation data, particularly the columns labeled HGVSc and HGVSp. These columns store specific mutation calls in standard notation. More information about these standards is provided by the HGVS website.
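
As a rough illustration of what parsing an HGVSp value could look like, here is a minimal sketch. It assumes the three-letter amino acid form (for example `p.Arg175Pro`) and handles only simple single-residue substitutions. The function name `parse_substitution` is hypothetical, and frameshifts, deletions, and other HGVS forms are deliberately left unparsed.

```python
import re

# Matches simple substitutions such as "p.Arg175Pro":
# reference residue, position, then the new residue (three-letter codes).
# This is an assumption-laden subset of HGVS protein notation, not a full parser.
SUBSTITUTION = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_substitution(hgvsp):
    """Return (ref_residue, position, alt_residue), or None if not a simple substitution."""
    match = SUBSTITUTION.match(hgvsp)
    if match is None:
        return None
    ref, position, alt = match.groups()
    return ref, int(position), alt


parsed = parse_substitution("p.Arg175Pro")
unparsed = parse_substitution("p.Gly245fs")  # frameshift: outside this sketch's scope
```

Anything the pattern does not recognize comes back as `None`, so unsupported notations can be counted and reviewed rather than silently misread.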

gwaybio commented 8 years ago

To follow up - today, the data group should be focusing on extracting a table of the format:

sample_id	gene_mutation	DNA_change	protein_change
TCGA___	TP53	c.___	p.Arg175Pro
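
A minimal sketch of how that table might be pulled out of a tab-delimited mutation file. The HGVSc and HGVSp column names come from the PANCAN data mentioned above; `Tumor_Sample_Barcode` and `Hugo_Symbol` are assumed input column names, the function name is hypothetical, and the two rows are made-up placeholders rather than real TCGA samples.

```python
import csv
import io

# Hypothetical excerpt standing in for the real mutation file.
# Column names other than HGVSc/HGVSp are assumptions; rows are placeholders.
MAF_TEXT = (
    "Tumor_Sample_Barcode\tHugo_Symbol\tHGVSc\tHGVSp\n"
    "SAMPLE-A\tTP53\tc.524G>C\tp.Arg175Pro\n"
    "SAMPLE-B\tTP53\t\t\n"
)


def extract_mutation_table(handle):
    """Build rows of sample_id, gene_mutation, DNA_change, protein_change."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for record in reader:
        # Skip records that carry neither a DNA-level nor a protein-level call.
        if not record["HGVSc"] and not record["HGVSp"]:
            continue
        rows.append(
            {
                "sample_id": record["Tumor_Sample_Barcode"],
                "gene_mutation": record["Hugo_Symbol"],
                "DNA_change": record["HGVSc"],
                "protein_change": record["HGVSp"],
            }
        )
    return rows


table = extract_mutation_table(io.StringIO(MAF_TEXT))
```

With a real file, the `io.StringIO` wrapper would be replaced by an opened file handle; the rest of the sketch stays the same.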

dhimmel commented 8 years ago

Note that the gene_mutation column should be Entrez GeneIDs.
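
If the extracted rows carry gene symbols, one way to switch them to Entrez GeneIDs is a lookup step like the sketch below. The `SYMBOL_TO_ENTREZ` dictionary and the function name are hypothetical; in practice the mapping would come from a maintained gene table rather than being hard-coded.

```python
# Tiny hand-written mapping for illustration only (TP53 -> 7157, KRAS -> 3845).
SYMBOL_TO_ENTREZ = {"TP53": 7157, "KRAS": 3845}


def use_entrez_ids(rows, mapping):
    """Replace gene symbols in gene_mutation with Entrez GeneIDs.

    Returns the converted rows plus the set of symbols missing from the mapping,
    so unmapped genes are reported instead of being dropped silently.
    """
    converted, unmapped = [], set()
    for row in rows:
        symbol = row["gene_mutation"]
        if symbol not in mapping:
            unmapped.add(symbol)
            continue
        converted.append({**row, "gene_mutation": mapping[symbol]})
    return converted, unmapped


example_rows = [
    {"sample_id": "SAMPLE-A", "gene_mutation": "TP53", "protein_change": "p.Arg175Pro"},
    {"sample_id": "SAMPLE-B", "gene_mutation": "NOT_A_GENE", "protein_change": ""},
]
converted_rows, unmapped_symbols = use_entrez_ids(example_rows, SYMBOL_TO_ENTREZ)
```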

cognoma / cancer-data

Extract detailed mutation information for TCGA samples #15