cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Add exploratory analyses of mutation data #22

Closed dhimmel closed 7 years ago

dhimmel commented 7 years ago

This pull request is based on a preliminary notebook we created at the 2016-08-23 Cognoma Meetup. Tagging @mike1906, @stephenshank, @drolejoel, and @linzho, who were part of this group (we'd love your feedback).

Specifically, I'd like feedback on interesting cancer genes where we expect mutation status to segregate with disease. For example, the present notebook shows the enrichment of VHL mutations in kidney clear cell carcinoma.

gwaybio commented 7 years ago

BRAF should segregate to melanoma and subsets of lung cancer

BRAFV600E should be a good test for the machine learning group once we get the columns mentioned in #16

We can also visualize BRCA1 and BRCA2, which should largely segregate into breast cancer and subsets of ovarian, cervical, and uterine cancers.

gwaybio commented 7 years ago

Can you also add ALK? It should segregate into subsets of lung cancer. ALK is interesting because it is usually activated by chromosomal rearrangements, and I suspect a gene expression signature for ALK activation could be interesting.

gwaybio commented 7 years ago

could possibly incorporate COSMIC here too

linzho commented 7 years ago

You can also look at MEN1 and RET, genes which are associated with a lot of neuroendocrine tumors (pancreatic, pituitary, parathyroid, medullary thyroid, pheochromocytoma).

Are you interested in genes associated with cancers in general, or genes where we might expect that the majority of cancers segregate with a single gene?

dhimmel commented 7 years ago

> Are you interested in genes associated with cancers in general, or genes where we might expect that the majority of cancers segregate with a single gene?

@linzho both. Since this is an exploratory analysis, I'm just looking to look!

dhimmel commented 7 years ago

@linzho & @gwaygenomics thanks for your suggestions. I added them to the heatmap in 29c926ab3de9e8a7b95b79ac582e295ffc5f41f3, which now looks like this:

[Image: heatmap]

I also scaled the mutation rates for each gene by the max mutation rate. Note that there is still the outstanding issue that some diseases harbor more mutations (see row-wise bands above & https://github.com/cognoma/machine-learning/issues/8).
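For readers following along, the per-gene scaling described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `scale_by_gene_max`, the disease and gene labels, and the numbers are all made up, and it assumes the scaling divides each gene's rate by that gene's maximum rate across diseases.

```python
# Hypothetical mutation rates (fraction of samples mutated) per disease and gene.
# The labels and numbers are invented for illustration; they are not TCGA results.
rates = {
    "disease_A": {"GENE1": 0.50, "GENE2": 0.02},
    "disease_B": {"GENE1": 0.05, "GENE2": 0.10},
    "disease_C": {"GENE1": 0.00, "GENE2": 0.04},
}


def scale_by_gene_max(rates):
    """Divide each gene's rate by that gene's maximum rate across diseases."""
    genes = sorted({gene for by_gene in rates.values() for gene in by_gene})
    gene_max = {
        gene: max(by_gene.get(gene, 0.0) for by_gene in rates.values())
        for gene in genes
    }
    scaled = {}
    for disease, by_gene in rates.items():
        scaled[disease] = {
            # A gene that is never mutated keeps a scaled rate of 0.
            gene: (by_gene.get(gene, 0.0) / gene_max[gene] if gene_max[gene] > 0 else 0.0)
            for gene in genes
        }
    return scaled


scaled = scale_by_gene_max(rates)
for disease, by_gene in scaled.items():
    print(disease, by_gene)
```

After scaling, every gene reaches 1.0 in the disease where it is mutated most often, so genes with very different baseline rates share one color scale. With a pandas DataFrame that has diseases as rows and genes as columns, the same operation can be written as `rates / rates.max()`. Note that this does not address the row-wise bands, since it normalizes per gene rather than per disease.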

gwaybio commented 7 years ago

Would it be useful to add functionality to the script? If the final output is the mutation-by-tissue heatmap, could you add an argparse argument? Then the above graph would be generated like:

python scripts/3.explore-mutations.py --gene-list "BRCA2,ALK,CD274,MEN1,VHL,RET,TP53,BRCA1"

just a thought
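A rough sketch of what such an argument could look like is below. Only the `--gene-list` flag and the comma-separated format come from the command above; the helper names `gene_list` and `build_parser` are hypothetical, and nothing like this exists in the exported script.

```python
import argparse


def gene_list(value):
    """Split a comma-separated string such as "BRCA1,TP53" into a list of gene symbols."""
    genes = [gene.strip() for gene in value.split(",") if gene.strip()]
    if not genes:
        raise argparse.ArgumentTypeError("expected at least one gene symbol")
    return genes


def build_parser():
    """Build a parser with the single --gene-list option proposed in this thread."""
    parser = argparse.ArgumentParser(description="Plot a mutation-by-tissue heatmap")
    parser.add_argument(
        "--gene-list",
        type=gene_list,
        required=True,
        help="comma-separated gene symbols, e.g. BRCA2,ALK,VHL",
    )
    return parser


# Parse the example invocation from the comment above (passed explicitly
# instead of being read from sys.argv, so the sketch runs anywhere).
args = build_parser().parse_args(
    ["--gene-list", "BRCA2,ALK,CD274,MEN1,VHL,RET,TP53,BRCA1"]
)
print(args.gene_list)
```

Using a `type=` callable keeps the splitting and validation in one place, so the rest of the script would receive a plain list of symbols.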

dhimmel commented 7 years ago

@gwaygenomics I have a slightly different philosophy here.

scripts/3.explore-mutations.py is an auto-exported script version of the notebook for diff viewing. So all code changes should be done to the notebook. Passing args to the notebook doesn't make sense because you should be able to use notebooks interactively.

So one option is to create a python module, e.g. heatmap.py which has a function that 3.explore-mutations.ipynb would call and has a __main__ that could enable script execution. However, I don't really see a major benefit that justifies the added complexity. If you want to add more genes, you can just open the notebook and add genes to the dictionary.
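To make the module option concrete, here is a minimal sketch of the layout, assuming a hypothetical `heatmap.py`. The function `format_gene_table`, the constants, and the placeholder data are invented for illustration, and a plain-text table stands in for the real heatmap so the sketch has no plotting dependency.

```python
"""heatmap.py: hypothetical module layout; every name here is illustrative."""
import sys

# Placeholder defaults standing in for the notebook's gene dictionary.
DEFAULT_GENES = ["VHL", "TP53"]

# Made-up rates so the module can run on its own; not real data.
EXAMPLE_RATES = {
    "disease_A": {"VHL": 0.50, "TP53": 0.02},
    "disease_B": {"VHL": 0.01, "TP53": 0.40},
}


def format_gene_table(rates, genes):
    """Return a tab-separated table: one row per disease, one column per gene."""
    rows = ["disease\t" + "\t".join(genes)]
    for disease in sorted(rates):
        cells = [f"{rates[disease].get(gene, 0.0):.2f}" for gene in genes]
        rows.append(disease + "\t" + "\t".join(cells))
    return "\n".join(rows)


def main(argv):
    """Script entry point: an optional first argument is a comma-separated gene list."""
    genes = argv[0].split(",") if argv else DEFAULT_GENES
    print(format_gene_table(EXAMPLE_RATES, genes))


if __name__ == "__main__":
    # Runs only when executed as a script, not when imported by the notebook.
    main(sys.argv[1:])
```

Under this layout the notebook would import and call the function directly (for example `from heatmap import format_gene_table`), while the `__main__` guard enables script execution. It also shows the added complexity mentioned above: a second file to maintain alongside the notebook and its exported script.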

IMO, notebooks are better than scripts with arguments for agile data science.

gwaybio commented 7 years ago

got it - i agree for this script.

Although I do think it will be important to move towards this kind of functionality when thinking about how a user will visualize input genes and input tissues (i.e. the frontend/cancer data discussion yesterday, see cognoma/frontend#12).

LGTM :+1: