cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Precomputing a sample × mutation-in-gene-set matrix #21

Open stephenshank opened 7 years ago

stephenshank commented 7 years ago

At the 8/23 meetup, @dhimmel expressed interest in incorporating metabolic pathway information by combining the dataset that we have and the hetnet database that was described at the first meetup. The hetnet has information on what pathways the mutated genes in the current dataset participate in.

I figured I'd open this issue to get the conversation started. Initially, I am wondering what this dataset would look like, and do we envision it being created from what we already have? And how much tweaking will the classifier of the machine learning group (for instance, that provided by @gwaygenomics) require?

gwaybio commented 7 years ago

I think this would be the next logical step for the cancer data group - and like @stephenshank mentioned, would require some communication with the ML group.

I did some work on this issue today and am shooting to file a pull request in the ML group tomorrow afternoon.

> I am wondering what this dataset would look like, and do we envision it being created from what we already have?

From my perspective, you can think of this matrix as very similar to the gene-based mutation matrix, except that instead of gene names as columns, there will be pathways.

> And how much tweaking will the classifier of the machine learning group (for instance, that provided by @gwaygenomics) require?

The tweaking required for the actual classifier is minimal. The algorithm will simply take in a Y matrix of {0,1}, where 1 means a mutation in any gene in the pathway. The visualizations of input data and of classifier performance on a per-tissue basis are where this approach is likely to make the most difference.
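A minimal sketch of the {0,1} encoding described above, assuming mutations are available as a set of mutated genes per sample and each pathway as a set of genes. The function name `sample_by_gene_set` and the toy samples, genes, and pathways are hypothetical and only for illustration; this is not the code from the pull request.

```python
def sample_by_gene_set(mutations, gene_sets):
    """Build a sample x gene-set indicator matrix.

    mutations: dict mapping sample id -> set of mutated gene ids
    gene_sets: dict mapping gene-set (pathway) name -> set of gene ids
    Returns a dict of dicts: matrix[sample][gene_set] is 1 if the sample
    has a mutation in any gene of the set, else 0.
    """
    return {
        sample: {
            name: int(bool(mutated & genes))  # non-empty intersection -> 1
            for name, genes in gene_sets.items()
        }
        for sample, mutated in mutations.items()
    }


# Toy inputs (made up for illustration)
mutations = {
    "sample_a": {"TP53", "KRAS"},
    "sample_b": {"BRCA1"},
    "sample_c": set(),
}
gene_sets = {
    "pathway_1": {"TP53", "MDM2"},
    "pathway_2": {"BRCA1", "BRCA2", "KRAS"},
}

matrix = sample_by_gene_set(mutations, gene_sets)
```

Each column of `matrix` could then serve as the Y for one classifier, in the same way a single gene's mutation status does now.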

cgreene commented 7 years ago

I think that the long-term aim of this part is to do queries to the live hetnet database to return a gene set. This way, whenever the hetnets get updated, we automatically get the improved versions. It may be best to start there (queries against the live hetnets) instead of a downloaded version.

dhimmel commented 7 years ago

> the long-term aim of this part is to do queries to the live hetnet database

Agreed, but I think there is an R&D argument for generating a sample by pathway matrix. For example, we will want to know the distribution of positive prevalence across all pathways.
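For instance, positive prevalence per pathway could be computed along these lines. This is a rough sketch that assumes the matrix is held as a dict of dicts of 0/1 values; `positive_prevalence` and the toy matrix are hypothetical names and data, not anything in the repository.

```python
def positive_prevalence(matrix):
    """Fraction of samples that are positive (1) for each pathway.

    matrix: dict mapping sample id -> dict mapping pathway name -> 0 or 1
    Returns a dict mapping pathway name -> prevalence in [0, 1].
    """
    n_samples = len(matrix)
    totals = {}
    for row in matrix.values():
        for pathway, value in row.items():
            totals[pathway] = totals.get(pathway, 0) + value
    return {pathway: total / n_samples for pathway, total in totals.items()}


# Toy sample x pathway matrix (made up for illustration)
toy_matrix = {
    "sample_a": {"pathway_1": 1, "pathway_2": 1, "pathway_3": 0},
    "sample_b": {"pathway_1": 0, "pathway_2": 1, "pathway_3": 0},
    "sample_c": {"pathway_1": 0, "pathway_2": 1, "pathway_3": 0},
    "sample_d": {"pathway_1": 0, "pathway_2": 0, "pathway_3": 0},
}

prevalence = positive_prevalence(toy_matrix)

# Pathways sorted from most to least frequently positive
ranked = sorted(prevalence, key=prevalence.get, reverse=True)
```

Pathways with a prevalence near 0 or 1 give a classifier very few examples of one class, so a summary like this would help decide which gene sets are worth modelling.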

@stephenshank, if you're still interested in this task, I recommend it. It will be convenient to have a cached mutation matrix for gene sets rather than genes.

You can still work with Hetionet Cypher queries to construct this dataset, as @gwaygenomics started in https://github.com/cognoma/machine-learning/pull/39.

dhimmel commented 7 years ago

Also interesting is how often Hetionet returns genes that aren't in our mutation dataset.
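One way to check this, sketched under the assumption that each gene set and the mutation matrix's columns are available as sets of gene identifiers. The name `missing_gene_report` and the toy gene sets are hypothetical.

```python
def missing_gene_report(gene_sets, dataset_genes):
    """Count, per gene set, the genes absent from the mutation dataset.

    gene_sets: dict mapping gene-set name -> set of gene ids
    dataset_genes: set of gene ids that appear as columns in the dataset
    Returns a dict mapping gene-set name -> (n_missing, n_total).
    """
    return {
        name: (len(genes - dataset_genes), len(genes))
        for name, genes in gene_sets.items()
    }


# Toy inputs (made up for illustration)
dataset_genes = {"TP53", "KRAS", "BRCA1"}
gene_sets = {
    "pathway_1": {"TP53", "MDM2"},
    "pathway_2": {"BRCA1", "KRAS"},
    "pathway_3": {"GENE_X", "GENE_Y"},
}

report = missing_gene_report(gene_sets, dataset_genes)

# Gene sets with at least one gene that the dataset does not cover
affected = sorted(name for name, (n_missing, _) in report.items() if n_missing)
```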

stephenshank commented 7 years ago

@dhimmel I believe I'm ready to submit a PR for this, but I have one quick question. The resulting sample-pathway matrix is about 26 MB uncompressed. I'm not sure how big is too big to track, or whether we want to track compressed files. Any suggestions would be most appreciated.

dhimmel commented 7 years ago

Can you bz2 compress the file so it's smaller? Our data/.gitignore file will then make sure the dataset isn't tracked.
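Something like the following would write the matrix as a bz2-compressed TSV using only the standard library. It is a sketch: the function name `write_matrix_bz2`, the file name, and the column layout are assumptions for illustration, not the repository's actual conventions.

```python
import bz2
import csv
import os
import tempfile


def write_matrix_bz2(path, samples, pathways, rows):
    """Write a sample x pathway 0/1 matrix as a bz2-compressed TSV.

    samples: list of sample ids (one per row)
    pathways: list of pathway names (one per column)
    rows: list of lists of 0/1 values, aligned with samples and pathways
    """
    with bz2.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["sample_id", *pathways])
        for sample, row in zip(samples, rows):
            writer.writerow([sample, *row])


# Toy data written to a temporary directory, then read back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample-by-pathway.tsv.bz2")  # hypothetical name
    write_matrix_bz2(
        path,
        samples=["sample_a", "sample_b"],
        pathways=["pathway_1", "pathway_2"],
        rows=[[1, 0], [0, 1]],
    )
    with bz2.open(path, "rt", newline="") as handle:
        round_trip = list(csv.reader(handle, delimiter="\t"))
```

If the matrix is already a pandas DataFrame, `to_csv` with `compression='bz2'` should produce an equivalent file.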

stephenshank commented 7 years ago

See #25.