cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Map mutation gene symbols to Entrez IDs #12

Closed clairemcleod closed 8 years ago

clairemcleod commented 8 years ago

This pull request addresses issues #4 and #6. It adds to 2.TCGA-process.ipynb and includes a mapping of mutation gene symbols to Entrez IDs as part of the processing workflow.

The mapping is conducted in two stages. First, gene symbols are mapped based on the combination of chromosome # and gene symbol of record. This maps ~95% of observed mutations. Next, yet-unmapped gene symbols are mapped based on the combination of chromosome # and alternate gene symbols. Following the second mapping, ~98% of observations are mapped. The remaining ~2% were either ambiguous mappings or un-mappable; this 2% is currently discarded before writing the data out.
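For readers following along, here is a minimal sketch of the two-stage mapping described above. It is not the notebook's actual code: the toy tables, the column names (`symbol`, `chromosome`, `entrez_gene_id`), and the unmappable symbol `FAKE1` are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the mutation table (hypothetical column names).
mutations = pd.DataFrame({
    "symbol": ["TP53", "P53", "KRAS", "FAKE1"],  # FAKE1 is a made-up, unmappable symbol
    "chromosome": ["17", "17", "12", "1"],
})

# Toy lookup tables: symbols of record and alternate symbols.
primary = pd.DataFrame({
    "symbol": ["TP53", "KRAS"],
    "chromosome": ["17", "12"],
    "entrez_gene_id": [7157, 3845],
})
synonyms = pd.DataFrame({
    "symbol": ["P53", "KRAS2"],
    "chromosome": ["17", "12"],
    "entrez_gene_id": [7157, 3845],
})

keys = ["symbol", "chromosome"]

# Stage 1: map on chromosome plus the gene symbol of record.
stage1 = mutations.merge(primary, on=keys, how="left")
mapped1 = stage1[stage1["entrez_gene_id"].notna()]

# Stage 2: retry the yet-unmapped rows against alternate symbols.
unmapped = stage1.loc[stage1["entrez_gene_id"].isna(), keys]
mapped2 = unmapped.merge(synonyms, on=keys, how="inner")

# Rows that fail both stages are discarded.
mapped = pd.concat([mapped1, mapped2], ignore_index=True)
mapped["entrez_gene_id"] = mapped["entrez_gene_id"].astype(int)
print(mapped)
```

In this toy data, three of the four rows end up mapped and the `FAKE1` row is dropped, mirroring the discard step described in the comment.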

dhimmel commented 8 years ago

Can you export the notebooks to scripts using:

jupyter nbconvert --to=script --FilesWriter.build_directory=scripts *.ipynb

This will make it easier to review the changes.

dhimmel commented 8 years ago

Note that the email configured in your git doesn't match your GitHub account email, so your commits aren't attributed to your profile. See more info here.

clairemcleod commented 8 years ago

@dhimmel Here is the exported script. Good catch with the email, and thanks for pointing it out. I think it is rectified now?

dhimmel commented 8 years ago

Yep, your new commits are associated with your GitHub account.

dhimmel commented 8 years ago

General comments

Great work with this pull request!

I think you should separate the entrez gene processing into its own notebook. For example, 2-entrez-gene-extract.ipynb. This notebook should export one file for now (we will probably have it export more in the future) named entrez-gene-symbol-map.tsv or similar. It should have three columns: entrez_gene_id, symbol, chromosome. There should only be rows for unambiguous mappings. For example, run drop_duplicates with keep=False.
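A minimal sketch of the `keep=False` suggestion, assuming that "ambiguous" means the same symbol and chromosome pair pointing to more than one gene. The symbol `DUP1` and its two gene IDs are placeholders invented for the example.

```python
import pandas as pd

# Toy symbol map; DUP1 and its IDs (111, 222) are placeholders, not real mappings.
symbol_map = pd.DataFrame({
    "entrez_gene_id": [7157, 3845, 111, 222],
    "symbol": ["TP53", "KRAS", "DUP1", "DUP1"],
    "chromosome": ["17", "12", "1", "1"],
})

# keep=False drops every row of a duplicated (symbol, chromosome) pair,
# so only unambiguous mappings remain.
unambiguous = symbol_map.drop_duplicates(subset=["symbol", "chromosome"], keep=False)

# Render as tab-separated text; pass a path such as
# "entrez-gene-symbol-map.tsv" as the first argument to write a file instead.
tsv_text = unambiguous.to_csv(sep="\t", index=False)
print(tsv_text)
```

With the default `keep='first'`, one of the two `DUP1` rows would survive and silently pick a gene, which is why `keep=False` is the safer choice for this file.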

In 3.TCGA-process.ipynb, we could then use the merge command with how='inner' (as you're doing), but with no need to combine symbol and chromosome into a single column.
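As a rough illustration, pandas can merge on a list of columns directly, so no combined key column is needed. The table contents and the `sample_id` column below are assumptions for the sketch, and `NOPE` is a made-up symbol with no mapping.

```python
import pandas as pd

# Hypothetical mutation rows; NOPE has no entry in the symbol map.
mutations = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "symbol": ["TP53", "KRAS", "NOPE"],
    "chromosome": ["17", "12", "X"],
})
symbol_map = pd.DataFrame({
    "entrez_gene_id": [7157, 3845],
    "symbol": ["TP53", "KRAS"],
    "chromosome": ["17", "12"],
})

# Merge on both columns at once; how="inner" keeps only rows with a mapping.
merged = mutations.merge(symbol_map, on=["symbol", "chromosome"], how="inner")
print(merged)
```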

I also think we may want to consider the following approach:

  1. Construct entrez_gene_id, symbol, chromosome dataframe from only primary symbols.
  2. Construct entrez_gene_id, symbol, chromosome dataframe from only synonyms and run drop_duplicates with keep=False.
  3. Concatenate the dataframes from 1 and 2 and drop_duplicates with keep='first'.

This approach gives primacy to official symbols (i.e. we don't blacklist official symbols because there's a colliding synonym on the same chromosome), but we still obliterate colliding synonyms. Does that make sense?
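The three steps above can be sketched as follows. This assumes duplicates are judged on the (symbol, chromosome) pair, which the comment implies but does not state outright. `GENEA`, `ALT1`, and the IDs 1001 to 1004 are placeholders made up to show the two kinds of collision.

```python
import pandas as pd

cols = ["entrez_gene_id", "symbol", "chromosome"]
keys = ["symbol", "chromosome"]

# Step 1: dataframe built from primary (official) symbols only.
primary = pd.DataFrame(
    [[7157, "TP53", "17"], [3845, "KRAS", "12"], [1001, "GENEA", "1"]],
    columns=cols,
)

# Step 2: dataframe built from synonyms only, dropping colliding synonyms.
synonyms = pd.DataFrame(
    [
        [7157, "P53", "17"],
        [1002, "GENEA", "1"],  # collides with an official symbol
        [1003, "ALT1", "2"],   # collides with another synonym
        [1004, "ALT1", "2"],
    ],
    columns=cols,
)
synonyms = synonyms.drop_duplicates(subset=keys, keep=False)

# Step 3: concatenate with primary rows first, so keep="first"
# lets the official symbol win over a colliding synonym.
combined = pd.concat([primary, synonyms], ignore_index=True)
combined = combined.drop_duplicates(subset=keys, keep="first")
print(combined)
```

In the toy data, `GENEA` keeps its official mapping (1001) and `ALT1` disappears entirely. The result depends on the primary dataframe coming first in the concatenation.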

dhimmel commented 8 years ago

Make sure to subset for tax_id = 9606 (Homo sapiens) from the get go. It's a real gotcha with the Homo_sapiens.gene_info.gz file.
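A minimal sketch of the tax_id filter, using a tiny in-memory table rather than the real download. The column names are simplified assumptions (the real file has more columns, and its header may be formatted differently), and the last row is a made-up record with a non-9606 tax_id.

```python
import io

import pandas as pd

# Tiny tab-separated stand-in for a gene_info table. Columns are abbreviated,
# and the final row is a made-up record carrying a different tax_id.
toy_gene_info = (
    "tax_id\tGeneID\tSymbol\tSynonyms\tchromosome\n"
    "9606\t7157\tTP53\tP53|LFS1\t17\n"
    "9606\t3845\tKRAS\tKRAS2|RASK2\t12\n"
    "63221\t100000001\tEXAMPLE\t-\t1\n"
)

# Read chromosome as text so values like "X" and "17" share one dtype.
gene_df = pd.read_csv(io.StringIO(toy_gene_info), sep="\t", dtype={"chromosome": str})

# Subset to Homo sapiens before doing anything else.
human_df = gene_df[gene_df["tax_id"] == 9606]
print(human_df)
```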

clairemcleod commented 8 years ago

@dhimmel These are all great points - thanks for the feedback. Would it be best to cancel/close this pull request and resubmit once the changes are made, or to keep the pull request open while I make the changes?

dhimmel commented 8 years ago

Would it be best to cancel/close this pull request and resubmit once the changes are made, or to keep the pull request open while I make the changes?

I suggest keeping the pull request open. Any commits you make to your master branch will get added to this pull request.

clairemcleod commented 8 years ago

@dhimmel Sorry for the delay - I think I've addressed all of these points but let me know if I missed any or new ones have popped up.

edit: also tagging @Inquisitive-Geek

dhimmel commented 8 years ago

@clairemcleod awesome.

@gwaygenomics would you like to spend ~15 minutes with the cancer-data group tonight reviewing this pull request?

dhimmel commented 8 years ago

Looks like there were only a few small comments and then this will be ready to merge.

I may be AFK, so @gwaygenomics you can do the merge when ready. I recommend a squash commit here.

gwaybio commented 8 years ago

@clairemcleod @dhimmel - Looks great to me!