choderalab / TargetExplorer

Database framework with RESTful API for aggregating genomic, structural, and functional data for target protein families.
GNU General Public License v2.0

Robust way to handle Unknown genes for cBioPortal #27

Open steven-albanese opened 7 years ago

steven-albanese commented 7 years ago

I've run into the following error for a handful of genes:

/Users/albaness/miniconda3/envs/py27/lib/python2.7/site-packages/flask/exthook.py:71: ExtDeprecationWarning: Importing flask.ext.sqlalchemy is deprecated, use flask_sqlalchemy instead.
  .format(x=modname), ExtDeprecationWarning
/Users/albaness/miniconda3/envs/py27/lib/python2.7/site-packages/flask/exthook.py:71: ExtDeprecationWarning: Importing flask.ext.sqlalchemy is deprecated, use flask_sqlalchemy instead.
  .format(x=modname), ExtDeprecationWarning
/Users/albaness/miniconda3/envs/py27/lib/python2.7/site-packages/flask/exthook.py:71: ExtDeprecationWarning: Importing flask.ext.sqlalchemy._compat is deprecated, use flask_sqlalchemy._compat instead.
  .format(x=modname), ExtDeprecationWarning
Current crawl number: 0
Retrieving new cBioPortal data file from server...
Retrieving ExtendedMutation data from cBioPortal for study paac_jhu_2014...
# Warning:  Unknown gene:  COQ8A
Traceback (most recent call last):
  File "/Users/albaness/miniconda3/envs/py27/bin/DoraGathercBioPortal.py", line 4, in <module>
    __import__('pkg_resources').run_script('targetexplorer==0.2', 'DoraGathercBioPortal.py')
  File "/Users/albaness/miniconda3/envs/py27/lib/python2.7/site-packages/setuptools-23.0.0-py2.7.egg/pkg_resources/__init__.py", line 719, in run_script
  File "/Users/albaness/miniconda3/envs/py27/lib/python2.7/site-packages/setuptools-23.0.0-py2.7.egg/pkg_resources/__init__.py", line 1504, in run_script
  File "/Users/albaness/miniconda3/envs/py27/lib/python2.7/site-packages/targetexplorer-0.2-py2.7.egg/EGG-INFO/scripts/DoraGathercBioPortal.py", line 23, in <module>
    commit_to_db=not args.nocommit
  File "/Users/albaness/miniconda3/envs/py27/lib/python2.7/site-packages/targetexplorer-0.2-py2.7.egg/targetexplorer/cbioportal.py", line 52, in __init__
    self.get_mutation_data_as_xml()
  File "/Users/albaness/miniconda3/envs/py27/lib/python2.7/site-packages/targetexplorer-0.2-py2.7.egg/targetexplorer/cbioportal.py", line 82, in get_mutation_data_as_xml
    write_extended_mutation_txt_files=self.write_extended_mutation_txt_files,
  File "/Users/albaness/miniconda3/envs/py27/lib/python2.7/site-packages/targetexplorer-0.2-py2.7.egg/targetexplorer/cbioportal.py", line 364, in retrieve_mutants_xml
    raise Exception
Exception

I know this can be handled by adding the unknown genes to the manual override file, but is there a better way to handle these cases?

jchodera commented 7 years ago

Can we just catch those exceptions in a try...except block?

try:
    pass  # do stuff
except Exception as e:
    pass  # log the exception

Ideally, we'd be able to distinguish errors we should just log and move past (the gene ID is not found) from more serious ones. Perhaps retrieve_mutants_xml should return None, or get_mutation_data_as_xml should check whether the gene exists and handle this gracefully? Or maybe we can check whether the gene exists earlier in the process, before retrieving mutations?
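One way to separate "log and move on" errors from serious ones is a dedicated exception type. The following is only a sketch of that idea: UnknownGeneError, gather_mutations, and fake_fetch are hypothetical names that do not exist in targetexplorer, and the fetch callable stands in for whatever actually queries cBioPortal.

```python
import logging

logger = logging.getLogger(__name__)


class UnknownGeneError(Exception):
    """Hypothetical exception for a gene symbol the portal does not recognise."""

    def __init__(self, gene_id):
        super(UnknownGeneError, self).__init__('Unknown gene: {0}'.format(gene_id))
        self.gene_id = gene_id


def gather_mutations(gene_ids, fetch):
    """Call fetch(gene_id) for each gene, logging and skipping unknown genes.

    Returns (results, skipped). Any exception other than UnknownGeneError
    is treated as serious and propagates to the caller.
    """
    results = {}
    skipped = []
    for gene_id in gene_ids:
        try:
            results[gene_id] = fetch(gene_id)
        except UnknownGeneError as e:
            logger.warning('%s; skipping', e)
            skipped.append(gene_id)
    return results, skipped


def fake_fetch(gene_id):
    """Stand-in for a real query (assumption): pretends COQ8A is unknown."""
    if gene_id == 'COQ8A':
        raise UnknownGeneError(gene_id)
    return ['mutation data for ' + gene_id]


results, skipped = gather_mutations(['ABL1', 'COQ8A', 'SRC'], fake_fetch)
```

Here skipped collects the unknown genes for later review, while a network failure or parsing bug inside fetch would still stop the run.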

steven-albanese commented 7 years ago

OK, for now I've just changed it in PR #26 to print the name of the gene and continue instead of raising an exception. I think this is fine for the moment, but we should probably come up with a better system for logging these errors after the grant deadline.

steven-albanese commented 7 years ago

I was able to figure out which gene was causing the problem, so I added it to the manual overrides. Unfortunately, that makes the URL too long and I get the following error:

urllib2.HTTPError: HTTP Error 414: Request-URI Too Large

I'm trying to figure out the best way to fix this.

steven-albanese commented 7 years ago

Looks like this function is the problem:

def retrieve_extended_mutation_datatxt(case_set_id,
                                       genetic_profile_id,
                                       gene_ids,
                                       portal_version='public-portal',
                                       write_to_filepath=False
                                       ):
    """
    Queries cBioPortal for "ExtendedMutation" format data, given a list of cBioPortal cancer studies and a list of HGNC Approved gene Symbols.
    Returns the data file as a list of text lines.

    Parameters
    ----------
    portal_version: str
        'public-portal': use only public cBioPortal data
        'private': use private cBioPortal data
    write_to_filepath: str (or False)
    """
    gene_ids_string = '+'.join(gene_ids)
    mutation_url = 'http://www.cbioportal.org/{0}/' \
                   'webservice.do' \
                   '?cmd=getMutationData' \
                   '&case_set_id={1}' \
                   '&genetic_profile_id={2}' \
                   '&gene_list={3}'.format(
                       portal_version,
                       case_set_id,
                       genetic_profile_id,
                       gene_ids_string
                   )
    response = urllib2.urlopen(mutation_url)
    page = response.read(1000000000)
    if write_to_filepath:
        with open(write_to_filepath, 'w') as ofile:
            ofile.write(page)
    lines = page.splitlines()
    return lines

When working with the whole kinome, the URL is too long. I'm not very familiar with this, but from what I've seen online, there is a limit on URL length here. I've seen a few different ways to correct this, but I'm not familiar enough with Flask to know which one is appropriate in our case.

steven-albanese commented 7 years ago

This issue has been addressed by adding try...except blocks and by chunking the list of genes when requesting information.
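For readers who want a picture of the chunking idea, here is a minimal sketch. It is not the code that was merged: chunk_gene_ids and retrieve_in_chunks are hypothetical helpers, the chunk size of 100 is an arbitrary assumption, and retrieve stands in for a call such as retrieve_extended_mutation_datatxt with the other arguments already bound.

```python
def chunk_gene_ids(gene_ids, chunk_size=100):
    """Yield successive slices of gene_ids, each at most chunk_size long."""
    if chunk_size < 1:
        raise ValueError('chunk_size must be at least 1')
    for start in range(0, len(gene_ids), chunk_size):
        yield gene_ids[start:start + chunk_size]


def retrieve_in_chunks(gene_ids, retrieve, chunk_size=100):
    """Call retrieve(chunk) once per chunk and concatenate the returned lines.

    Each request then carries only chunk_size gene symbols, which keeps
    the URL short. Header lines repeated in each response are NOT
    deduplicated here.
    """
    lines = []
    for chunk in chunk_gene_ids(gene_ids, chunk_size):
        lines.extend(retrieve(chunk))
    return lines


# Usage with a stand-in retrieve function that just echoes the gene list.
demo_lines = retrieve_in_chunks(
    ['ABL1', 'SRC', 'EGFR'],
    lambda chunk: ['+'.join(chunk)],
    chunk_size=2,
)
```

A suitable chunk size depends on the server's URI limit and the length of the gene symbols, so it would need to be tuned against the real portal.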

A more elegant approach could still be proposed for handling the header information and the Unknown gene warning discussed in PR #26.