chembl / chembl_webresource_client

Official Python client for accessing ChEMBL API
https://www.ebi.ac.uk/chembl/api/data/docs

Setting format doesn't work for batch molecule.get() #68

Closed fredrikw closed 4 years ago

fredrikw commented 4 years ago

Fetching multiple molecules with molecule.get(['CHEMBL6498', 'CHEMBL6499', 'CHEMBL6505']) after calling set_format('sdf') raises an exception. With the format set to JSON,

from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
molecule.set_format('json')
records1 = molecule.get(['CHEMBL6498', 'CHEMBL6499', 'CHEMBL6505'])
records2 = molecule.get(['XSQLHVPPXBBUPP-UHFFFAOYSA-N', 'JXHVRXRRSSBGPY-UHFFFAOYSA-N', 'TUHYVXGNMOGVMR-GASGPIRDSA-N'])
records3 = molecule.get(['CNC(=O)c1ccc(cc1)N(CC#C)Cc2ccc3nc(C)nc(O)c3c2',
            'Cc1cc2SC(C)(C)CC(C)(C)c2cc1\\N=C(/S)\\Nc3ccc(cc3)S(=O)(=O)N',
            'CC(C)C[C@H](NC(=O)[C@@H](NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H]3CCCN3C(=O)C(CCCCN)CCCCN)C(C)(C)C)C(=O)O'])
records1 == records2 == records3

will return True, while

molecule.set_format('sdf')
records1 = molecule.get(['CHEMBL6498', 'CHEMBL6499', 'CHEMBL6505'])
records2 = molecule.get(['XSQLHVPPXBBUPP-UHFFFAOYSA-N', 'JXHVRXRRSSBGPY-UHFFFAOYSA-N', 'TUHYVXGNMOGVMR-GASGPIRDSA-N'])
records3 = molecule.get(['CNC(=O)c1ccc(cc1)N(CC#C)Cc2ccc3nc(C)nc(O)c3c2',
            'Cc1cc2SC(C)(C)CC(C)(C)c2cc1\\N=C(/S)\\Nc3ccc(cc3)S(=O)(=O)N',
            'CC(C)C[C@H](NC(=O)[C@@H](NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H]3CCCN3C(=O)C(CCCCN)CCCCN)C(C)(C)C)C(=O)O'])
records1 == records2 == records3

will raise

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-f90ba5a6c0e6> in <module>
      1 molecule.set_format('sdf')
----> 2 records1 = molecule.get(['CHEMBL6498', 'CHEMBL6499', 'CHEMBL6505'])
      3 records2 = molecule.get(['XSQLHVPPXBBUPP-UHFFFAOYSA-N', 'JXHVRXRRSSBGPY-UHFFFAOYSA-N', 'TUHYVXGNMOGVMR-GASGPIRDSA-N'])
      4 records3 = molecule.get(['CNC(=O)c1ccc(cc1)N(CC#C)Cc2ccc3nc(C)nc(O)c3c2',
      5             'Cc1cc2SC(C)(C)CC(C)(C)c2cc1\\N=C(/S)\\Nc3ccc(cc3)S(=O)(=O)N',

~/chembl_webresource_client/query_set.py in get(self, *args, **kwargs)
    174     def get(self, *args, **kwargs):
    175         if args:
--> 176             return self.query.get(*args, **kwargs)
    177         if kwargs:
    178             clone = self._clone()

~/chembl_webresource_client/url_query.py in get(self, *args, **kwargs)
    268     def get(self, *args, **kwargs):
    269         if args:
--> 270             return self._get_by_ids(args[0])
    271         if kwargs and self.allows_list:
    272             return self._get_by_names(*list(kwargs.items())[0])

~/chembl_webresource_client/url_query.py in _get_by_ids(self, ids)
    341         self.logger.info('From cache: {0}'.format(res.from_cache if hasattr(res, 'from_cache') else False))
    342         if res.ok:
--> 343             self._gather_results(res, ret)
    344         else:
    345             handle_http_error(res)

~/chembl_webresource_client/url_query.py in _gather_results(self, request, ret)
    354         elif self.frmt in ('mol', 'sdf'):
    355             sdf_data = request.text.encode('utf-8')
--> 356             ret.extend(sdf_data.split('$$$$\n'))
    357         else:
    358             xml = parseString(request.text)

TypeError: a bytes-like object is required, not 'str'

The problem is that _gather_results splits a bytes object using a regular string as the separator, which leads to the TypeError. In addition, the endpoint /chembl/api/data/molecule/set/ should receive the format as a query parameter (?format=sdf etc.). _get_by_ids does not add it, so the server returns an XML document, which then fails to parse because an SDF file is expected.
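Both points can be illustrated outside the client with a minimal sketch. The SDF payload below is made up and stands in for request.text, and with_format is a hypothetical helper, not a function of chembl_webresource_client. The actual fix in the library may look different.

```python
from urllib.parse import urlencode

# Made-up SDF payload with two records, standing in for request.text.
sdf_text = "mol1\nM  END\n$$$$\nmol2\nM  END\n$$$$\n"

# 1. Reproduce the failure: on Python 3, bytes.split() rejects a str separator.
error = None
try:
    sdf_text.encode("utf-8").split("$$$$\n")
except TypeError as exc:
    error = str(exc)

# Splitting the text as str (or splitting bytes with b"$$$$\n") avoids the
# type mismatch. The trailing empty chunk after the last "$$$$" is dropped.
records = [rec for rec in sdf_text.split("$$$$\n") if rec]


# 2. Hypothetical helper: append the format as a query parameter.
def with_format(url, frmt):
    """Return url with ?format=<frmt> appended (assumes no existing query string)."""
    return url + "?" + urlencode({"format": frmt})


set_url = with_format("https://www.ebi.ac.uk/chembl/api/data/molecule/set/", "sdf")

print(error)
print(len(records))
print(set_url)
```

Splitting as str keeps the records as text. If the rest of the code expects bytes, using a bytes separator instead would be the equivalent change.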

I will prepare a PR to address these problems shortly.