new_client.similarity.filter returns 502 errors with low similarity threshold

chembl / chembl_webresource_client

Official Python client for accessing ChEMBL API

https://www.ebi.ac.uk/chembl/api/data/docs

Other

377 stars, 95 forks

new_client.similarity.filter returns 502 errors with low similarity threshold #28

Closed Swarchal closed 6 years ago

Swarchal commented 7 years ago

I keep running into 502 errors when searching for similar molecules based on smile strings. I'm not sure if it's just a bad couple of days for the EMBL servers, or if there's something wrong with the way I'm querying this?

It doesn't seem to fail at a particular smile string, and as it caches, if I re-run it does make progress.

from chembl_webresource_client.new_client import new_client

# active_smiles = list of roughly 1,000 smile strings

similarity_query = new_client.similarity
dark_smiles = []
for smile in active_smiles:
    res = similarity_query.filter(smiles=smile, similarity=70)
    if len(res) == 0:
        dark_smiles.append(smile)

Raises:

HttpBadGateway: Error for url https://www.ebi.ac.uk/chembl/api/data/similarity.json, server response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/chembl/api/data/similarity.json">POST&nbsp;/chembl/api/data/similarity.json</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
<hr>
<address>Apache/2.2.15 (Red Hat) Server at www.ebi.ac.uk Port 80</address>
</body></html>

mnowotka commented 7 years ago

Interesting! Thanks for reporting this, I'll have a look.

mnowotka commented 7 years ago

@Swarchal - can you please paste the list of your smiles if possible? If not, at least one particular that fails? (The list would be better, I could add it to my acceptance tests)

mnowotka commented 7 years ago

Ah, if it 'doesn't seem to fail at a particular smile string, and as it caches, if I re-run it does make progress', I'm not sure there is anything that can be done here. If you provide a large enough SMILES string with a small enough threshold that it yields thousands of results, the time taken to collect them will exceed the Apache timeout and you will get a 502. Next time the result will be taken from the cache, so there is a chance you will get the correct results. I may implement this asynchronously as ChEMBL grows, but this is not a trivial change. Increasing the gateway timeout may solve the problem in most cases but not all of them. A faster cartridge and sharding may also help but, as I said, this won't be an immediate fix.

I suggest you either hammer the API until you get correct results, or download the SMILES and use chemfp while I come up with a better solution on the API side of things. Still, a representative set of SMILES would be helpful.
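The "hammer the API" suggestion can be sketched as a small retry loop with exponential backoff. This is only an illustration, not part of the client: `call_with_retries` and its parameters are hypothetical names, and the usage shown in the trailing comment assumes the `similarity_query` and `smile` variables from the script above.

```python
import time


def call_with_retries(fetch, attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper: waits base_delay * 2**n seconds after the n-th
    failure, and re-raises the last error once attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Assumed usage with the client from this thread (not executed here):
# from chembl_webresource_client.http_errors import HttpBadGateway
# n_hits = call_with_retries(
#     lambda: len(similarity_query.filter(smiles=smile, similarity=70)),
#     retry_on=(HttpBadGateway,),
# )
```

Because a query that timed out may be served from the cache on the next try, a retry has a fair chance of succeeding where the first call failed.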

mnowotka commented 7 years ago

Also, knowing whether it used to be faster would be helpful, in which case I can raise the issue with our DBA team.

Swarchal commented 7 years ago

Wow, quick response.

I've run the same list of smiles before without issues, but that was with a higher similarity threshold (85).

Here's a superset of the smile strings, the ~1,000 I'm using in the code are within there -- hush-hush data and all that.

mnowotka commented 7 years ago

Perfect, I'll have a look. A general note is that as the threshold goes lower, exponentially more similar compounds are found.

Swarchal commented 7 years ago

It runs without issue if I increase similarity from 70 => 75.

mnowotka commented 7 years ago

Good to know. I also checked, and the cartridge is in heavy use at the moment, as we are pregenerating the substructure search cache for the bugfix release on Monday. So please rerun your stuff next week. I'll try as well, and during the release I'll probably tune the timeout so your compounds will (mostly) pass next time.

mnowotka commented 6 years ago

This should be much faster now, with no more 502 errors. @Swarchal, can you please check?

Swarchal commented 6 years ago

Just tried again with the master branch, seem to be getting the same error, but it ran much longer before returning an exception.

Traceback (most recent call last):
  File "test_chembl_fix.py", line 35, in <module>
    if len(res) == 0:
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/query_set.py", line 98, in __len__
    return len(self.query)
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/url_query.py", line 150, in __len__
    self.get_page()
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/url_query.py", line 383, in get_page
    handle_http_error(res)
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/http_errors.py", line 113, in handle_http_error
    raise exception_class(request.url, request.text)
chembl_webresource_client.http_errors.HttpBadGateway: Error for url https://www.ebi.ac.uk/chembl/api/data/similarity.json, server response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/chembl/api/data/similarity.json">POST&nbsp;/chembl/api/data/similarity.json</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
<hr>
<address>Apache/2.2.15 (Red Hat) Server at www.ebi.ac.uk Port 80</address>
</body></html>

mnowotka commented 6 years ago

OK, just to clarify: no changes have been made to the client. On the server side I:

increased the proxy timeout to 300s.
changed the gunicorn worker class from sync to gevent so long-running tasks won't block other requests.
tuned the performance using yandex.tank and as a result increased the number of workers on a single machine from 8 to 24.
configured workers to restart every 1k requests to prevent memory leaks and performance degradation over time.
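For readers unfamiliar with gunicorn, the worker-related changes listed above might look roughly like this in a gunicorn configuration file (which is plain Python). This is a sketch, not the actual ChEMBL server configuration; only the three values named in the list are taken from the comment. The 300s proxy timeout is an Apache-side setting and is not shown.

```python
# gunicorn.conf.py -- illustrative sketch, not the real ChEMBL config

# Cooperative workers (requires the gevent package), so one slow
# similarity search does not block other requests on the same worker.
worker_class = "gevent"

# Number of worker processes on a single machine (raised from 8).
workers = 24

# Restart each worker after it has served this many requests, to limit
# the effect of memory leaks and slow performance degradation.
max_requests = 1000
```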

One thing I don't understand is why the client ignores the TOTAL_RETRIES setting, which defaults to 3. I'll check this, but it still won't solve the problem of similarity running slowly; I need to profile the SQL statements.

mnowotka commented 6 years ago

OK, I've spent some time on this and I believe this is fixed now. Please do the following:

Upgrade the client to the latest version (0.9.30)
This version introduces the "only" operator. "only" specifies which fields should be retrieved. This is important in the case of the "similarity" endpoint because it returns a lot of information about molecules, which is expensive due to many joins. But in your case (which is actually pretty common) you just want to see which molecules are hit (actually you only want to know the number of hits, or whether that number is zero). So you can now instruct the API to return only molecule identifiers and entirely skip the joins:

from chembl_webresource_client.new_client import new_client
similarity_query = new_client.similarity
dark_smiles = []
with open('12K_smile_strings.smi') as f:
    content = f.readlines()

for idx, line in enumerate(content):
    smile = line.strip()
    res = similarity_query.filter(smiles=smile, similarity=70).only(['molecule_chembl_id'])
    print("{0} {1} {2}".format(idx, smile, len(res)))
    if len(res) == 0:
        dark_smiles.append(smile)

If you also want to know the similarity score, replace only(['molecule_chembl_id']) with only(['molecule_chembl_id', 'similarity']).
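As a rough sketch of what could be done with those two fields, the helper below summarizes a list of result records. It assumes each record is a dict with `molecule_chembl_id` and `similarity` keys and that the score may arrive as a string; `summarize_hits` and the sample IDs are hypothetical and used only for illustration.

```python
def summarize_hits(records):
    """Return (hit_count, best_id, best_score) for a list of result dicts.

    Assumes each record has 'molecule_chembl_id' and 'similarity' keys;
    the score is converted with float() in case it is returned as text.
    """
    scored = [(float(r["similarity"]), r["molecule_chembl_id"]) for r in records]
    if not scored:
        return 0, None, None
    best_score, best_id = max(scored)
    return len(scored), best_id, best_score


# Made-up records standing in for
# list(similarity_query.filter(...).only(['molecule_chembl_id', 'similarity']))
sample = [
    {"molecule_chembl_id": "CHEMBL_A", "similarity": "72.5"},
    {"molecule_chembl_id": "CHEMBL_B", "similarity": "100"},
]
count, best_id, best_score = summarize_hits(sample)
```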

PLEASE NOTE: I ran your entire 12k example and didn't get any proxy timeouts in the process. It still took several hours to complete. The smiles from this file are now in the API cache, so it will work much faster (several minutes). If you provide new smiles not yet known to the API it will behave slower, but still much faster than last time, and you shouldn't see any proxy timeouts anymore.

@Swarchal - can you please confirm if this solves your problem?

Swarchal commented 6 years ago

Just tried the script above and it ran without error. Thanks for your work on this, it's a great tool!

mnowotka commented 6 years ago

Perfect! I'm closing this but feel free to reopen in case of any more proxy timeouts.