Interesting! Thanks for reporting this, I'll have a look.
@Swarchal - can you please paste the list of your SMILES if possible? If not, at least one particular SMILES that fails? (The list would be better, I could add it to my acceptance tests.)
Ah, given that it "doesn't seem to fail at a particular smile string, and as it caches, if I re-run it does make progress", I'm not sure there is anything that can be done here. If you provide a large enough SMILES string with a small enough threshold that it yields thousands of results, then the time taken to collect them will exceed the Apache timeout and you will get a 502. Next time the result will be taken from the cache, so there is a chance you will get the correct results. I may implement this asynchronously as ChEMBL grows, but this is not a trivial change. Increasing the gateway timeout may solve the problem in most cases, but not all of them. A faster cartridge and sharding may also help, but as I said this won't be an immediate fix.
I suggest you either keep hammering the API until you get correct results, or download the SMILES and use chemfp locally, while I come up with a better solution on the API side of things. Still, a representative set of SMILES would be helpful.
Also, knowing whether it used to be faster would be helpful, in which case I can raise the issue with our DBA team.
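As an illustration of the "keep retrying until the result is cached" workaround suggested above, here is a minimal sketch. It reuses the query shape shown later in this thread and the HttpBadGateway exception from the traceback below; the function name, retry count, and wait interval are my own choices, not part of the client.

# Minimal retry sketch: a 502 usually means the first, uncached request timed
# out at the proxy; once the server has cached the search, a later attempt can
# return the result. The retry count and wait interval are arbitrary.
import time

from chembl_webresource_client.new_client import new_client
from chembl_webresource_client.http_errors import HttpBadGateway

similarity_query = new_client.similarity

def similar_ids(smile, threshold=70, max_attempts=5, wait=30):
    for attempt in range(max_attempts):
        try:
            res = similarity_query.filter(smiles=smile, similarity=threshold).only(['molecule_chembl_id'])
            # evaluating the lazy query here is what can raise HttpBadGateway
            return [hit['molecule_chembl_id'] for hit in res]
        except HttpBadGateway:
            time.sleep(wait)  # give the server time to finish and cache the search
    raise RuntimeError('similarity search kept timing out for ' + smile)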
Wow, quick response.
I've run the same list of SMILES before without issues, but that was with a higher similarity threshold (85).
Here's a superset of the SMILES strings; the ~1,000 I'm using in the code are in there -- hush-hush data and all that.
Perfect, I'll have a look. A general note: as the threshold goes lower, the number of similar compounds found grows exponentially.
It runs without issue if I increase similarity from 70 => 75.
Good to know. I also checked, and the cartridge is in heavy use at the moment as we are pregenerating the substructure search cache for the bugfix release on Monday. So please rerun your stuff next week; I'll try as well and will probably tune the timeout during the release so your compounds will (mostly) pass next time.
This should be much faster now, with no more 502 errors. @Swarchal, can you please check?
Just tried again with the master branch, seem to be getting the same error, but it ran much longer before returning an exception.
Traceback (most recent call last):
File "test_chembl_fix.py", line 35, in <module>
if len(res) == 0:
File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/query_set.py", line 98, in __len__
return len(self.query)
File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/url_query.py", line 150, in __len__
self.get_page()
File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/url_query.py", line 383, in get_page
handle_http_error(res)
File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/http_errors.py", line 113, in handle_http_error
raise exception_class(request.url, request.text)
chembl_webresource_client.http_errors.HttpBadGateway: Error for url https://www.ebi.ac.uk/chembl/api/data/similarity.json, server response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/chembl/api/data/similarity.json">POST /chembl/api/data/similarity.json</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
<hr>
<address>Apache/2.2.15 (Red Hat) Server at www.ebi.ac.uk Port 80</address>
</body></html>
OK, just to clarify: no changes have been made to the client. On the server side I:
- switched the WSGI workers from sync to gevent, so a long-running task won't block other requests
- load-tested with yandex.tank and, as a result, increased the number of workers on a single machine from 8 to 24.
One thing I don't understand is why the client ignores the TOTAL_RETRIES setting, which defaults to 3. I'll check this, but it still won't solve the problem of similarity searches running slowly; I need to profile the SQL statements.
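Purely as an illustration of the worker change described above, a hypothetical gunicorn configuration might look like the sketch below. The thread does not say which WSGI server the API actually runs behind; only the sync to gevent switch and the 8 to 24 worker count come from the comment above, everything else is a placeholder.

# gunicorn.conf.py (hypothetical -- the real deployment is not described in this thread)
bind = '127.0.0.1:8000'   # placeholder bind address
worker_class = 'gevent'   # cooperative workers: one slow similarity search no longer blocks other requests
workers = 24              # raised from 8 after load testing
timeout = 300             # placeholder; extra headroom for long-running searches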
OK, I've spent some time on this and I believe this is fixed now. Please do the following:
from chembl_webresource_client.new_client import new_client

similarity_query = new_client.similarity

dark_smiles = []
with open('12K_smile_strings.smi') as f:
    content = f.readlines()

for idx, line in enumerate(content):
    smile = line.strip()
    res = similarity_query.filter(smiles=smile, similarity=70).only(['molecule_chembl_id'])
    print("{0} {1} {2}".format(idx, smile, len(res)))
    if len(res) == 0:
        dark_smiles.append(smile)
If you also want to know the similarity score, replace only(['molecule_chembl_id']) with only(['molecule_chembl_id', 'similarity']).
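For example (a sketch, assuming the results iterate as dictionaries keyed by the fields passed to only(); 'CCO' is just a placeholder SMILES, not one from the dataset):

from chembl_webresource_client.new_client import new_client

similarity_query = new_client.similarity
res = similarity_query.filter(smiles='CCO', similarity=70).only(['molecule_chembl_id', 'similarity'])
for hit in res:
    print(hit['molecule_chembl_id'], hit['similarity'])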
PLEASE NOTE: I ran your entire 12k example and didn't get any proxy timeouts in the process. It still took several hours to complete. The SMILES from this file are now in the API cache, so it will run much faster (several minutes). If you provide new SMILES not yet known to the API it will be slower, but still much faster than last time, and you shouldn't see any proxy timeouts anymore.
@Swarchal - can you please confirm if this solves your problem?
Just tried the script above and it ran without error. Thanks for your work on this, it's a great tool!
Perfect! I'm closing this but feel free to reopen in case of any more proxy timeouts.
I keep running into 502 errors when searching for similar molecules based on SMILES strings. I'm not sure if it's just a bad couple of days for the EMBL servers, or if there's something wrong with the way I'm querying this.
It doesn't seem to fail at a particular SMILES string, and as it caches, if I re-run it does make progress.
Raises the 502 Proxy Error shown in the traceback above.