chembl / chembl_webresource_client

Official Python client for accessing ChEMBL API
https://www.ebi.ac.uk/chembl/api/data/docs

new_client columns filtering #18

Closed fabricecarles closed 7 years ago

fabricecarles commented 7 years ago

Hi, I am trying to use new_client instead of the old client to get activities for a specific target: new_client.activity.filter(target_chembl_id="CHEMBL4040"). Unfortunately, the resulting data is huge, since new_client returns 30 columns instead of the old client's 10. How can I restrict my query to a list of specific columns, to reduce both download time and data size?

Sincerely

Fabrice Carles

PhD student

mnowotka commented 7 years ago

Hi,

You can't make a REST API call to get just some subset of columns but this never was an issue because all data produced by our API is cached (both on the server and client side) and you can filter out columns once you have them all.

Basically this is a constraint of the REST protocol, not the client. In REST you can't ask for a subset of fields. On the other hand, you can do it using GraphQL (http://graphql.org/learn/), and this is what we are planning to support at some point in the future.
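Filtering columns after download, as suggested in this reply, can be sketched in plain Python. The records below are made-up stand-ins for the dictionaries yielded by `list(new_client.activity.filter(...))`, and the field names in `KEEP` are only an assumption about which columns a user might want.

```python
# Made-up sample rows standing in for the dictionaries returned by
# list(new_client.activity.filter(target_chembl_id="CHEMBL4040")).
records = [
    {"molecule_chembl_id": "CHEMBL1", "standard_type": "IC50",
     "standard_value": "12.0", "standard_units": "nM",
     "target_chembl_id": "CHEMBL4040", "target_organism": "Homo sapiens"},
    {"molecule_chembl_id": "CHEMBL2", "standard_type": "Ki",
     "standard_value": "3.5", "standard_units": "nM",
     "target_chembl_id": "CHEMBL4040", "target_organism": "Homo sapiens"},
]

# Columns to keep (an illustrative choice, not a recommendation).
KEEP = ("molecule_chembl_id", "standard_type", "standard_value", "standard_units")


def select_columns(rows, columns):
    """Return new dicts holding only the requested keys (missing keys become None)."""
    return [{col: row.get(col) for col in columns} for row in rows]


slim = select_columns(records, KEEP)
```

The slimmed list can be passed to `pd.DataFrame(slim)`. This reduces memory use on the client but not download time, since every field is still transferred.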

fabricecarles commented 7 years ago

Hmm, but I don't understand why you chose to return 30 columns by default in new_client.activity. I presume that the table returned by default by the REST API is the result of an SQL view or selection query? In my opinion, adding target and compound information to the activity table gives users more filtering options, but the drawback is that about 2/3 of the returned columns now hold duplicated values, and the total time required to download data with new_client is more than 100 times longer!

mnowotka commented 7 years ago

Not really. The available fields were carefully chosen to satisfy the vast majority of users. I'd be happy to see timings showing that the new client is more than 100x slower; to the best of my knowledge that's not the case. We use the REST API to build complex applications (https://chembl-glados.herokuapp.com/) that use all the fields to provide advanced filtering.

fabricecarles commented 7 years ago

I apologize, I overestimated the time factor a little: the new client is actually about 50-60 times slower.

old client

%time old_df = pd.DataFrame.from_dict(targets.bioactivities('CHEMBL1862'))
CPU times: user 872 ms, sys: 156 ms, total: 1.03 s
Wall time: 8.05 s

new client

bioacts = new_client.activity.filter(target_chembl_id="CHEMBL1862")
%time new_df=pd.DataFrame.from_dict(list(bioacts))
CPU times: user 18.9 s, sys: 1.57 s, total: 20.5 s
Wall time: 7min 10s

And the results are not exactly the same ...

print(old_df.shape)
(11095, 16)
print(new_df.shape)
(11058, 31)

Any idea ?
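The gap between CPU time and wall time in these measurements (about 20 s of CPU against roughly 7 minutes of wall time for the new client) suggests that most of the time is spent waiting rather than computing. Outside IPython, where `%time` is not available, the same two figures can be collected with the standard library. This is a minimal sketch; `timed` is a hypothetical helper, and `time.sleep` stands in for a network-bound call.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time used by this process
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


# Stand-in for a network-bound call: sleeping consumes wall time but almost no CPU.
result, wall, cpu = timed(time.sleep, 0.2)
print(f"wall: {wall:.2f} s, cpu: {cpu:.2f} s")
```

With the real client, the call to time would be something like `timed(lambda: list(bioacts))`.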

mnowotka commented 7 years ago

Thanks for checking this. So yes, I agree that the new web services are slower than the old ones. But this doesn't have anything to do with the number of columns or bandwidth.

First of all, what you see is mostly related to the fact that the new API uses pagination. When you use the client you can't see this (this is why the client is so nice), but behind the scenes the new client fetches data in chunks. The default chunk size is 20 results, but this can be increased up to 1000. Here is how you do it; just put this code at the very top of your script:

from chembl_webresource_client.settings import Settings
Settings.Instance().MAX_LIMIT = 1000

This has to be used BEFORE you import any other client related stuff. Please try that and you should see a significant improvement.
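The effect of the chunk size can be estimated with simple arithmetic. The sketch below assumes one HTTP request per chunk and uses the row count reported for CHEMBL1862 earlier in this thread; `requests_needed` is a hypothetical helper, not part of the client.

```python
import math


def requests_needed(total_rows, page_size):
    """Number of paginated requests required to fetch total_rows rows."""
    return math.ceil(total_rows / page_size)


TOTAL = 11058  # rows returned by the new client for CHEMBL1862 in this thread

default_pages = requests_needed(TOTAL, 20)    # default chunk size
large_pages = requests_needed(TOTAL, 1000)    # chunk size with MAX_LIMIT = 1000

print(default_pages, large_pages)
```

Under that assumption the default setting needs 553 round trips and the larger chunk size needs 12, so per-request latency matters far less once `MAX_LIMIT` is raised.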

Still, there are some more performance issues especially related to 'activities' endpoint as it offers the largest amount of data (we have 14M activities). A new API release that will happen in about 2 weeks from now should fix this.

So I will keep this issue open, and when the new release is out I will ask you to rerun your tests. If the results are fine I'll close the issue.

fabricecarles commented 7 years ago

Thanks, I will try with Settings.Instance().MAX_LIMIT = 1000, but first I have to clear the cache. With the old client I just had to remove the .chembl_ws_client__0.8.50.sqlite file, but how do I do this with the new client? Also, do you have an explanation for the difference in row counts between the old and new clients? I will send you an email about this ...

mnowotka commented 7 years ago

OK, so there are two separate things:

  1. If you want to disable caching temporarily, just for your script, this can be controlled using settings as well, just append this line:

    Settings.Instance().CACHING = False

    just after the:

    Settings.Instance().MAX_LIMIT = 1000
  2. If you want to delete the cache file, the new client changed the location of the cache file from the current directory (polluting the current directory doesn't make sense and the cache file is only local to this current directory) to the hidden file in the home directory. To see it just invoke:

    ls ~/.chembl_ws_client*

    Of course, you can change this default location using settings, just do

    Settings.Instance().CACHE_NAME = '/some/new/location.sqlite'
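Clearing the cache from Python, rather than with `ls` and `rm`, could look like the sketch below. It assumes the cache files match the `.chembl_ws_client*` pattern from the `ls` command above and live in the home directory; if `CACHE_NAME` has been changed, the location will differ. `remove_cache_files` is a hypothetical helper, and it defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path


def remove_cache_files(directory=None, pattern=".chembl_ws_client*", dry_run=True):
    """List (and, if dry_run is False, delete) files matching pattern in directory."""
    directory = Path(directory) if directory is not None else Path.home()
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


# Dry run: only report what would be removed from the home directory.
for cache_file in remove_cache_files():
    print("would remove", cache_file)
```

Calling `remove_cache_files(dry_run=False)` performs the deletion.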

fabricecarles commented 7 years ago

OK, thanks. After removing all cached files and setting Settings.Instance().MAX_LIMIT = 1000 for the new client, the time difference between the old and new clients is around a factor of 25. This is better, but could be better still. Note that bandwidth dramatically affects the total download time, by a factor of 4 between my home network and my institute network. So in my opinion the observed difference could be due to the size of the dataset, and also to the number of returned columns.

Whatever the improvement, there will always be a factor of 2 between the old client and the new one, because I saw that the cache is twice as large with the new client, which is understandable because there are twice as many columns. I suggest doing the join between the Activities, Compound and Target tables on the client, after the data has been downloaded, instead of sending thousands of duplicated values (IDs, SMILES and protein descriptions), as is currently the case in new_client.activity.

Anyway, I will wait for the new version and test it to see the difference. I hope I have made an interesting contribution to the project. Have a good day.
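The client-side join proposed in this comment can be illustrated with a small sketch. It is not how the ChEMBL API works; it only shows the idea of sending each compound and target description once and rebuilding the wide rows locally. The sample rows are made up, and the field names are chosen for illustration.

```python
# Made-up denormalized rows: compound and target details repeat on every row.
records = [
    {"molecule_chembl_id": "CHEMBL1", "canonical_smiles": "CCO",
     "target_chembl_id": "CHEMBL1862", "target_pref_name": "Example target",
     "standard_value": "10"},
    {"molecule_chembl_id": "CHEMBL1", "canonical_smiles": "CCO",
     "target_chembl_id": "CHEMBL1862", "target_pref_name": "Example target",
     "standard_value": "12"},
    {"molecule_chembl_id": "CHEMBL2", "canonical_smiles": "CCN",
     "target_chembl_id": "CHEMBL1862", "target_pref_name": "Example target",
     "standard_value": "300"},
]


def split_tables(rows):
    """Split wide rows into slim activity rows plus molecule and target lookups."""
    molecules, targets, activities = {}, {}, []
    for row in rows:
        molecules[row["molecule_chembl_id"]] = {"canonical_smiles": row["canonical_smiles"]}
        targets[row["target_chembl_id"]] = {"target_pref_name": row["target_pref_name"]}
        activities.append({
            "molecule_chembl_id": row["molecule_chembl_id"],
            "target_chembl_id": row["target_chembl_id"],
            "standard_value": row["standard_value"],
        })
    return activities, molecules, targets


def join_tables(activities, molecules, targets):
    """Rebuild the wide rows on the client from the three smaller tables."""
    return [
        {**act, **molecules[act["molecule_chembl_id"]], **targets[act["target_chembl_id"]]}
        for act in activities
    ]


activities, molecules, targets = split_tables(records)
rebuilt = join_tables(activities, molecules, targets)
```

Here three wide rows reduce to three slim activity rows, two molecule entries and one target entry, and the join restores the original rows.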

mnowotka commented 7 years ago

Yes, that was helpful, thank you. I hope we've managed to solve at least some of your problems.

mnowotka commented 7 years ago

Hi @fabricecarles. Can you please verify the API speed now? Can you see any improvements?

fabricecarles commented 7 years ago

Hi, Indeed, using client version 0.9.13 the results seem to be better

from chembl_webresource_client.settings import Settings
Settings.Instance().MAX_LIMIT = 1000
Settings.Instance().CACHING = False
### old client
%time old_df = pd.DataFrame.from_dict(targets.bioactivities('CHEMBL1862'))
CPU times: user 443 ms, sys: 111 ms, total: 554 ms
Wall time: 1.59 s
### new client
bioacts = new_client.activity.filter(target_chembl_id="CHEMBL1862")
%time new_df=pd.DataFrame.from_dict(list(bioacts))
CPU times: user 1.16 s, sys: 217 ms, total: 1.38 s
Wall time: 18.8 s

Now the difference between the old and new clients is around a factor of 10. Thank you for this improvement. Fabrice
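The slowdown factors quoted in this thread can be checked from the reported wall times. The sketch below uses only the numbers posted above. The two runs were made under different conditions (client version, chunk size, caching), so the comparison is indicative only.

```python
def to_seconds(minutes=0, seconds=0.0):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds


# Wall times reported in this thread for target CHEMBL1862.
first_run = {"old": 8.05, "new": to_seconds(7, 10)}   # initial measurement
second_run = {"old": 1.59, "new": 18.8}               # client 0.9.13, MAX_LIMIT = 1000, caching off

ratio_first = first_run["new"] / first_run["old"]
ratio_second = second_run["new"] / second_run["old"]

print(f"first run: {ratio_first:.1f}x slower, second run: {ratio_second:.1f}x slower")
```

The first ratio falls in the "50-60 times" range mentioned earlier, and the second is a little above the "factor of 10" quoted here.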