chembl / chembl_webservices_2

Source code of the ChEMBL web services.
https://www.ebi.ac.uk/chembl/ws

Duplicate activities close to chunk boundaries #115

Closed mnowotka closed 7 years ago

mnowotka commented 7 years ago

It looks like the activity endpoint is returning duplicate activities, although it should return a list of unique activities. More interestingly, the duplicates appear close to 1000-record chunk boundaries (and in fact the cache is implemented to keep results in chunks of 1k records at fixed 0-based offsets, regardless of the actual limit and offset parameters). This may indicate a serious bug in the pagination implementation and may affect other endpoints as well. Example code is below.

https://gist.github.com/flatkinson/fd6737ed9815784e41982eaff441a561

mnowotka commented 7 years ago

For example, by changing the offset so that the result set spans two 1k chunks from the cache, we get a duplicate on the same page:

https://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl_id=CHEMBL5619&limit=1000&offset=2500

Or, narrowing the limit as much as possible to get the smallest result set that still contains duplicates: https://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl_id=CHEMBL5619&limit=3&offset=2998
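To make it clearer why both of these windows touch a chunk boundary, here is a minimal sketch of how a (limit, offset) window maps onto fixed 1k cache chunks. The function name `chunks_for_window` is hypothetical and this is not the actual web services code; it only assumes what is described above, i.e. 0-based chunks of 1000 records at fixed offsets.

```python
def chunks_for_window(offset, limit, chunk_size=1000):
    """Return the 0-based indices of the fixed-size cache chunks that a
    (limit, offset) window overlaps.

    Assumption: chunks start at multiples of chunk_size, independent of
    the limit and offset requested by the client.
    """
    if limit <= 0:
        return []
    first = offset // chunk_size                # chunk holding the first record
    last = (offset + limit - 1) // chunk_size   # chunk holding the last record
    return list(range(first, last + 1))


print(chunks_for_window(2500, 1000))  # [2, 3] -> spans two chunks
print(chunks_for_window(2998, 3))     # [2, 3] -> records 2998, 2999, 3000
print(chunks_for_window(2000, 1000))  # [2]    -> aligned with a single chunk
```

Both problematic requests need records from chunk 2 and chunk 3, so the page has to be stitched together from two cached chunks.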

Now that we have the limit and offset narrowed to (3, 2998), we can apply them to other endpoints, such as molecule:

https://www.ebi.ac.uk/chembl/api/data/molecule.json?molecule_properties__full_mwt__gte=100&limit=3&offset=2998

And we get duplicates in the same places.
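A quick way to check a single page for duplicates is sketched below. It is only an illustration: the helper names `find_duplicates` and `fetch_page` are hypothetical, and the JSON layout (a top-level list named after the resource, e.g. `activities` with an `activity_id` per record, or `molecules` with `molecule_chembl_id`) is an assumption about the response format rather than something confirmed in this thread.

```python
import json
from collections import Counter
from urllib.request import urlopen


def find_duplicates(records, key="activity_id"):
    """Return the sorted key values that occur more than once in records."""
    counts = Counter(record[key] for record in records)
    return sorted(value for value, n in counts.items() if n > 1)


def fetch_page(url, collection="activities"):
    """Download one page and return its list of records.

    Assumption: the response is a JSON object whose `collection` entry
    holds the records of the page.
    """
    with urlopen(url) as response:
        return json.load(response)[collection]


# Offline example with made-up ids that mimics a 3-record page
# containing one repeated record.
page = [{"activity_id": 101}, {"activity_id": 102}, {"activity_id": 101}]
print(find_duplicates(page))  # [101]
```

Against the live service this would be used as `find_duplicates(fetch_page(url))` with one of the URLs above, passing `collection="molecules"` and `key="molecule_chembl_id"` for the molecule endpoint.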

Interestingly, this problem only occurs if at least one filter is applied to the endpoint. And it doesn't happen for some endpoints; for example:

https://www.ebi.ac.uk/chembl/api/data/mechanism.json?mec_id__gt=1&limit=3&offset=2998

https://www.ebi.ac.uk/chembl/api/data/target.json?organism__istartswith=Homo&limit=3&offset=2998

are fine.

mnowotka commented 7 years ago

Fixed.