biotite-dev / biotite

A comprehensive library for computational molecular biology
https://www.biotite-python.org
BSD 3-Clause "New" or "Revised" License

PubChem throttle control does not properly work #596

Open padix-key opened 2 weeks ago

padix-key commented 2 weeks ago

Both search() and fetch() in database.pubchem use the PubChem throttle control to comply with the PubChem usage policy. This means that ThrottleStatus.wait_if_busy() is called whenever the server load reaches a certain threshold.

However, this system does not seem to work correctly: both in local setups and in the CI, biotite.database.RequestError: Too many requests or server too busy is raised, indicating that too many requests were sent. Furthermore, the X-Throttling-Control header is sometimes missing from server responses, leading to KeyError: 'x-throttling-control'. Biotite should be able to handle this.
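One way to avoid the KeyError is to look the header up defensively and treat a missing header as "status unknown". The following is only a minimal sketch, not the Biotite implementation: parse_throttle_header() is a hypothetical helper, and the header format shown in the comment is an assumption about what PubChem sends.

```python
import re

# Assumed header format (not taken from the issue), for example:
# "Request Count status: Green (0%), Request Time status: Green (0%), Service status: Green (20%)"
_STATUS_PATTERN = re.compile(
    r"(Request Count|Request Time|Service) status:\s*\w+\s*\((\d+)%\)"
)


def parse_throttle_header(headers):
    """
    Hypothetical helper: return the load per category as a fraction,
    or None if the response carries no throttling header.
    """
    # HTTP header names are case-insensitive, so normalize the keys first
    lowered = {key.lower(): value for key, value in headers.items()}
    # .get() returns None instead of raising KeyError for a missing header
    value = lowered.get("x-throttling-control")
    if value is None:
        return None
    return {
        name: int(percent) / 100
        for name, percent in _STATUS_PATTERN.findall(value)
    }
```

A caller could then skip the waiting step, or wait for a conservative default time, whenever the function returns None.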

The question is whether exceeding the request limit is a problem with the PubChem database or with the throttle implementation in biotite.database.pubchem. For example, if the very first request already exceeds the throttle threshold, this could be considered an issue with PubChem, because Biotite would have no way to find out the throttle status before this request.

Edit: I experimented with many requests from my local setup and this problem never appeared. I assume it happens frequently in the CI because we share the IP address with other users who also access PubChem. A solution here would be to call pytest.skip() in affected tests if the RequestError appears.
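The skip could be applied with a small decorator, roughly as sketched below. The decorator name skip_on_request_error is hypothetical, and the RequestError class here is only a stand-in so the sketch runs without Biotite; in the actual test suite it would be imported from biotite.database.

```python
import functools

import pytest


class RequestError(Exception):
    """Stand-in for biotite.database.RequestError (assumption for this sketch)."""


def skip_on_request_error(test_func):
    """Hypothetical decorator: skip the test if the server rejects the request."""

    @functools.wraps(test_func)
    def wrapper(*args, **kwargs):
        try:
            return test_func(*args, **kwargs)
        except RequestError as e:
            # pytest.skip() raises a special exception that marks the test as skipped
            pytest.skip(f"PubChem is busy or unavailable: {e}")

    return wrapper


@skip_on_request_error
def fetch_when_server_busy():
    # Simulates a fetch() call that hits the request limit
    raise RequestError("Too many requests or server too busy")


@skip_on_request_error
def fetch_when_server_ok():
    # Simulates a successful fetch() call
    return "ok"
```

Only RequestError is converted into a skip, so genuine test failures such as assertion errors would still be reported.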