FDA / openfda

openFDA is an FDA project to provide open APIs, raw data downloads, documentation and examples, and a developer community for an important collection of FDA public datasets.
https://open.fda.gov
Creative Commons Zero v1.0 Universal

"Skip value must 25000 or less" prevents access to adverse events data #108

beckyconning closed 4 years ago

beckyconning commented 4 years ago

Why must the skip value be 25000 or less? If I still have requests remaining for the current minute and day, why can't I page through everything from a single query, rather than having to split the results up by receiveddate?

beckyconning commented 4 years ago

For example, say I want all 83622 records about adverse events in cats. Because that is more than 25000 results, this is not possible without arbitrarily windowing the data to create a secondary form of pagination.

API usage limits (240 requests per minute, 120000 per day) make complete sense. However, the restriction on the skip amount is redundant and prevents valid use cases.
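
For readers unfamiliar with the workaround being described, here is a minimal sketch of date windowing: the date range is cut into slices small enough that each one stays under the skip limit, and each slice is paged separately. The endpoint URL, the `animal.species` and `original_receive_date` field names, and the `limit` of 100 are assumptions for illustration and are not taken from this thread.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Assumed endpoint for animal adverse events; not stated in the thread.
BASE_URL = "https://api.fda.gov/animalandveterinary/event.json"


def date_windows(start, end, days):
    """Split [start, end] into consecutive inclusive windows of at most `days` days."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def window_query(base_search, date_field, window, limit=100, skip=0):
    """Build one query URL restricted to a single date window."""
    lo, hi = window
    search = f"{base_search} AND {date_field}:[{lo:%Y%m%d} TO {hi:%Y%m%d}]"
    params = {"search": search, "limit": limit, "skip": skip}
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical search term and date field, for illustration only.
    for window in date_windows(date(2018, 1, 1), date(2018, 3, 31), 30):
        print(window_query('animal.species:"Cat"', "original_receive_date", window))
```

The weakness raised in this issue is visible in the sketch: the window size has to be guessed, and any window that still holds more than 25000 matches must be split again.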

dkrylovsb commented 4 years ago

@beckyconning The thing is, we are using Elasticsearch in the backend. The skip parameter cannot exceed the index.max_result_window index setting, which we have already raised to 25100 from the default of 10K. Search requests take heap memory and time proportional to skip + limit, and index.max_result_window caps that cost. Setting index.max_result_window too high can increase strain on the infrastructure.
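
The arithmetic behind those numbers can be sketched as follows. Elasticsearch rejects a search when from + size exceeds index.max_result_window, so a window of 25100 lines up with a maximum skip of 25000 plus a page of 100. The page size of 100 is an assumption here; the thread gives only the 25000 and 25100 figures.

```python
MAX_RESULT_WINDOW = 25100  # value quoted in the comment above
MAX_SKIP = 25000           # from the API error message in the issue title


def fits_window(skip, limit, window=MAX_RESULT_WINDOW):
    """Return True when skip + limit stays within the result window."""
    return skip + limit <= window


def reachable_records(max_skip=MAX_SKIP, limit=100, window=MAX_RESULT_WINDOW):
    """Largest number of hits reachable with skip/limit paging alone.

    `limit=100` is an assumed page size, not a figure from the thread.
    """
    return min(max_skip + limit, window)


if __name__ == "__main__":
    total_cat_records = 83622  # figure from the issue
    reachable = reachable_records()
    print(f"reachable: {reachable}, unreachable: {total_cat_records - reachable}")
```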

beckyconning commented 4 years ago

Thanks for the informative response : ) @dkrylovsb Would you be able to suggest an approach to retrieving all the adverse reactions for cats?

beckyconning commented 4 years ago

Is there a way to leverage search-after for pagination instead of from? https://www.elastic.co/guide/en/elasticsearch/reference/6.4/search-request-search-after.html

beckyconning commented 4 years ago

This might improve the performance of paginated queries that return between 10k and 25k results, whilst also enabling queries beyond 25k.
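
To illustrate the difference being proposed, here is a toy in-memory model rather than Elasticsearch itself. Offset paging has to rank skip + limit hits for every page, while cursor paging resumes from the last sort value already returned, so the cost per page does not grow with depth. All function names are hypothetical, and the sketch assumes unique, sorted keys (real search-after queries need a tiebreaker field for the same reason).

```python
import bisect


def page_by_offset(sorted_ids, skip, limit):
    """Offset paging: rank skip + limit hits, then discard the first `skip`."""
    ranked = sorted_ids[: skip + limit]  # work grows with skip
    return ranked[skip:]


def page_after(sorted_ids, last_seen, limit):
    """Cursor paging: return up to `limit` ids strictly after `last_seen`."""
    start = 0 if last_seen is None else bisect.bisect_right(sorted_ids, last_seen)
    return sorted_ids[start : start + limit]


def fetch_all(sorted_ids, limit):
    """Walk the whole result set with a cursor, with no depth limit."""
    results, cursor = [], None
    while True:
        page = page_after(sorted_ids, cursor, limit)
        if not page:
            return results
        results.extend(page)
        cursor = page[-1]  # last sort value becomes the next cursor


if __name__ == "__main__":
    ids = list(range(83622))  # stand-in for the 83622 records in the issue
    print(len(fetch_all(ids, 1000)))
```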

dkrylovsb commented 4 years ago

@beckyconning Unfortunately, the version of Elasticsearch currently being used in openFDA does not support the Search After feature.

Have you been able to work around the current skip limit of 25K?

beckyconning commented 4 years ago

Not in a reasonable way. I've also been contacted by multiple other people (researchers, journalists, data scientists) who have the same issue.

beckyconning commented 4 years ago

@dkrylovsb Any chance that search after will be supported? There isn't a way to get the data reliably, and each page is very slow to retrieve.

dkrylovsb commented 4 years ago

@beckyconning Do you have any specific max skip limit value in mind that would help you achieve your research goals?

We will look into supporting a Search After parameter. That will require an Elasticsearch upgrade across major versions, which by itself is a significant undertaking.

beckyconning commented 4 years ago

Awesome, thanks for pushing this forwards @dkrylovsb. For us it would be nice to be able to retrieve any dataset from the openFDA API that our users ask for. Sometimes it just falls within the limits, sometimes it doesn't.

For example, the adverse event reports which include a reference to drugs indicated for osteoarthritis just fit under 25k. At the same time, one researcher's query for reports involving a drug that's commonly prescribed for autism returned over 80k.

Additionally, load times are slow due to the data retrieval method, and of course that can get worse if the limit is relaxed without switching to search after.

Thanks so much!

beckyconning commented 4 years ago

Happy to donate my weekends if I can be of any use. I'm an experienced programmer so I'm happy to take on any aspects of this which would be useful. I can meet during the week to discuss tasks etc.

beckyconning commented 4 years ago

Hi! Can I be of any use? Just need a dev env with sample data to get going. Thanks!

dkrylovsb commented 4 years ago

Thank you for your generous offer, @beckyconning. Unfortunately, the Elasticsearch upgrade is something that would have to be done by our team according to its priorities. I will post back once I have news to report.

beckyconning commented 4 years ago

Thanks so much @dkrylovsb, looking forward to this upgrade : )

dkrylovsb commented 4 years ago

This has been implemented: https://open.fda.gov/apis/paging/
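
The linked page is not reproduced in this thread, so the following is only a hedged sketch of how a client might consume such paging. It assumes that each response carries a `Link` header with `rel="next"` pointing at the following page, and that records sit under a `results` key in the JSON body. Both are assumptions to check against the paging documentation, and `next_url` and `fetch_all` are hypothetical helper names.

```python
import json
import re
import urllib.request

# Matches the target of a rel="next" entry in a Link header.
LINK_NEXT = re.compile(r'<([^>]+)>\s*;\s*rel="?next"?')


def next_url(link_header):
    """Extract the rel="next" URL from a Link header, or None if absent."""
    if not link_header:
        return None
    match = LINK_NEXT.search(link_header)
    return match.group(1) if match else None


def fetch_all(first_url, opener=urllib.request.urlopen):
    """Yield every record, following rel="next" links until none remains."""
    url = first_url
    while url:
        with opener(url) as response:
            payload = json.load(response)
            link = response.headers.get("Link")
        # "results" is the assumed key holding the records.
        yield from payload.get("results", [])
        url = next_url(link)
```

Passing the opener as a parameter keeps the loop testable without network access. A real client would also need to respect the request limits mentioned earlier in the thread.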

beckyconning commented 4 years ago

Whoa

beckyconning commented 4 years ago

🎉 thank you so much!