USDA / USDA-APIs

Do you have feedback, ideas, or questions for USDA APIs? Use this repository's Issue Tracker to join the discussion.
www.usda.gov/developer

FDC API returns 500 when trying to access pages 200+ #86

Open markcutajar opened 4 years ago

markcutajar commented 4 years ago

Hi, I am attempting to access FDC API pages 200 and above; however, the API returns a 500.

```shell
curl --location --request POST 'https://<apikey>@api.nal.usda.gov/fdc/v1/search?format=json%3C=f&sort=n' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header 'Authorization: Basic = <token>' \
--data-raw '{"generalSearchInput":"", "pageNumber": 201}'
```

Returns

```json
{
    "timestamp": "2020-01-13T11:57:28.425+0000",
    "status": 500,
    "error": "Internal Server Error",
    "message": "all shards failed",
    "path": "/portal-data/api/v1/search"
}
```

NB: I'm rate limiting to 1 request per second as per the documentation, and this error shows up even when I access that page in isolation, so it doesn't appear to be a rate-limiting issue.

Any ideas on the cause?

littlebunch commented 4 years ago

@markcutajar Yes, it's caused by Elasticsearch's concept of a maximum result size. The site is currently configured for 10,000 documents as the maximum size for any given search. If you're requesting 50 documents per 'page', then you hit this limit at page 201, i.e. 201*50=10,050. BTW, the error message could be a little more informative.
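The arithmetic above can be sketched as a small client-side guard. This is a hypothetical helper, not part of the FDC API; it just computes the deepest page that stays within Elasticsearch's default 10,000-document result window:

```python
# Hypothetical helper: compute the deepest page that stays within
# Elasticsearch's maximum result size (10,000 documents by default).
def last_safe_page(page_size: int, max_result_window: int = 10_000) -> int:
    """Return the highest page number whose last document index
    (page_number * page_size) does not exceed max_result_window."""
    return max_result_window // page_size

# With 50 results per page, page 200 is the last that works; page 201
# would end at document 10,050 and trigger "all shards failed".
print(last_safe_page(50))    # 200
print(last_safe_page(200))   # 50
```

This also explains the report later in the thread: at 200 results per page, page 50 is the last safe page, so page 51 fails.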

We can increase the maximum result size, but the Catch-22 in doing so is that it increases the amount of memory required. There are currently 5,885 "pages" in branded foods, so we would need a maximum result size of ~294,000 (5,885*50) to cover all of them. I would be reluctant to make that change without a lot of testing beforehand. We could probably increase the max result size to, say, 50K, which would allow browsing up to 1,000 pages if that would be helpful. I could check with the developers.

I don't know enough about Elasticsearch to understand why this limitation applies to paged browsing. I don't recall it being an issue in Solr. Apparently there are workarounds available in Elasticsearch, e.g. a scrolling API, which should be included in the pending "browse" endpoint we're proposing.
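For reference, the usual Elasticsearch-side workaround is `search_after` (or the older scroll API), which pages by sort key instead of offset and therefore never hits the result-window limit. A minimal sketch of how such a query body is built, assuming direct access to the index; the `fdcId` sort field is illustrative, and the public FDC endpoint does not expose this today:

```python
# Sketch of search_after paging against a raw Elasticsearch index.
# Assumes direct index access; the field name "fdcId" is illustrative.
def build_query(size, sort_field="fdcId", after=None):
    """Build an ES-style query body that pages by sort key, not offset,
    so deep pages never exceed the maximum result window."""
    body = {
        "size": size,
        "query": {"match_all": {}},
        "sort": [{sort_field: "asc"}],
    }
    if after is not None:
        # search_after carries the sort value(s) of the last hit seen.
        body["search_after"] = [after]
    return body

# First page has no cursor; each later page passes the previous last hit.
first_page = build_query(50)
next_page = build_query(50, after=170148)
```

Each response's final sort value is fed back in as the next `search_after` cursor, so the client walks the full index in fixed-size steps.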

markcutajar commented 4 years ago

@littlebunch thank you for the detailed answer. I understand Elasticsearch has a maximum result size, but I'm not sure how it still applies to paged browsing, since the from and size parameters are supplied precisely to limit in-memory loading.

The scrolling API would be a good alternative too; is there a timeline for it? On another note, is there a way to get only items updated since a specified date?
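While waiting for a server-side answer to the "updated since" question, one stopgap is filtering on the client. A sketch, assuming each search result carries an ISO-format `publishedDate` field; verify the exact field name and date format against the real FDC response schema:

```python
from datetime import date

# Client-side sketch: keep only results published on or after a cutoff.
# Assumes each result dict has an ISO-format "publishedDate" field --
# check the actual FDC response schema before relying on this.
def updated_since(results, cutoff: date):
    return [
        r for r in results
        if date.fromisoformat(r["publishedDate"]) >= cutoff
    ]

sample = [
    {"fdcId": 1, "publishedDate": "2019-04-01"},
    {"fdcId": 2, "publishedDate": "2020-01-02"},
]
recent = updated_since(sample, date(2020, 1, 1))  # keeps only fdcId 2
```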

littlebunch commented 4 years ago

@markcutajar Yeah, this is surprisingly difficult with Elasticsearch. Here's the article I've sent to the developers. As I said, I never had this issue with Solr.

pappakrishnan commented 4 years ago

Is there any other way to collect all the data? I am running into the same issue: even if I start from page 201, I get the same error.

littlebunch commented 4 years ago

@pappakrishnan All the data are available for download. It is CSV, which can, of course, be converted to JSON or loaded into an RDBMS, Access, Excel, or whatever. Here's an example of loading it into a NoSQL database.
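As a concrete illustration of working from the download instead of the API, here is a minimal sketch that loads a CSV into SQLite using only the Python standard library. The sample file contents and column names are placeholders; use the header row of the actual FDC download:

```python
import csv
import io
import sqlite3

# Minimal sketch: load an FDC-style CSV into SQLite.
# Column names come from the CSV header row; all values load as text.
def load_csv_into_sqlite(csv_text: str, conn: sqlite3.Connection, table: str):
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()

# Usage with an in-memory database and a two-row placeholder sample:
sample = "fdc_id,description\n123,Sample food\n456,Another food\n"
conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(sample, conn, "food")
count = conn.execute('SELECT COUNT(*) FROM "food"').fetchone()[0]  # 2
```

For the real download you would stream each file from disk rather than a string, but the create-table-from-header pattern is the same.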

jeff-foodgraph commented 2 years ago

+1 on a fix for this!

MrAda commented 2 years ago

I saw this, and it is the wall I am running into as well. I am requesting 200 results per page, and it fails at page 51. This only happens for branded foods. I really like the abridged output, but no one at FDC can tell you how to write a SQL query that reproduces it once you import the CSV files into a SQLite database. What is amazing is that their web page can browse all 300K+ entries without crashing, but of course that data is not in the format I need.