markcutajar opened 4 years ago
@markcutajar Yes, it's caused by Elasticsearch's concept of a maximum result size. The site is currently configured for 10,000 documents as the maximum size for any given search. If you're requesting 50 documents per 'page', then you hit this limit at page 201, i.e. 201*50=10,050. BTW, the error message could be a little more informative.
We can increase the maximum result size, but the Catch-22 in doing so is that this increases the amount of memory required. There are currently 5,885 "pages" in branded foods. So, we would need a maximum result size of ~294,000 (5,885*50) to cover all "pages". I would be reluctant to configure this without a lot of prior testing. We could probably increase the max result size to, say, 50K, which would allow browsing up to 1,000 pages if that would be helpful. I could check with the developers.
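The arithmetic above is easy to sanity-check. A quick sketch (plain Python, no Elasticsearch required) of where a given page size hits the 10,000-document window, and what window a full browse would need:

```python
# With a result window of `window` documents and `per_page` documents per
# page, page p fetches documents up to p * per_page, so the first failing
# page is the first p with p * per_page > window.

def first_failing_page(window, per_page):
    """First page whose last document falls past the result window."""
    return window // per_page + 1

def required_window(total_pages, per_page):
    """Result window needed to browse every page."""
    return total_pages * per_page

print(first_failing_page(10_000, 50))   # 201, as reported above
print(first_failing_page(10_000, 200))  # 51
print(required_window(5_885, 50))       # 294,250 (~294K)
```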
I don't know enough about Elasticsearch to understand why this limitation is in place for paged browsing. I don't recall it being an issue in Solr. Apparently there are workarounds available in Elasticsearch, e.g. a scrolling API, which should be included in the pending "browse" endpoint we're proposing.
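For what it's worth, the workarounds Elasticsearch offers (scroll, `search_after`) all replace a deep `from`/`size` offset with a cursor from the previous page. A conceptual sketch of that keyset-style pagination, with an in-memory list standing in for the real Elasticsearch call:

```python
# Keyset ("search_after"-style) pagination sketch. Each request passes the
# sort key of the last document seen, so the server never has to skip past
# 10,000+ hits to find the requested page. `index` is a stand-in for a real
# Elasticsearch index sorted on a unique key.

def fetch_page(index, after_id=None, size=3):
    """Return up to `size` docs whose id comes after `after_id`."""
    hits = [doc for doc in index if after_id is None or doc["id"] > after_id]
    return hits[:size]

index = [{"id": i} for i in range(1, 11)]  # already sorted by id

collected, cursor = [], None
while True:
    page = fetch_page(index, after_id=cursor)
    if not page:
        break
    collected.extend(page)
    cursor = page[-1]["id"]  # becomes the next request's cursor
```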
@littlebunch thank you for the detailed answer. I understand Elasticsearch has a maximum result size setting; however, I'm not sure how this is still in effect with paged browsing, since the `from` and `end` parameters are given to limit in-memory loading.
The scrolling API would be a good alternative too; is there a timeline on this? On another note, is there a way to get items updated since a specified date?
@markcutajar Yeah, this is surprisingly difficult with e/s. Here's the article I've sent to the developers. Like I say, I never had this issue with Solr.
Is there any other way to collect all the data? I am running into the same issue. Even if I start from page 201, I am getting the same error.
@pappakrishnan All data are available for downloading. It is CSV, which can, of course, be converted into JSON or written into an RDBMS or Access or Excel or whatever. Here's an example of loading into a NoSQL database.
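If SQLite is your target, the standard library is enough. A minimal sketch of loading one of the FDC download CSVs, using an inline sample; the file name and columns here are illustrative, so check the headers of the actual download before adapting:

```python
# Load an FDC-style CSV into SQLite using only the standard library.
# The two-column schema below is a stand-in for the real file's headers.
import csv
import io
import sqlite3

sample_csv = io.StringIO(
    "fdc_id,description\n"
    '1105904,"WESSON Vegetable Oil"\n'
    '1105905,"SWANSON Chicken Broth"\n'
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE food (fdc_id INTEGER PRIMARY KEY, description TEXT)")

# csv.DictReader yields one dict per row, which sqlite3 can bind directly
# to the named placeholders below.
reader = csv.DictReader(sample_csv)
conn.executemany(
    "INSERT INTO food (fdc_id, description) VALUES (:fdc_id, :description)",
    reader,
)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM food").fetchone()[0]
print(count)  # 2
```

For the real files, swap `io.StringIO(...)` for `open("food.csv", newline="")` and build the `CREATE TABLE` from the header row.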
+1 on a fix for this!
I saw this, and this is the wall I am coming across as well. I am requesting 200 per page, and at page 51 it bombs. This only happens for branded foods. I really like the abridged output, but no one at FDC can tell you how to write a SQL command to reproduce it once you import the CSV files into a SQLite database. What is amazing is that their online webpage shows the 300K-plus entries and doesn't crash. But of course that data is not in the format I need.
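A hedged sketch of the kind of join that approximates an abridged-style view from the imported CSVs. The table and column names (`food`, `branded_food`, `fdc_id`, `brand_owner`, `data_type`) are assumptions based on the CSV file names in the download, so verify them against your actual import; the inline rows exist only to make the example self-contained:

```python
# Hypothetical abridged-style query: join the food table to branded_food
# on fdc_id. Schema and sample rows are assumptions, not the real dataset.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE food (fdc_id INTEGER PRIMARY KEY, data_type TEXT, description TEXT);
CREATE TABLE branded_food (fdc_id INTEGER PRIMARY KEY, brand_owner TEXT);
INSERT INTO food VALUES (1, 'branded_food', 'Peanut Butter');
INSERT INTO branded_food VALUES (1, 'ACME Foods');
""")

rows = conn.execute("""
    SELECT f.fdc_id, f.description, b.brand_owner
    FROM food AS f
    JOIN branded_food AS b ON b.fdc_id = f.fdc_id
    WHERE f.data_type = 'branded_food'
""").fetchall()
print(rows)  # [(1, 'Peanut Butter', 'ACME Foods')]
```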
Hi, I am attempting to access the FDC API pages 200+; however, the API returns 500.
Returns:
NB: I'm rate limiting to 1 request per second as per the documentation, and this issue shows up even when I access that page in isolation. This doesn't seem to be a rate issue.
Any ideas on the cause?