USDA / USDA-APIs

Do you have feedback, ideas, or questions for USDA APIs? Use this repository's Issue Tracker to join the discussion.
www.usda.gov/developer

Food Central API errors on pageNumber >= 100 #100

Open bradleybernard opened 3 years ago

bradleybernard commented 3 years ago

I saw a similar issue reported already for pageNumber >= 100 returning:

{
    "timestamp": "2020-08-12T22:22:50.220+0000",
    "status": 500,
    "error": "Internal Server Error",
    "message": "all shards failed",
    "path": "/portal-data/api/v1/foods/list"
}

but I was wondering if there has been any follow-up. I'm looking to integrate with the latest data, but the raw CSVs on the download page were last updated in April and it is now August. I'd like to be able to get the most up-to-date data, since I've found a product that the API returns for a UPC code but that doesn't exist in the flat-file (CSV) export.

Is there any other way to get the most up to date version of the data the API is using, without relying on manual updates to the download data page?

bradleybernard commented 3 years ago

https://github.com/USDA/USDA-APIs/issues/86

This is the issue I was referring to

hphungnal commented 3 years ago

@bradleybernard Sorry for the inconvenience but unfortunately, there is no plan to address this issue at the moment. However, as a workaround, you can search for only Branded foods and sort by the published date to get latest updates.

As you saw, the CSVs were last updated in April. However, this is the latest update for Foundational, SR Legacy, and Survey foods. As for Branded foods, we have monthly updates, usually published at the end of the month (last Thursday). Therefore, you can query for dataType=Branded&sortBy=publishedDate each month to be up to date with the latest data.
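For illustration, the monthly query suggested above could be assembled roughly as follows. This is only a sketch: the base URL and the `api_key` parameter name are assumptions not stated in this thread, `branded_list_url` is a hypothetical helper, and `sortOrder=desc` is added so the newest items come first.

```python
from urllib.parse import urlencode

# Assumed public base URL; the thread itself only shows the path "v1/foods/list".
BASE_URL = "https://api.nal.usda.gov/fdc/v1"


def branded_list_url(api_key, page_number=1, page_size=200):
    """Build a foods/list URL for Branded foods, newest publication first."""
    params = {
        "api_key": api_key,          # assumed name of the auth parameter
        "dataType": "Branded",       # from the suggestion above
        "sortBy": "publishedDate",   # from the suggestion above
        "sortOrder": "desc",         # assumption: newest first
        "pageNumber": page_number,
        "pageSize": page_size,
    }
    return f"{BASE_URL}/foods/list?{urlencode(params)}"


print(branded_list_url("YOUR_API_KEY"))
```

The helper only builds the URL; fetching and error handling are left to whichever HTTP client you prefer.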

bradleybernard commented 3 years ago

I see. Querying with dataType=Branded&sortBy=publishedDate, I still hit the Elasticsearch paging limit, since more than 200 * 50 = 10,000 Branded items have been published since the CSVs were last exported in April.

The last item in the current result set for the v1/foods/list query dataType=Branded&sortBy=publishedDate&sortOrder=desc&pageNumber=50&pageSize=200 is:

"fdcId": 992565,
"description": "BERRY FRUITFUL ORGANIC WHOLE WHEAT BISCUITS CEREAL, BERRY FRUITFUL",
"dataType": "Branded",
"publicationDate": "2020-06-26"

So I can't access the items from 2020-04-29 (CSV export date) --> 2020-06-26 due to paging issues. @hphungnal Is there a solution for that? I realize I could slice the data in various ways to try to get all the elements but I don't see that being scalable or feasible.
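The paging arithmetic behind this can be sketched as below. It assumes 1-based page numbers and a cap of 10,000 results (50 pages of 200, as observed above); the exact boundary on the server may differ by one depending on how it numbers pages, and the function names are hypothetical.

```python
MAX_RESULT_WINDOW = 10_000  # assumed cap: 50 pages * 200 items, as observed above


def last_result_index(page_number, page_size):
    """1-based index of the last result on a page (assumes 1-based pages)."""
    return page_number * page_size


def is_reachable(page_number, page_size, window=MAX_RESULT_WINDOW):
    """True if every result on the page falls inside the assumed cap."""
    return last_result_index(page_number, page_size) <= window


def max_page(page_size, window=MAX_RESULT_WINDOW):
    """Highest page number that still fits inside the assumed cap."""
    return window // page_size


print(is_reachable(50, 200), is_reachable(51, 200), max_page(200))
```

Under these assumptions, anything published earlier than the 10,000th newest item cannot be reached by paging alone, whatever page size is chosen.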

bradleybernard commented 3 years ago

@hphungnal any suggestions?

littlebunch commented 3 years ago

Might be worth mentioning that in addition to the Elasticsearch issue(s) noted here, there are several fields in the CSV format that are not present in the list endpoint formats, most notably serving (or portion) data, modified date, available date, market country and food category. There may be others that I’ve missed. So, unfortunately, I can’t really use the list endpoint to directly update my data stores, e.g. https://rs.littlebunch.com or https://go.littlebunch.com/doc/. One would have to use the list endpoint to gather a list of FDC IDs and pass them to the foods endpoint 20 at a time to fetch the complete data. I’m looking into more efficient method(s).
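The "20 at a time" step could look something like the sketch below. The batch size comes from the comment above; the `{"fdcIds": [...]}` request-body shape and the helper names are assumptions for illustration, so check the API documentation before relying on them.

```python
import json

BATCH_SIZE = 20  # per the comment above


def chunked(ids, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def foods_request_bodies(fdc_ids):
    """One JSON body per batch; the {"fdcIds": [...]} shape is an assumption."""
    return [json.dumps({"fdcIds": batch}) for batch in chunked(list(fdc_ids))]


bodies = foods_request_bodies(range(1, 46))
print(len(bodies))
```

Each body would then be sent to the foods endpoint in its own request, so 10,000 IDs would take 500 requests.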

hphungnal commented 3 years ago

@bradleybernard Apologies for getting back to you so late. Unfortunately, I do not think there is a workaround to getting all the Branded data updates from May until now. What I can do is provide you with a CSV with all of the Branded FDC_IDs that have been added since April - would that work for you? Then going forward, you can query for the monthly updates which should be under the 10k cap. Sorry for the inconvenience!
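A rough sketch of that "going forward" approach: walk the list newest-first and stop at the first item that is not newer than the last sync. Here `fetch_page` is a hypothetical callable standing in for a real foods/list request, the `publicationDate` field name is taken from the response shown earlier in this thread, and the 10,000-result cap is assumed.

```python
from datetime import date


def new_items_since(fetch_page, last_sync, page_size=200, max_results=10_000):
    """Collect items published after last_sync, newest first.

    fetch_page(page_number, page_size) is a hypothetical callable returning one
    page of results (a list of dicts) sorted by publishedDate, descending.
    """
    collected = []
    page_number = 1
    # Only request pages that fit entirely inside the assumed result cap.
    while page_number * page_size <= max_results:
        items = fetch_page(page_number, page_size)
        if not items:
            return collected  # ran out of data
        for item in items:
            if date.fromisoformat(item["publicationDate"]) <= last_sync:
                return collected  # reached data we already have
            collected.append(item)
        page_number += 1
    # The cap was reached first, so older updates are unreachable by paging.
    raise RuntimeError("hit the result cap before reaching last_sync")
```

Raising at the cap instead of returning a partial list makes the gap visible, which is the situation described above for the May to August updates.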

hphungnal commented 3 years ago

@littlebunch Hope you're doing well! I think we were trying to keep the fields limited in the AbridgedFoodDetails (which is what the list endpoint returns) so I am not sure if we want to increase the number of fields for it. That said, I am not sure when the next API update will happen to even propose this; currently the focus is on the DB replacement with neo4j.

I think what would be great is if that graphql idea you started is implemented; then users can request exactly what they need. But if you are building a data store, aren't the CSV downloads better for this? Do you need the monthly updated Branded FDC_IDs as well?

littlebunch commented 3 years ago

Doing well @hphungnal I don't really need the FDC_IDs list, thanks. I was just thinking through what someone might need to do to incorporate the monthly updates into their local datastore.

Yeah, it seems like Branded Foods has the well-defined, structured data that makes for a perfect GraphQL use case. I have a couple of proofs of concept available for anyone interested. The Golang project and demo site use a NoSQL datastore, while the Rust project and demo site use a SQL datastore. Either would be a good starting point for highly performant services, IMO.

bradleybernard commented 3 years ago

@hphungnal @littlebunch I checked and see the same sharding error past page 50 with page size 200, i.e. the request fails at the 10,001st result. Do we know when the next CSV upload will be available for the Branded Foods DB? Is there any progress on exposing the DB for direct download in some format, so that the CSVs don't have to be published manually every few months?

hphungnal commented 3 years ago

Hi @bradleybernard, new CSVs should be released tomorrow (10/30) or early next week. These will contain all the Branded foods up until September. There will still be monthly updates to the Branded foods going forward (the October and November updates might be released together, or the October updates will be released in mid-November). I don't think there is any intention of exposing the DB at this point, especially since FDC will be moving to a graph database. I've tried to start a conversation on a v2 of the API to address some issues, including this one, but there is nothing definitive at the moment and, unfortunately, it is probably not going to happen anytime soon.