covidatlas / coronadatascraper

COVID-19 Coronavirus data scraped from government and curated data sources.
https://coronadatascraper.com
BSD 2-Clause "Simplified" License

Pulling from ArcGIS REST API requires supporting pagination #832

shaperilio closed this issue 4 years ago

shaperilio commented 4 years ago

Description

There are two ways to get data from ArcGIS sources: snooping around for the "opendata" CSV, and using the source's REST API. The REST API is preferable in my opinion; you sometimes have to wait for the CSV to be generated (see PR) and I've gotten stale data before. See [doc](https://docs.google.com/document/d/1__rE8LbiB0pK4qqG3vIbjbgnT8SrTx86A_KI7Xakgjk/edit)

Say that you find an ArcGIS REST API URL, e.g. this one for Japan: https://services8.arcgis.com/JdxivnCyd1rvJTrY/arcgis/rest/services/v2_covid19_list_csv/FeatureServer/0/query?where=0%3D0&outFields=*&f=json

If you visit everything up to and including that last "0" (i.e., the feature layer URL without the query), you get a description page with an important configuration parameter, "Max Record Count". Japan has this set to 10,000 and returns a list of individual cases, so if the total number of cases in Japan ever exceeds 10,000, we won't see the count increase.

Pagination is handled with two query fields: resultOffset and resultRecordCount (you can see a query GUI if you change f=json to f=html in the URL).

If your query exceeds the max record count, the resulting JSON will have a field "exceededTransferLimit": true in it.
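For illustration, a paginated response has roughly this shape (field names are from the ArcGIS REST API; the values here are made up):

```javascript
// Shape of an ArcGIS query response that hit the transfer limit.
const response = {
  exceededTransferLimit: true,
  features: [
    { attributes: { OBJECTID: 1 /* ...outFields */ } },
    // ...up to maxRecordCount entries
  ],
};

// The next page starts where this one ended:
const nextOffset = response.features.length;
```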

What we need to do

We should have a fetch.arcGISjson(featureLayerURL, ...) which is aware of this. featureLayerURL should be everything up to that last zero, e.g.

https://services8.arcgis.com/JdxivnCyd1rvJTrY/arcgis/rest/services/v2_covid19_list_csv/FeatureServer/0

To that, we need to add the minimum query statement, which I believe is:

where=0%3D0&outFields=*&f=json (%3D is =)

NOTE: it's probably a good idea to add returnExceededLimitFeatures=true to the query. It seems to be on by default; I can't imagine why anyone would disable it and still give us data, but I can't find documentation for it.

The fetcher should check if the response has "exceededTransferLimit": true in it. If yes, then query again, with

where=0%3D0&outFields=*&resultOffset=n&f=json

where n is equal to the length of the features array in the JSON result. (I don't think we need the resultRecordCount parameter there.)

Querying should continue until "exceededTransferLimit" disappears.
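The loop above could be sketched like this. `arcGISjson` and `fetchJson` are hypothetical names (the real project helper may differ); the fetcher re-queries with a growing resultOffset until exceededTransferLimit disappears from the response:

```javascript
// Sketch of the proposed fetch.arcGISjson. `fetchJson(url)` stands in for
// the project's real HTTP helper and is assumed to return parsed JSON.
async function arcGISjson(featureLayerURL, fetchJson) {
  // Minimum query, plus the (undocumented) returnExceededLimitFeatures flag.
  const base = `${featureLayerURL}/query?where=0%3D0&outFields=*` +
    `&returnExceededLimitFeatures=true&f=json`;

  let features = [];
  let result = await fetchJson(base);
  features = features.concat(result.features);

  // Keep querying until "exceededTransferLimit" disappears; the offset for
  // the next page is the number of features fetched so far.
  while (result.exceededTransferLimit === true) {
    result = await fetchJson(`${base}&resultOffset=${features.length}`);
    features = features.concat(result.features);
  }
  return features;
}
```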

Alternative approach

Because I can't find any documentation on returnExceededLimitFeatures=true as a query parameter, and that's scary, an alternative is to just assume there will always be a limit and do forced pagination. That is, the query should be:

where=0%3D0&outFields=*&resultOffset=n&resultRecordCount=k&f=json

Say we decide k = 500 (probably a safe guess; the default is 2000, but I think I've seen sources with 1000), and then just loop with n = 0, n = 500, n = 1000, ... until features disappears.
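The forced-pagination alternative could look like the sketch below. Again, `fetchJson` is a stand-in for whatever HTTP helper the project uses, and the function name is made up:

```javascript
// Forced pagination: always request pages of size k, never relying on
// exceededTransferLimit or returnExceededLimitFeatures.
async function arcGISPaginated(featureLayerURL, fetchJson, k = 500) {
  const features = [];
  for (let n = 0; ; n += k) {
    const url = `${featureLayerURL}/query?where=0%3D0&outFields=*` +
      `&resultOffset=${n}&resultRecordCount=${k}&f=json`;
    const page = await fetchJson(url);
    // Stop when features disappears or comes back empty.
    if (!page.features || page.features.length === 0) break;
    features.push(...page.features);
    // A short page also means we've reached the end; saves one request.
    if (page.features.length < k) break;
  }
  return features;
}
```

Stopping on a short page is an optimization on top of what's described above; looping until the features array is empty would work equally well.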

Note

Apparently the basic query I've outlined here returns quite a few configuration fields. At least for Japan, we get "maxRecordCount": 10000.

If it were up to me, I'd vote for the "forced pagination" I propose in the "Alternative approach" above.

jzohrab commented 4 years ago

This is implemented in Li; I need to create another issue for pagination for the rest of the sources, but it's implemented for Japan. Thanks for the notes here and in the code elsewhere, it was very helpful for the implementation! jz