Closed: lyricnz closed this issue 1 year ago
Naive implementation:

import json
import os

import requests


def cached_get_json(url):
    """Gets a JSON response from a URL, caching the result in a file."""
    cache_file = f'cache/{url.replace("/", "_").replace(":", "_")}.json'
    if os.path.exists(cache_file):
        with open(cache_file, "r") as infile:
            return json.load(infile)
    r = requests.get(url, stream=True, headers={"referer": "https://www.nbnco.com.au/"})
    data = r.json()  # parse once, rather than calling r.json() twice
    os.makedirs("cache", exist_ok=True)  # make sure the cache dir exists
    with open(cache_file, "w") as outfile:
        json.dump(data, outfile)
    return data
Then replace
- r = requests.get(lookupUrl + urllib.parse.quote(address["name"]), stream=True, headers={"referer":"https://www.nbnco.com.au/"})
- locID = r.json()["suggestions"][0]["id"]
+ locID = cached_get_json(lookupUrl + urllib.parse.quote(address["name"]))["suggestions"][0]["id"]
and
- r = requests.get(detailUrl + locID, stream=True, headers={"referer":"https://www.nbnco.com.au/"})
- status = r.json()
+ status = cached_get_json(detailUrl + locID)
That's probably too simplistic for massive (all-suburb) runs. And a more specific address->locID and locID->status lookup would be tidier.
This is an interesting idea because I strip most of the data from the response but lots of it can be useful for others.
It would be interesting to see if these complete files could be uploaded to GitHub releases or something to save other people having to smash the NBN API.
Only problem is the limited storage space for GitHub actions (50GB or something)
With a more surgical cache (address->locID), and in a single lookup file (or one per URL-type), I don't think they'd be too large. The raw data, like the code above, is far too large. My own suburb is 42MB and still downloading :)
Something like https://docs.python.org/3/library/shelve.html maybe, as used by https://github.com/ubershmekel/filecache
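For illustration, here's a minimal shelve-based sketch of the same idea. The function name, cache path, and the use of stdlib urllib (instead of requests) are assumptions for the sketch, not the project's actual code:

```python
import json
import shelve
import urllib.request


def cached_get_json(url, cache_path="nbn_cache"):
    """Look the URL up in a shelve database before hitting the network."""
    with shelve.open(cache_path) as cache:
        if url in cache:
            return cache[url]
        req = urllib.request.Request(url, headers={"referer": "https://www.nbnco.com.au/"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        cache[url] = data  # persist for next run
        return data
```

shelve keeps everything in one file (per cache) instead of one file per URL, which avoids the filename-mangling in the naive version.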
Here's a slightly less naive version, using the file-cache module linked above. If you want to minimise deps, we could reimplement some of that. The code above stores a cache in code/main.py.cache for 1 month.
I guess the address cache could be longer, and maybe the status could be shorter.
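That tiered expiry could be expressed with a per-cache TTL on stored entries; a sketch (the specific TTL values here are assumptions, not decided in this thread):

```python
import time

# Assumed TTLs: address->locID changes rarely, status changes more often.
ADDRESS_TTL = 90 * 24 * 3600  # ~3 months
STATUS_TTL = 7 * 24 * 3600    # 1 week


def is_fresh(entry, ttl):
    """Check a cached (timestamp, data) pair against its time-to-live."""
    saved_at, _data = entry
    return time.time() - saved_at < ttl
```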
I believe the GNAF database has a unique ID for every address, and locIDs shouldn't change, so we could have a lookup file keyed on that. I'll have a look at your code.
Do you want to chat about this? LinkedIn, Discord, etc.?
I sent you a message on LinkedIn
I guess a design is: two caches, one for address->locid (semi-permanent), and one for locid->status (expires)
The first cache could use a unique id from the DB table (rather than an address string), to make the cache smaller.
The second could be a simple shelve or similar, or even a json file. Can't store all of them though, too big.
CI (GHA) may run in a slightly different cache mode to people running locally
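The two-cache design above could be sketched with shelve, one file per cache. The function names, paths, and TTL below are illustrative assumptions:

```python
import shelve
import time

STATUS_TTL = 24 * 3600  # assumed expiry for locID->status entries


def get_loc_id(gnaf_pid, lookup_fn, path="locid_cache"):
    """Semi-permanent cache: GNAF PID -> locID. lookup_fn hits the API on a miss."""
    with shelve.open(path) as cache:
        if gnaf_pid not in cache:
            cache[gnaf_pid] = lookup_fn(gnaf_pid)
        return cache[gnaf_pid]


def get_status(loc_id, fetch_fn, path="status_cache"):
    """Expiring cache: locID -> (timestamp, status); refetch when stale."""
    with shelve.open(path) as cache:
        entry = cache.get(loc_id)
        if entry is None or time.time() - entry[0] > STATUS_TTL:
            entry = (time.time(), fetch_fn(loc_id))
            cache[loc_id] = entry
        return entry[1]
```

Keying the first cache on the GNAF PID (rather than the address string) keeps it small, per the point above.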
Alternate implementations of a gnaf_pid->loc_id cache, on a dataset with 5967 addresses:
PR #59 now implements a 1GB disk-cache (SQLite) for gnaf_pid->loc_id and loc_id->status
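The PR is the source of truth for the actual implementation; as a rough illustration of a SQLite-backed key/value cache with optional expiry, here's a stdlib-only sketch (table name and schema are assumptions, not what the PR uses):

```python
import json
import sqlite3
import time


def open_cache(path=":memory:"):
    """Open (or create) a simple key/value cache table in SQLite."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT, saved_at REAL)"
    )
    return conn


def cache_set(conn, key, value):
    """Store a JSON-serialisable value with its save timestamp."""
    conn.execute(
        "INSERT OR REPLACE INTO kv VALUES (?, ?, ?)",
        (key, json.dumps(value), time.time()),
    )
    conn.commit()


def cache_get(conn, key, ttl=None):
    """Return the cached value, or None if missing or older than ttl seconds."""
    row = conn.execute("SELECT value, saved_at FROM kv WHERE key = ?", (key,)).fetchone()
    if row is None:
        return None
    if ttl is not None and time.time() - row[1] > ttl:
        return None
    return json.loads(row[0])
```

A size cap like the PR's 1GB would need an eviction policy on top of this; libraries that wrap SQLite for caching typically handle that for you.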
Merged