Closed: lyricnz closed this issue 1 year ago
Naive implementation:

import json
import os

import requests


def cached_get_json(url):
    """Gets a JSON response from a URL, caching the result in a file."""
    cache_file = f'cache/{url.replace("/", "_").replace(":", "_")}.json'
    if os.path.exists(cache_file):
        with open(cache_file, "r") as infile:
            return json.load(infile)
    r = requests.get(url, stream=True, headers={"referer": "https://www.nbnco.com.au/"})
    data = r.json()  # parse once, rather than calling r.json() twice
    os.makedirs("cache", exist_ok=True)  # make sure the cache dir exists
    with open(cache_file, "w") as outfile:
        json.dump(data, outfile)
    return data
Then replace
- r = requests.get(lookupUrl + urllib.parse.quote(address["name"]), stream=True, headers={"referer":"https://www.nbnco.com.au/"})
- locID = r.json()["suggestions"][0]["id"]
+ locID = cached_get_json(lookupUrl + urllib.parse.quote(address["name"]))["suggestions"][0]["id"]
and
- r = requests.get(detailUrl + locID, stream=True, headers={"referer":"https://www.nbnco.com.au/"})
- status = r.json()
+ status = cached_get_json(detailUrl + locID)
That's probably too simplistic for massive (all-suburb) runs. And a more specific address->locID and locID->status lookup would be tidier.
This is an interesting idea because I strip most of the data from the response but lots of it can be useful for others.
It would be interesting to see if these complete files could be uploaded to GitHub releases or something to save other people having to smash the NBN API.
Only problem is the limited storage space for GitHub actions (50GB or something)
With a more surgical cache (address->locID), and in a single lookup file (or one per URL-type), I don't think they'd be too large. The raw data, like the code above, is far too large. My own suburb is 42MB and still downloading :)
Something like https://docs.python.org/3/library/shelve.html maybe, as used by https://github.com/ubershmekel/filecache
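For illustration, here's a minimal shelve-based sketch of the same idea. The function name, cache path, and the use of stdlib urllib (instead of requests) are assumptions for the sketch, not the project's actual code:

```python
import json
import shelve
import urllib.request


def cached_get_json(url, cache_path="nbn_cache"):
    """Look the URL up in a shelve database before hitting the network."""
    with shelve.open(cache_path) as cache:
        if url in cache:
            return cache[url]
        req = urllib.request.Request(url, headers={"referer": "https://www.nbnco.com.au/"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        cache[url] = data  # persist for next run
        return data
```

shelve keeps everything in one file (per cache) instead of one file per URL, which avoids the filename-mangling in the naive version.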
Here's a slightly less naive version, using the file-cache module linked above. If you want to minimise deps, we could reimplement some of that. The code above stores a cache in code/main.py.cache for 1 month.
I guess the address cache could be longer, and maybe the status could be shorter.
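That tiered expiry could be expressed with a per-cache TTL on stored entries; a sketch (the specific TTL values here are assumptions, not decided in this thread):

```python
import time

# Assumed TTLs: address->locID changes rarely, status changes more often.
ADDRESS_TTL = 90 * 24 * 3600  # ~3 months
STATUS_TTL = 7 * 24 * 3600    # 1 week


def is_fresh(entry, ttl):
    """Check a cached (timestamp, data) pair against its time-to-live."""
    saved_at, _data = entry
    return time.time() - saved_at < ttl
```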
I believe the GNAF database has a unique ID for every address, and locIDs shouldn't change, so we could have a lookup file keyed on that. I'll have a look at your code.
Do you want to chat about this? LinkedIn, Discord, etc.?
I sent you a message on LinkedIn
I guess a design is: two caches, one for address->locid (semi-permanent), and one for locid->status (expires)
The first cache could use a unique id from the DB table (rather than an address string), to make the cache smaller.
The second could be a simple shelve or similar, or even a json file. Can't store all of them though, too big.
CI (GHA) may run in a slightly different cache mode to people running locally
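The two-cache design above could be sketched with shelve, one file per cache. The function names, paths, and TTL below are illustrative assumptions:

```python
import shelve
import time

STATUS_TTL = 24 * 3600  # assumed expiry for locID->status entries


def get_loc_id(gnaf_pid, lookup_fn, path="locid_cache"):
    """Semi-permanent cache: GNAF PID -> locID. lookup_fn hits the API on a miss."""
    with shelve.open(path) as cache:
        if gnaf_pid not in cache:
            cache[gnaf_pid] = lookup_fn(gnaf_pid)
        return cache[gnaf_pid]


def get_status(loc_id, fetch_fn, path="status_cache"):
    """Expiring cache: locID -> (timestamp, status); refetch when stale."""
    with shelve.open(path) as cache:
        entry = cache.get(loc_id)
        if entry is None or time.time() - entry[0] > STATUS_TTL:
            entry = (time.time(), fetch_fn(loc_id))
            cache[loc_id] = entry
        return entry[1]
```

Keying the first cache on the GNAF PID (rather than the address string) keeps it small, per the point above.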
Alternate implementations of a gnaf_pid->loc_id cache, on a dataset with 5967 addresses:
PR #59 now implements a 1GB disk-cache (SQLite) for gnaf_pid->loc_id and loc_id->status
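The PR is the source of truth for the actual implementation; as a rough illustration of a SQLite-backed key/value cache with optional expiry, here's a stdlib-only sketch (table name and schema are assumptions, not what the PR uses):

```python
import json
import sqlite3
import time


def open_cache(path=":memory:"):
    """Open (or create) a simple key/value cache table in SQLite."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT, saved_at REAL)"
    )
    return conn


def cache_set(conn, key, value):
    """Store a JSON-serialisable value with its save timestamp."""
    conn.execute(
        "INSERT OR REPLACE INTO kv VALUES (?, ?, ?)",
        (key, json.dumps(value), time.time()),
    )
    conn.commit()


def cache_get(conn, key, ttl=None):
    """Return the cached value, or None if missing or older than ttl seconds."""
    row = conn.execute("SELECT value, saved_at FROM kv WHERE key = ?", (key,)).fetchone()
    if row is None:
        return None
    if ttl is not None and time.time() - row[1] > ttl:
        return None
    return json.loads(row[0])
```

A size cap like the PR's 1GB would need an eviction policy on top of this; libraries that wrap SQLite for caching typically handle that for you.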
Merged