What is this

When you use an instance of a data discovery class, it has the option to reuse HTML from previous requests. This:

- avoids unnecessary network activity when the data is updated on a predictable cadence
- reduces the likelihood of your IP getting blocked
- allows discovery to run faster

How this should work

Create a class that abstracts network calls for HTML.
Using this class should be as simple as the existing logic for fetching the HTML, maybe simpler.
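
A rough sketch of the shape this abstraction could take, assuming the cache is a single file on disk and the network call is passed in as a function. The class name `CachedHtml`, its constructor arguments and the `fetch` callable are hypothetical and only illustrate the idea; cache expiry is left out here.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Default network call: download the body of `url`."""
    with urlopen(url) as response:
        return response.read()


class CachedHtml:
    """Hypothetical wrapper: read from disk if cached, else fetch and store."""

    def __init__(self, url: str, cache_path: Path,
                 fetch: Callable[[str], bytes] = fetch_bytes):
        self._url = url
        self._cache_path = cache_path
        self._fetch = fetch

    def read(self) -> bytes:
        # Reuse the HTML saved by a previous request when there is one.
        if self._cache_path.exists():
            return self._cache_path.read_bytes()
        # Otherwise hit the network once and keep a copy for next time.
        body = self._fetch(self._url)
        self._cache_path.parent.mkdir(parents=True, exist_ok=True)
        self._cache_path.write_bytes(body)
        return body
```

Because callers only ever see `read()`, swapping this in for a direct `urlopen(...).read()` call should not make the calling code any more complicated.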

Example of use

Initial example

from urllib.request import urlopen
from bs4 import BeautifulSoup

class MyResourceFetcher:
    def __init__(self, url: str):
        self._url = url

    def get_list_items(self):
        response = urlopen(self._url)
        soup = BeautifulSoup(response.read(), 'html.parser')
        return soup.find_all('li')

    @staticmethod
    def create():
        return MyResourceFetcher(url='https://www.valuergeneral.nsw.gov.au/land_value_summaries/lv.php')

Updated example

from bs4 import BeautifulSoup
from lib.remote_file import RemoteFile, CacheCadence, CacheCountDownDate

class MyResourceFetcher:
    def __init__(self, html: RemoteFile):
        self._html = html

    def get_list_items(self):
        soup = BeautifulSoup(self._html.read(), 'html.parser')
        return soup.find_all('li')

    @staticmethod
    def create():
        lv_summaries_html = RemoteFile(
            id='nswvg_lv_directory',
            extension='html',
            url='https://www.valuergeneral.nsw.gov.au/land_value_summaries/lv.php',

            # how much time has to elapse before the cache is invalidated
            cache_cadence=CacheCadence.month(1),

            # where the countdown starts. For example, if you fetched the resource on the
            # 5th, 13th or 30th of May 2012, the countdown to the cache expiring runs from
            # the `1st of May 2012`. Any request made after the cache has expired gets a
            # new copy of the resource.
            cache_cadence_start=CacheCountDownDate.start_of_the_month(),

            # if any of these errors occur then use the cache
            use_cache_on_error=[404, *range(500, 600)],

            # if you want to use the cache offline
            use_cache_offline=True,
        )

        return MyResourceFetcher(html=lv_summaries_html)
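
To make the `cache_cadence` and `cache_cadence_start` comments above concrete, here is a small sketch of the date arithmetic they describe. The function names are hypothetical, and treating the expiry date itself as already expired (`>=`) is an assumption, not something decided in this issue.

```python
from datetime import date


def start_of_month(day: date) -> date:
    """Where the countdown starts: the 1st of the month of the fetch."""
    return day.replace(day=1)


def add_months(day: date, months: int) -> date:
    """Shift a first-of-month date forward by whole months."""
    total = day.year * 12 + (day.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def cache_expired(fetched_on: date, today: date, months: int = 1) -> bool:
    """True once `today` reaches the end of the cadence window."""
    expires_on = add_months(start_of_month(fetched_on), months)
    return today >= expires_on
```

With a one month cadence, a copy fetched on 13 May 2012 is still used on 31 May 2012 and is replaced by the first request on or after 1 June 2012.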

Questions to answer

Establish where the HTML is stored.
- The obvious solution to me is in web-out.
Establish where the cache countdown date is stored.
- It's probably simple enough to just put it in the file name.
How does the CachedHTML find the latest version of the cache?
1. I think it can probably just check the directory it's stored in.
2. Files will be stored with a predictable file name, something like remotefile-{id}-{date}.{ext}
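
A minimal sketch of that lookup, assuming the `{date}` part is written as an ISO date (`YYYY-MM-DD`) so that sorting file names alphabetically also sorts them chronologically. The helper names and the ISO format are assumptions for illustration only.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def cache_file_name(file_id: str, start: date, ext: str) -> str:
    # ISO dates (YYYY-MM-DD) sort alphabetically in chronological order.
    return f"remotefile-{file_id}-{start.isoformat()}.{ext}"


def find_latest_cache(directory: Path, file_id: str, ext: str) -> Optional[Path]:
    """Return the newest cached file for `file_id`, or None if there is none."""
    matches = sorted(directory.glob(f"remotefile-{file_id}-????-??-??.{ext}"))
    return matches[-1] if matches else None
```

Keeping the date in the file name means no separate metadata store is needed; the directory listing is the index.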

AKST / Australian-Address-Boundaries-Land-Property-Price-Database

Cache HTML used during data discovery #2
