What is this
When you use an instance of a data discovery class, it has the option to reuse HTML from previous requests. Doing so:
avoids unnecessary network activity when the data is updated on a predictable cadence
reduces the likelihood of your IP getting blocked
allows discovery code to run faster
How this should work
Create a class that abstracts network calls for HTML
Using this class should be as simple as the existing logic for fetching the HTML, maybe simpler.
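The fallback behaviour this class needs could be sketched roughly as below. This is only an illustration, not the planned implementation: `CachedReader`, `fetch_bytes`, and the `fetch` parameter are hypothetical names, the cache is assumed to be a single file on disk, and the freshness check (the cache cadence) is left out to keep the sketch short.

```python
from pathlib import Path
from typing import Callable, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Default fetcher: a plain urllib request."""
    with urlopen(url) as response:
        return response.read()


class CachedReader:
    """Hypothetical helper: read from the network, fall back to a cached file."""

    def __init__(
        self,
        url: str,
        cache_path: Path,
        use_cache_on_error: Iterable[int] = (),
        use_cache_offline: bool = False,
        fetch: Callable[[str], bytes] = fetch_bytes,
    ):
        self._url = url
        self._cache_path = Path(cache_path)
        self._use_cache_on_error = set(use_cache_on_error)
        self._use_cache_offline = use_cache_offline
        self._fetch = fetch

    def read(self) -> bytes:
        try:
            body = self._fetch(self._url)
        except HTTPError as error:
            # HTTPError is a subclass of URLError, so it has to be handled first.
            if error.code in self._use_cache_on_error and self._cache_path.exists():
                return self._cache_path.read_bytes()
            raise
        except URLError:
            # Raised by urllib when the host cannot be reached (e.g. offline).
            if self._use_cache_offline and self._cache_path.exists():
                return self._cache_path.read_bytes()
            raise
        # A successful fetch refreshes the cached copy.
        self._cache_path.parent.mkdir(parents=True, exist_ok=True)
        self._cache_path.write_bytes(body)
        return body
```

Passing the fetcher in as a parameter is a design choice made here so the fallback paths can be exercised without touching the network.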
Example of use
Initial example
from urllib.request import urlopen
from bs4 import BeautifulSoup

class MyResourceFetcher:
    def __init__(self, url: str):
        self._url = url

    def get_list_items(self):
        # every call hits the network, even if the page has not changed
        response = urlopen(self._url)
        soup = BeautifulSoup(response.read(), 'html.parser')
        return soup.find_all('li')

    @staticmethod
    def create():
        return MyResourceFetcher(url='https://www.valuergeneral.nsw.gov.au/land_value_summaries/lv.php')
Updated example
from bs4 import BeautifulSoup
from lib.remote_file import RemoteFile, CacheCadence, CacheCountDownDate

class MyResourceFetcher:
    def __init__(self, html: RemoteFile):
        self._html = html

    def get_list_items(self):
        soup = BeautifulSoup(self._html.read(), 'html.parser')
        return soup.find_all('li')

    @staticmethod
    def create():
        lv_summaries_html = RemoteFile(
            id='nswvg_lv_directory',
            extension='html',
            url='https://www.valuergeneral.nsw.gov.au/land_value_summaries/lv.php',
            # the amount of time that has to elapse before the cache is invalidated
            cache_cadence=CacheCadence.month(1),
            # the date the countdown starts from. For example, if you fetched the
            # resource on the 5th, the 13th or the 30th of May 2012, the countdown
            # for the cache expiring would start from the `1st of May 2012`. Any
            # request made after the countdown has elapsed gets a new copy of the resource
            cache_cadence_start=CacheCountDownDate.start_of_the_month(),
            # if any of these HTTP errors occur then use the cache
            use_cache_on_error=[404, *range(500, 600)],
            # if you want to use the cache while offline
            use_cache_offline=True,
        )
        return MyResourceFetcher(html=lv_summaries_html)
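The interaction between the cache cadence and the countdown start date in the updated example can be made concrete with a small date calculation. The sketch below is one reading of the comments on those two options, not a settled design: the function names are hypothetical, and it assumes a monthly cadence counted from the 1st of the month in which the resource was fetched.

```python
from datetime import date


def start_of_month(day: date) -> date:
    """The 1st of the month that `day` falls in."""
    return day.replace(day=1)


def add_months(first_of_month: date, months: int) -> date:
    """Add whole months to a date that falls on the 1st of a month."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def cache_expires_on(fetched_on: date, months: int = 1) -> date:
    """Count the cadence from the start of the month, not from the fetch date."""
    return add_months(start_of_month(fetched_on), months)


def is_cache_stale(fetched_on: date, today: date, months: int = 1) -> bool:
    return today >= cache_expires_on(fetched_on, months)


# Fetching on the 5th, 13th or 30th of May 2012 all expire on the same day.
for day in (5, 13, 30):
    print(cache_expires_on(date(2012, 5, day)))
```

Under this reading, a copy fetched on the 30th of May is refreshed on the first request made on or after the 1st of June, so the cache follows the publisher's calendar rather than the time of the last fetch.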
Questions to answer
Establish where the HTML is stored.
The obvious solution to me is web-out.
Establish where the cache countdown date is stored.
It's probably simple enough to just put it in the file name.
How does the CachedHTML find the latest version of the cache?
I think it can probably just check the directory it's stored in.
Files will be stored with a predictable file name, something like remotefile-{id}-{date}.{ext}
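If the date does go in the file name, finding the latest cached copy could be a directory scan along these lines. This is a sketch under assumptions the questions above leave open: `{date}` is taken to be an ISO date (`YYYY-MM-DD`), and `cache_file_name` and `find_latest_cache` are hypothetical helper names.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def cache_file_name(file_id: str, fetched_on: date, extension: str) -> str:
    """Build a name of the form remotefile-{id}-{date}.{ext}."""
    return f"remotefile-{file_id}-{fetched_on.isoformat()}.{extension}"


def find_latest_cache(directory: Path, file_id: str, extension: str) -> Optional[Path]:
    """Return the cached file with the most recent date in its name, if any."""
    prefix = f"remotefile-{file_id}-"
    suffix = f".{extension}"
    dated = []
    for path in directory.glob(f"{prefix}*{suffix}"):
        stamp = path.name[len(prefix):-len(suffix)]
        try:
            dated.append((date.fromisoformat(stamp), path))
        except ValueError:
            # Not a date, e.g. a different id that shares the same prefix.
            continue
    if not dated:
        return None
    return max(dated, key=lambda item: item[0])[1]
```

Parsing the date instead of sorting the names as strings means files that only happen to share the prefix are skipped and never picked as the latest copy.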