Open MattCowgill opened 5 years ago
Have you started on this, @MattCowgill? Do you have any thoughts on how to check the "AND is up to date" condition?
Hi @daviddiviny, @HughParsonage has done some work on this but (unless I've misunderstood) hasn't yet done the "and is up to date" part.
The only ways I can think of to infer whether or not a file is up to date are:
Thoughts?
& FYI I haven't done much work on `readabs` in a little while - my plan is to get `ggannotate` ready for CRAN (still a way off) and then come back to this. I've also been discussing things with various ABS people -- I'm more confident than I was that `readabs` will survive the transition to a new ABS website.
I would be happy with an argument to `read_abs()` that just does the first part. Unless there is an argument I am already missing, I'd rather not have to rewrite code to use `read_abs_local()`?
Is the alternative to just make the argument `use_local` and set it to `FALSE` by default? That way the user can specify?
I want to revive/revisit this...
The function `readabs::check_latest_date()` already does a fair bit of the work. Given a catalogue number (and optional table number(s)) or series ID, it queries the ABS Time Series Directory and returns the latest release date corresponding to that table/catalogue/series. So this does what I'm thinking of:
```r
library(readabs)

# First, get some data the usual way
path <- tempdir()
read_abs("6202.0", 1, path = path)

# Now, re-download it if it has been updated; otherwise load the local version
lfs_latest <- check_latest_date("6202.0", 1)
lfs <- read_abs_local("6202.0", path = path)

if (lfs_latest > max(lfs$date)) {
  lfs <- read_abs("6202.0", 1, path = path)
}
```
I have some version of that in various analysis scripts of my own, but I think it would be useful functionality to build into the package. I'm just scratching my head a bit about the best way forward. I'd be grateful for any thoughts!
My instinct would be that `readabs` users would never download directly from the ABS website, and would instead query a curated GitHub API[*] that is built by some regular, not too frequent process. The process provides this information and is under Matt's control, so it can play nicely with the package's end users. This way the action of querying the Time Series Directory is likely a lot faster, as well as reducing the cost of inadvertently updating.
[*] I'm not talking about something fancy; I'm more thinking of a basic text file hosted in this repository that says when each catalogue was last updated.
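To be concrete, the kind of file I have in mind is tiny; hypothetically it might look something like this (the column names and values are purely illustrative):

```
catalogue,table,release_date,url
6202.0,1,2022-03-17,https://www.abs.gov.au/...
6202.0,2,2022-03-17,https://www.abs.gov.au/...
6401.0,1,2022-01-25,https://www.abs.gov.au/...
```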
I'm not sure I understand what you have in mind, @HughParsonage. At the moment, the process goes:

1. The user requests a table via `read_abs()`
2. `read_abs()` translates that into a query, queries the ABS TSD, and obtains the URL(s) for the requested table(s)
3. The file(s) are downloaded and tidied

If I understand correctly, your proposal would modify step (2). When a user requests a table, the URL for that table would be obtained from a text file hosted in this repo rather than directly from the ABS TSD. Have I understood correctly? If so, I'm not sure how that relates to the "check if updated" process, other than probably saving a fraction of a second (because querying GitHub will likely be faster than the ~0.5 seconds it takes to query the ABS TSD).
As I understand it, the fundamental problem this feature request is trying to solve is that there is a tradeoff between downloading every time and using local files. Downloading every time is much slower, but using local files risks them not being up to date. So if we can reduce the time it takes to search, download, and clean the tables to the point that the tradeoff is negligible, we've solved the problem.
So now consider running the existing method of going from a user request for a table to the cleaned table itself as an automated, regular operation that stores the cleaned table for each request, along with the metadata associated with each table. Then the user-visible functions of `readabs` would only need to access this metadata file and, if the file requires updating, the stored data.
Much depends on the real timing differences of these approaches vis-à-vis typical user operations. One could, for example, download the metadata file in `.onLoad()` and then refer to it in memory thereafter. Then `read_abs(<table>)` would first look at this metadata file, determine whether the local file is up to date, and then read locally or download as required. Continuing with this theme, each user could essentially mirror the GitHub repository data, with the update happening on demand with reference to a local metadata file, the metadata file itself being downloaded in `.onLoad()`. Naturally, this would eliminate the time cost of tidying/cleaning the data for the user, as well as reducing the cost of unnecessary downloads to only the metadata file (and then, only once per session).
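To make that concrete, here is a rough sketch of what the package side might look like. Everything in it is hypothetical: the metadata URL, the column names, and the `local_is_current()` helper are illustrative, not existing readabs code.

```r
# Purely hypothetical sketch: none of these objects exist in readabs today.
# The idea is to fetch a small metadata file once per session in .onLoad()
# and consult the in-memory copy thereafter.

.readabs_cache <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # URL of the (hypothetical) metadata file maintained in this repository
  meta_url <- "https://raw.githubusercontent.com/MattCowgill/readabs/master/data-raw/metadata.csv"

  # Fail quietly if offline; callers can fall back to querying the TSD directly
  meta <- tryCatch(
    utils::read.csv(meta_url, stringsAsFactors = FALSE),
    error = function(e) NULL
  )
  assign("metadata", meta, envir = .readabs_cache)
}

# A user-facing function could then consult the cached metadata:
local_is_current <- function(cat_no, local_release_date) {
  meta <- get("metadata", envir = .readabs_cache)
  if (is.null(meta)) return(FALSE)     # no metadata: assume a download is needed
  rows <- meta$catalogue == cat_no
  if (!any(rows)) return(FALSE)        # catalogue not listed: download
  local_release_date >= max(as.Date(meta$release_date[rows]))
}
```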
Your TSD looks like it has columns that could help you decide whether to invalidate your local cached copy:
```r
glimpse(xml_dfs)
# Rows: 114
# Columns: 18
# $ ProductNumber      <chr> "6202.0", "6202.0", "6202.0", "6202.0", "6202.0", "…
# $ ProductTitle       <chr> "Labour Force, Australia", "Labour Force, Australia…
# $ ProductIssue       <date> 2022-02-01, 2022-02-01, 2022-02-01, 2022-02-01, 20…
# $ ProductReleaseDate <date> 2022-03-17, 2022-03-17, 2022-03-17, 2022-03-17, 20…
# $ ProductURL         <chr> "https://www.abs.gov.au/statistics/labour/employmen…
# $ TableURL           <chr> "https://www.abs.gov.au/statistics/labour/employmen…
# $ TableTitle         <chr> "Table 1. Labour force status by Sex, Australia - T…
# $ TableOrder         <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,…
# $ Description        <chr> "Employed total ;  Persons ;", "Employed total ;  P…
# $ Unit               <chr> "000", "000", "000", "000", "000", "000", "000", "0…
# $ SeriesType         <chr> "Trend", "Seasonally Adjusted", "Original", "Trend"…
# $ DataType           <chr> "20", "20", "20", "20", "20", "20", "20", "20", "20…
# $ Frequency          <chr> "Month", "Month", "Month", "Month", "Month", "Month…
# $ CollectionMonth    <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
# $ SeriesStart        <date> 1978-02-01, 1978-02-01, 1978-02-01, 1978-02-01, 19…
# $ SeriesEnd          <date> 2022-02-01, 2022-02-01, 2022-02-01, 2022-02-01, 20…
# $ NoObs              <chr> "529", "529", "529", "529", "529", "529", "529", "5…
# $ SeriesID           <chr> "A84423127L", "A84423043C", "A84423085A", "A8442311…
```
Could you use the `ProductIssue`, `ProductReleaseDate`, `SeriesEnd` or `SeriesID` columns to make this decision?
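For example, something along these lines (a rough sketch only; `xml_dfs` is the TSD result above, and `path` and the filename are stand-ins for wherever a previous `read_abs()` call saved the spreadsheet):

```r
library(readabs)
library(dplyr)

# Latest release for the catalogue, according to the TSD result above
tsd_release <- xml_dfs %>%
  filter(ProductNumber == "6202.0") %>%
  summarise(latest = max(ProductReleaseDate)) %>%
  pull(latest)

# Compare against when the local copy was saved (filename is illustrative only)
local_file <- file.path(path, "6202001.xls")
needs_refresh <- !file.exists(local_file) ||
  as.Date(file.mtime(local_file)) < tsd_release

if (needs_refresh) {
  read_abs("6202.0", 1, path = path)
}
```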
(Also, I had a look at whether you could use HTTP caching tools - e.g. with `{pins}` - but unfortunately the ABS uses CloudFront as a server-side cache, and it looks like it strips the local caching info out, afaict 😩)
Hi @jimjam-slam: yes!
The `readabs::check_latest_date()` function returns the maximum value of the `SeriesEnd` column, and this can be compared to the maximum date in the downloaded spreadsheet(s):

> if local file exists AND is up to date, load local file
> if not, get file from ABS
This could be used to form a new argument to `read_abs()`, something like `try_local = TRUE`.
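For discussion, a rough sketch of how that might hang together. The wrapper name and the exact behaviour are assumptions rather than settled API; it just wires together `check_latest_date()` and `read_abs_local()` as described above.

```r
# Hypothetical sketch only: the wrapper name and the try_local argument don't
# exist yet; this is just one way the behaviour could be wired together.
read_abs_cached <- function(cat_no, tables = "all", path = tempdir(),
                            try_local = TRUE, ...) {
  if (isTRUE(try_local)) {
    local_data <- tryCatch(
      read_abs_local(cat_no, path = path),
      error = function(e) NULL
    )
    if (!is.null(local_data)) {
      latest <- check_latest_date(cat_no, tables)
      # If the local copy already contains the latest release, use it
      if (max(local_data$date, na.rm = TRUE) >= latest) {
        return(local_data)
      }
    }
  }
  # Otherwise fall back to downloading from the ABS as usual
  read_abs(cat_no, tables, path = path, ...)
}
```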