MattCowgill / readabs

Download and tidy time series data from the Australian Bureau of Statistics in R
https://mattcowgill.github.io/readabs/

add ability to check if local file exists & is up to date #46

Open MattCowgill opened 5 years ago

MattCowgill commented 5 years ago

- if local file exists AND is up to date, load local file
- if not, get file from ABS

This could be used to form a new argument to read_abs(), something like try_local = TRUE
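
A rough sketch of how that might behave (local_file_is_current() is a hypothetical helper for the up-to-date check; nothing like it exists in the package yet):

library(readabs)

# Sketch only: local_file_is_current() is a hypothetical helper that would
# compare the local file against the latest ABS release
read_abs_cached <- function(cat_no, tables = "all", path = tempdir(),
                            try_local = TRUE) {
  if (try_local && local_file_is_current(cat_no, tables, path)) {
    read_abs_local(cat_no, path = path)    # local file exists & is up to date
  } else {
    read_abs(cat_no, tables, path = path)  # if not, get file from ABS
  }
}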

daviddiviny commented 4 years ago

Have you started on this @MattCowgill ? Do you have any thoughts on how to check the "AND is up to date" condition?

MattCowgill commented 4 years ago

Hi @daviddiviny, @HughParsonage has done some work on this but (unless I've misunderstood) hasn't yet done the "and is up to date" part.

The only ways I can think of to infer whether a file is up to date are:

Thoughts?

MattCowgill commented 4 years ago

& FYI I haven't done much work on readabs in a little while - my plan is to get ggannotate ready for CRAN (still a way off) and then come back to this. I've also been discussing things with various ABS people -- I'm more confident than I was that readabs will survive the transition to a new ABS website.

daviddiviny commented 4 years ago

I would be happy with an argument to read_abs that just does the first part. Unless there is an existing argument I'm missing, I'd rather not have to rewrite code to use read_abs_local?

daviddiviny commented 4 years ago

Is the alternative to just make the argument use_local and set it to FALSE by default? That way the user can specify?
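
Something like the following, say (use_local and read_abs_v2 are purely illustrative; read_abs() has no such argument today):

library(readabs)

# Illustrative only: use_local is not an argument of read_abs() today
read_abs_v2 <- function(cat_no, tables = "all", path = tempdir(),
                        use_local = FALSE) {
  if (use_local) {
    read_abs_local(cat_no, path = path)    # user opts in: read from disk
  } else {
    read_abs(cat_no, tables, path = path)  # default: download, as now
  }
}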

MattCowgill commented 2 years ago

I want to revive/revisit this...

The function readabs::check_latest_date() already does a fair bit of the work. Given a catalogue number (and optional table number(s)) or series ID, it queries the ABS Time Series Directory and returns the latest release date corresponding to that table/catalogue/series. So this does what I'm thinking of:

library(readabs)

# First, get some data the usual way
path <- tempdir()
read_abs("6202.0", 1, path = tempdir())

# Now, re-download it if it has been updated; otherwise load the local version

lfs_latest <- check_latest_date("6202.0", 1)
lfs <- read_abs_local("6202.0", path = path)

if (lfs_latest > max(lfs$date)) {
  lfs <- read_abs("6202.0", 1, path = path)
}

I have some version of that in various analysis scripts of my own, but I think it would be useful functionality to build into the package. I'm just scratching my head a bit about the best way forward. I'd be grateful for any thoughts!
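
For instance, the pattern above could be wrapped into a single function, along these lines (the name read_abs_fresh and the details are illustrative only):

library(readabs)

# Illustrative sketch: re-download only if the ABS has a newer release
read_abs_fresh <- function(cat_no, tables = "all", path = tempdir()) {
  latest <- check_latest_date(cat_no, tables)
  local_data <- tryCatch(read_abs_local(cat_no, path = path),
                         error = function(e) NULL)
  if (!is.null(local_data) && max(local_data$date) >= latest) {
    return(local_data)  # local copy is current: skip the download
  }
  read_abs(cat_no, tables, path = path)  # missing or stale: (re-)download
}

lfs <- read_abs_fresh("6202.0", 1)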

HughParsonage commented 2 years ago

My instinct would be that readabs users would never download directly from the ABS website, and instead query a curated GitHub API[*] that is built by some regular, not-too-frequent process. The process provides this information and is under Matt's control, so it can play nice with the package's end users. This way, querying the Time Series Directory is likely a lot faster, as well as reducing the cost of inadvertent updates.

[*] I'm not talking about something fancy; I'm more thinking of a basic text file, hosted on this repository that says when each catalogue was last updated.
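
To make that concrete, something like this (the file name, columns and URL are all hypothetical):

# Hypothetical: a plain CSV in this repo recording when each catalogue
# was last updated, with columns cat_no and last_updated
meta_url <- "https://raw.githubusercontent.com/MattCowgill/readabs/master/catalogue_updates.csv"
updates <- read.csv(meta_url, stringsAsFactors = FALSE)

# When was 6202.0 last updated, according to the curated file?
as.Date(updates$last_updated[updates$cat_no == "6202.0"])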

MattCowgill commented 2 years ago

I'm not sure I understand what you have in mind, @HughParsonage. At the moment, the process goes:

  1. User requests some table(s)
  2. read_abs() translates that into a query, queries the ABS TSD, and obtains the URL(s) for the requested table(s)
  3. The table(s) are downloaded from the URL(s) given by the TSD
  4. The tables are imported, tidied and returned to the user

If I understand correctly, your proposal would modify step (2). When a user requests a table, the URL for that table would be obtained from a text file hosted in this repo rather than directly from the ABS TSD. Have I understood correctly? If so, I'm not sure how that relates to the "check if updated" process, other than probably saving a fraction of a second (because querying GitHub will likely be faster than the ~0.5 seconds it takes to query the ABS TSD).

HughParsonage commented 2 years ago

As I understand it, the fundamental problem this feature request is trying to solve is the tradeoff between downloading every time and using local files. Downloading every time is much slower, but using local files risks being out of date. So if we can reduce the time it takes to search, download, and clean the tables to the point that the tradeoff is negligible, we've solved the problem.

So now consider running the existing method of going from a user request to the cleaned table as an automated, regular operation that stores the cleaned table for each request, along with the metadata associated with each table. Then the user-visible functions of readabs would only need to access this metadata file and, if the local file requires updating, the stored data.

Much depends on the real timing differences of these approaches vis-à-vis typical user operations. One could, for example, download the metadata file in .onLoad and then refer to it in memory thereafter. Then read_abs(<table>) would first look at this metadata file, determine whether the local file is up to date, and then read locally or download as required. Continuing with this theme, each user could essentially mirror the GitHub repository data, with the update happening on demand with reference to a local metadata file, itself downloaded in .onLoad. Naturally, this would eliminate the time cost of tidying/cleaning the data for the user, as well as reducing unnecessary downloads to just the metadata file (and then, only once per session).
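
In package terms, the shape might be something like this (the metadata URL, file format and helper name are all hypothetical):

# All hypothetical: fetch a metadata file once per session at load time
.readabs_cache <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # Fail quietly so the package still loads when offline
  meta <- tryCatch(
    read.csv(
      "https://raw.githubusercontent.com/MattCowgill/readabs/master/catalogue_updates.csv",
      stringsAsFactors = FALSE
    ),
    error = function(e) NULL
  )
  assign("metadata", meta, envir = .readabs_cache)
}

local_is_current <- function(cat_no, local_date) {
  meta <- get("metadata", envir = .readabs_cache)
  if (is.null(meta)) return(FALSE)  # no metadata: assume stale and re-download
  local_date >= as.Date(meta$last_updated[meta$cat_no == cat_no])
}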

jimjam-slam commented 2 years ago

Your TSD looks like it has columns that could help you decide whether to invalidate your local cached copy:

glimpse(xml_dfs)
# Rows: 114
# Columns: 18
# $ ProductNumber      <chr> "6202.0", "6202.0", "6202.0", "6202.0", "6202.0", "…
# $ ProductTitle       <chr> "Labour Force, Australia", "Labour Force, Australia…
# $ ProductIssue       <date> 2022-02-01, 2022-02-01, 2022-02-01, 2022-02-01, 20…
# $ ProductReleaseDate <date> 2022-03-17, 2022-03-17, 2022-03-17, 2022-03-17, 20…
# $ ProductURL         <chr> "https://www.abs.gov.au/statistics/labour/employmen…
# $ TableURL           <chr> "https://www.abs.gov.au/statistics/labour/employmen…
# $ TableTitle         <chr> "Table 1. Labour force status by Sex, Australia - T…
# $ TableOrder         <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,…
# $ Description        <chr> "Employed total ;  Persons ;", "Employed total ;  P…
# $ Unit               <chr> "000", "000", "000", "000", "000", "000", "000", "0…
# $ SeriesType         <chr> "Trend", "Seasonally Adjusted", "Original", "Trend"…
# $ DataType           <chr> "20", "20", "20", "20", "20", "20", "20", "20", "20…
# $ Frequency          <chr> "Month", "Month", "Month", "Month", "Month", "Month…
# $ CollectionMonth    <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
# $ SeriesStart        <date> 1978-02-01, 1978-02-01, 1978-02-01, 1978-02-01, 19…
# $ SeriesEnd          <date> 2022-02-01, 2022-02-01, 2022-02-01, 2022-02-01, 20…
# $ NoObs              <chr> "529", "529", "529", "529", "529", "529", "529", "5…
# $ SeriesID           <chr> "A84423127L", "A84423043C", "A84423085A", "A8442311…

Could you use the ProductIssue, ProductReleaseDate, SeriesEnd or SeriesID columns to make this decision?
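
For instance, assuming xml_dfs is the TSD result above and a local copy was saved earlier with read_abs(), one could do:

# Assuming xml_dfs is the TSD result above, and a local copy was saved
# earlier with read_abs("6202.0", path = path)
local_lfs <- read_abs_local("6202.0", path = path)

# Treat the local cache as stale if the directory reports newer observations
needs_refresh <- max(xml_dfs$SeriesEnd) > max(local_lfs$date)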

jimjam-slam commented 2 years ago

(Also, I had a look at whether you could use HTTP caching tools, e.g. with {pins}, but unfortunately the ABS uses CloudFront as a server-side cache, and it looks like it strips the local caching info out, afaict 😩)

MattCowgill commented 2 years ago

Hi @jimjam-slam: yes! The readabs::check_latest_date() function returns the maximum value of the SeriesEnd column, and this can be compared to the maximum date in the downloaded spreadsheet(s).