MattCowgill / readabs

Download and tidy time series data from the Australian Bureau of Statistics in R
https://mattcowgill.github.io/readabs/

Read abs api 2022 update #198

Closed baslat closed 2 years ago

baslat commented 2 years ago

Hi Matt, here is a PR to add functionality to read the ABS API. I've put it as draft as I'm still messing about for some edge cases, but thought you might like to start taking a look.

Since Annabel's work a year or so ago, the ABS has changed its API, so I've basically rewritten everything. The main function is read_abs_api(). It calls a few internal functions, the workhorse being the unexported tidy_api_data().

Reading the API requires overcoming two main challenges:

  1. the API returns two data sets, one with data and one with metadata. These need to be combined to provide a tidy result (accomplished by tidy_api_data()).
  2. some API URLs are too big to query, so they need to be broken into many smaller URLs. This is partially accomplished by chunk_query_url() and documented examples.
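
A minimal base R sketch of the two ideas above, using toy data. The helpers `label_codes()` and `chunk_codes()` are hypothetical names for illustration only; they are not the PR's `tidy_api_data()` or `chunk_query_url()`. Joining several codes with `+` is an assumption based on the usual SDMX REST key convention.

```r
# Hypothetical illustration only: not the functions from this PR.

# Challenge 1: attach human-readable labels (metadata) to coded observations.
obs <- data.frame(
  REGION = c("1", "2", "1"),
  OBS_VALUE = c(10, 20, 30),
  stringsAsFactors = FALSE
)
meta <- data.frame(
  var = "REGION",
  code = c("1", "2"),
  label = c("New South Wales", "Victoria"),
  stringsAsFactors = FALSE
)

label_codes <- function(data, metadata, var) {
  lookup <- metadata[metadata$var == var, ]
  # match() keeps the row order of `data` and returns NA for unknown codes
  data[[paste0(var, "_label")]] <- lookup$label[match(data[[var]], lookup$code)]
  data
}

tidy <- label_codes(obs, meta, "REGION")

# Challenge 2: split a long vector of codes into chunks,
# then build one (shorter) query key per chunk.
chunk_codes <- function(codes, chunk_size) {
  groups <- ceiling(seq_along(codes) / chunk_size)
  unname(split(codes, groups))
}

chunks <- chunk_codes(as.character(1:7), chunk_size = 3)
# Assumed convention: several codes within one dimension are joined with "+"
keys <- vapply(chunks, paste, character(1), collapse = "+")
```

Each element of `keys` could then be placed in its own request URL and the results row-bound afterwards; the right chunk size would depend on the URL length the server accepts.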

Happy to discuss!

I still need to:

baslat commented 2 years ago

It's ready for a proper review now. However, the readme re-rendered on my machine, so now all the download paths refer to my local machine. Do you have a GitHub Action to re-render the readme on merge, or something similar?

kinto-b commented 2 years ago

Hey! I've been playing around with the API lately myself. I went a slightly different direction to you @baslat. I think our approaches may complement one another nicely. If you're interested, I can pop up a PR of my own for you (and @MattCowgill, great package by the way!) to take a look at.

Here's a sample bit of code using the interface I wrote so you can get the flavour:


# List available flows
abs_dataflows()

#> # A tibble: 504 x 4
#>    id                               name               desc              version
#>    <chr>                            <chr>              <chr>             <chr>  
#>  1 ABORIGINAL_POP_PROJ              Projected populat~ Contains estimat~ 1.0.0  
#>  2 ABORIGINAL_POP_PROJ_REMOTE       Projected populat~ Contains estimat~ 1.0.0  
#>  3 ABS_ABORIGINAL_POPPROJ_INDREGION Projected populat~ Contains estimat~ 1.0.0  
#>  4 ABS_ACLD_LFSTATUS                Australian Census~ The Australian C~ 1.0.0  
#>  5 ABS_ACLD_TENURE                  Australian Census~ The Australian C~ 1.0.0  
#>  6 ABS_ACLD_UNPAIDASST              Australian Census~ The Australian C~ 1.0.0  
#>  7 ABS_ACLD_VOLWORK                 Australian Census~ The Australian C~ 1.0.0  
#>  8 ABS_ANNUAL_ERP_ASGS              ERP by SA2 and ab~ Estimated Reside~ 1.0.0  
#>  9 ABS_ANNUAL_ERP_ASGS2016          ERP by SA2 and ab~ Estimated Reside~ 1.0.0  
#> 10 ABS_ANNUAL_ERP_LGA2016           ERP by LGA (ASGS ~ Estimated Reside~ 1.0.0  
#> # ... with 494 more rows

# Get full data set for a given flow by providing id:
x <- abs_data("RES_DWELL")
tibble::as_tibble(x)

#> # A tibble: 4,536 x 9
#>         MEASURE REGION      FREQ    TIME_PERIOD OBS_VALUE UNIT_MEASURE UNIT_MULT
#>       <dbl+lbl> <chr+lbl>   <chr+l> <chr>           <dbl> <chr+lbl>    <dbl+lbl>
#>  1 1 [Number o~ 3RQLD [Res~ Q [Qua~ 2003-Q3         17000 NUM [Number] 0 [Units]
#>  2 1 [Number o~ 3RQLD [Res~ Q [Qua~ 2003-Q4         15007 NUM [Number] 0 [Units]
#>  3 1 [Number o~ 3RQLD [Res~ Q [Qua~ 2004-Q1         14930 NUM [Number] 0 [Units]
#>  4 1 [Number o~ 3RQLD [Res~ Q [Qua~ 2004-Q2         13054 NUM [Number] 0 [Units]
#>  5 1 [Number o~ 3RQLD [Res~ Q [Qua~ 2004-Q3         13264 NUM [Number] 0 [Units]
#>  6 1 [Number o~ 3RQLD [Res~ Q [Qua~ 2004-Q4         13349 NUM [Number] 0 [Units]
#>  7 1 [Number o~ 3RQLD [Res~ Q [Qua~ 2005-Q1         13591 NUM [Number] 0 [Units]
#>  8 1 [Number o~ 3RQLD [Res~ Q [Qua~ 2005-Q2         12026 NUM [Number] 0 [Units]
#>  9 1 [Number o~ 3RQLD [Res~ Q [Qua~ 2005-Q3         12954 NUM [Number] 0 [Units]
#> 10 1 [Number o~ 3RQLD [Res~ Q [Qua~ 2005-Q4         12749 NUM [Number] 0 [Units]
#> # ... with 4,526 more rows, and 2 more variables: OBS_STATUS <chr+lbl>,
#> #   OBS_COMMENT <lgl>

# Get filtered data using datakey:
y <- abs_data("ABS_C16_G49_SA", datakey = ".....0")
tibble::as_tibble(y)

#> # A tibble: 480 x 11
#>    OCCP_C16       SEX_ABS QALLP_C16       STATE REGIONTYPE ASGS_2016 TIME_PERIOD
#>    <chr+lbl>     <dbl+lb> <chr+lbl>     <dbl+l> <chr+lbl>  <chr+lbl>       <int>
#>  1 5 [Clerical ~ 3 [Pers~ TOT [Total]   0 [Aus~ AUS [Aust~ 0 [Austr~        2016
#>  2 5 [Clerical ~ 3 [Pers~ 22 [Graduate~ 0 [Aus~ AUS [Aust~ 0 [Austr~        2016
#>  3 8 [Labourers] 2 [Fema~ 40 [Advanced~ 0 [Aus~ AUS [Aust~ 0 [Austr~        2016
#>  4 TOT [Total]   3 [Pers~ 40 [Advanced~ 0 [Aus~ AUS [Aust~ 0 [Austr~        2016
#>  5 TOT [Total]   3 [Pers~ 0 [Level of ~ 0 [Aus~ AUS [Aust~ 0 [Austr~        2016
#>  6 2 [Professio~ 2 [Fema~ TOT [Total]   0 [Aus~ AUS [Aust~ 0 [Austr~        2016
#>  7 3 [Technicia~ 2 [Fema~ 50 [Certific~ 0 [Aus~ AUS [Aust~ 0 [Austr~        2016
#>  8 8 [Labourers] 2 [Fema~ 0 [Level of ~ 0 [Aus~ AUS [Aust~ 0 [Austr~        2016
#>  9 3 [Technicia~ 2 [Fema~ 10 [Postgrad~ 0 [Aus~ AUS [Aust~ 0 [Austr~        2016
#> 10 4 [Community~ 1 [Male~ 0 [Level of ~ 0 [Aus~ AUS [Aust~ 0 [Austr~        2016
#> # ... with 470 more rows, and 4 more variables: OBS_VALUE <int>,
#> #   UNIT_MEASURE <chr+lbl>, OBS_STATUS <chr+lbl>, OBS_COMMENT <lgl>

# Get metadata (useful to figure out how to build a `datakey`)
z <- abs_datastructure("ABS_C16_G49_SA")
tibble::as_tibble(z)

#> # A tibble: 3,008 x 6
#>    role      var      position desc       code  label                           
#>    <chr>     <chr>    <chr>    <chr>      <chr> <chr>                           
#>  1 dimension OCCP_C16 1        Occupation 1     Managers                        
#>  2 dimension OCCP_C16 1        Occupation 2     Professionals                   
#>  3 dimension OCCP_C16 1        Occupation 3     Technicians and Trades Workers  
#>  4 dimension OCCP_C16 1        Occupation 4     Community and Personal Service ~
#>  5 dimension OCCP_C16 1        Occupation 5     Clerical and Administrative Wor~
#>  6 dimension OCCP_C16 1        Occupation 6     Sales Workers                   
#>  7 dimension OCCP_C16 1        Occupation 7     Machinery Operators and Drivers 
#>  8 dimension OCCP_C16 1        Occupation 8     Labourers                       
#>  9 dimension OCCP_C16 1        Occupation TOT   Total                           
#> 10 dimension OCCP_C16 1        Occupation Z     Inadequately described and Not ~
#> # ... with 2,998 more rows
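
To make the `datakey` string concrete, here is a hedged sketch of how one might assemble it from the dimension order. `build_datakey()` is a hypothetical helper, not part of readabs or of the interface shown above. The dimension order is assumed from the column order of the filtered output, and the `.` and `+` separators are assumed from the usual SDMX key convention.

```r
# Hypothetical helper for illustration; not part of readabs.
build_datakey <- function(dim_order, selection = list()) {
  unknown <- setdiff(names(selection), dim_order)
  if (length(unknown) > 0) {
    stop("Unknown dimension(s): ", paste(unknown, collapse = ", "))
  }
  parts <- vapply(dim_order, function(d) {
    codes <- selection[[d]]
    # An empty slot means "no filter on this dimension"
    if (is.null(codes)) "" else paste(codes, collapse = "+")
  }, character(1))
  paste(parts, collapse = ".")
}

# Assumed dimension order for ABS_C16_G49_SA (taken from the output columns)
dims <- c("OCCP_C16", "SEX_ABS", "QALLP_C16", "STATE", "REGIONTYPE", "ASGS_2016")

key_aus <- build_datakey(dims, list(ASGS_2016 = "0"))
key_sex <- build_datakey(dims, list(SEX_ABS = c("1", "2"), ASGS_2016 = "0"))
```

Under these assumptions `key_aus` reproduces the `".....0"` key used in the filtered example, and `key_sex` additionally restricts `SEX_ABS` to codes 1 and 2.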

baslat commented 2 years ago

Hi @kinto-b, thanks for sharing your code! I'm happy to work together on a combined solution if you like. If I understand your snippet correctly, it looks like you can get the list of available API datasets, which is cool!

MattCowgill commented 2 years ago

Thank you both, this is great! Sorry I haven't yet commented on your PR @baslat, I've been sick the last few days. Will review ASAP. Combining forces with @kinto-b seems sensible!

kinto-b commented 2 years ago

Stellar, I'll pop through a PR for you to browse and then we can decide on the best way to combine approaches.

MattCowgill commented 2 years ago

@baslat @kinto-b Sorry again for the delay. I haven't forgotten about this; various work + life things have just got in the way of a speedy review. I'll get to it ASAP. Thanks

baslat commented 2 years ago

> @baslat @kinto-b Sorry again for the delay - I haven't forgotten about this, various work + life things have just got in the way of a speedy review of this. I'll get to it ASAP. Thanks

No worries @MattCowgill. I had a look at @kinto-b's branch and I think it's probably a better candidate for merging, so I suggest we focus there.

MattCowgill commented 2 years ago

Hi @baslat, if @kinto-b's PR is the way forward (I'll take your word on that...), should we close this PR?

baslat commented 2 years ago

Yes please, I think it's a neater approach.

MattCowgill commented 2 years ago

No worries. Thanks for all your work on this, @baslat.