MattCowgill / readabs

Download and tidy time series data from the Australian Bureau of Statistics in R
https://mattcowgill.github.io/readabs/

Issue with 3401.0 #203

MattCowgill closed this issue 2 years ago

MattCowgill commented 2 years ago

Appears to be due to some wacky spreadsheet formatting decisions by the ABS.

[Two screenshots attached, taken 2022-06-16 at about 3:06 pm]

library(readabs)
#> Environment variable 'R_READABS_PATH' is unset. Downloaded files will be saved in a temporary directory.
#> You can set 'R_READABS_PATH' at any time. To set it for the rest of this session, use
#>  Sys.setenv(R_READABS_PATH = <path>)

# This fails
read_abs("3401.0", "1")
#> Finding URLs for tables corresponding to ABS catalogue 3401.0
#> Error in `vectbl_as_row_location()`:
#> ! Must subset rows with a valid subscript vector.
#> ℹ Logical subscripts must match the size of the indexed input.
#> ✖ Input has size 11 but subscript `match_tables(xml_dfs$TableTitle, tables)` has size 0.

# This works, despite this series ID being in that spreadsheet
read_abs_series("A85232555A")
#> Finding URLs for tables corresponding to ABS series ID
#> Attempting to download files from series ID , Overseas Arrivals and Departures, Australia
#> Downloading https://www.abs.gov.au/statistics/industry/tourism-and-transport/overseas-arrivals-and-departures-australia/latest-release/340101.xlsx
#> Extracting data from downloaded spreadsheets
#> Tidying data from imported ABS spreadsheets
#> # A tibble: 556 × 12
#>    table_no sheet_no table_title   date       series value series_type data_type
#>    <chr>    <chr>    <chr>         <date>     <chr>  <dbl> <chr>       <chr>    
#>  1 340101   Data1    Table 1: Tot… 1976-01-01 Numbe…  2750 Original    FLOW     
#>  2 340101   Data1    Table 1: Tot… 1976-02-01 Numbe…  2730 Original    FLOW     
#>  3 340101   Data1    Table 1: Tot… 1976-03-01 Numbe…  1940 Original    FLOW     
#>  4 340101   Data1    Table 1: Tot… 1976-04-01 Numbe…  1620 Original    FLOW     
#>  5 340101   Data1    Table 1: Tot… 1976-05-01 Numbe…  1890 Original    FLOW     
#>  6 340101   Data1    Table 1: Tot… 1976-06-01 Numbe…  1790 Original    FLOW     
#>  7 340101   Data1    Table 1: Tot… 1976-07-01 Numbe…  1640 Original    FLOW     
#>  8 340101   Data1    Table 1: Tot… 1976-08-01 Numbe…  1890 Original    FLOW     
#>  9 340101   Data1    Table 1: Tot… 1976-09-01 Numbe…  1890 Original    FLOW     
#> 10 340101   Data1    Table 1: Tot… 1976-10-01 Numbe…  1790 Original    FLOW     
#> # … with 546 more rows, and 4 more variables: collection_month <chr>,
#> #   frequency <chr>, series_id <chr>, unit <chr>

Created on 2022-06-16 by the reprex package (v2.0.1)

MattCowgill commented 2 years ago

Actual problem: the ABS has gone with "Table 1: Blah Blah" instead of "Table 1. Blah Blah" in the table titles, so matching on the table number comes back empty. :)
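
To illustrate the punctuation problem, here is a minimal base R sketch. It is not the readabs source: `match_strict()` and `match_tolerant()` are hypothetical helpers, and the titles are made-up placeholders. The assumption is that table titles are matched on a "Table <n>." prefix, which would miss a title that uses a colon.

```r
# Made-up table titles covering both punctuation styles
titles <- c(
  "Table 1. Blah Blah",   # full stop after the table number
  "Table 1: Blah Blah",   # colon after the table number, as seen in 3401.0
  "Table 10. Blah Blah"   # a different table whose number starts with 1
)

# Hypothetical strict matcher: only accepts "Table <n>."
match_strict <- function(titles, table_no) {
  pattern <- paste0("^Table ", table_no, "\\.")
  grepl(pattern, titles)
}

# Hypothetical tolerant matcher: accepts "." or ":" after the table number
match_tolerant <- function(titles, table_no) {
  pattern <- paste0("^Table ", table_no, "[.:]")
  grepl(pattern, titles)
}

strict_hits <- match_strict(titles, "1")
tolerant_hits <- match_tolerant(titles, "1")

print(strict_hits)
print(tolerant_hits)
```

The strict pattern skips the colon-style title, while the tolerant one picks it up. Both still exclude "Table 10", because the punctuation mark has to follow the table number directly.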