MattCowgill / readabs

Download and tidy time series data from the Australian Bureau of Statistics in R
https://mattcowgill.github.io/readabs/
Other
101 stars 22 forks source link

Issue with cat_no in read_abs_local #220

Closed xIncubakerx closed 1 year ago

xIncubakerx commented 2 years ago

I am often working in an offline environment where I have access to the ABS Tables (not downloaded through read_abs) and need to read them in locally. At the moment I am able to do this by specifying each of the individual table names in the filenames argument, but I get an error if I try specifying the catalogue number using cat_no.

For instance, if I run read_abs_local(cat_no = "6202.0"), I get the following error:

Error: Sheet 'Index' not found

Note: I specify R_READABS_PATH using Sys.setenv(R_READABS_PATH = <path>) as recommended in the documentation, but I also get the same error if I specify path = Sys.getenv("R_READABS_PATH", unset = <path>)

MattCowgill commented 2 years ago

@xIncubakerx Sorry for the delay in getting back to you.

Can you please share the code you use in a reproducible example? Just so that I can figure out at what point of the process it's going wrong...

MattCowgill commented 2 years ago

For example, this code works as expected for me. Does it work for you?

Sys.setenv(R_READABS_PATH = file.path("data", "abs"))

library(readabs)

# Download all of 6345.0

read_abs("6345.0", check_local = F)
#> Finding URLs for tables corresponding to ABS catalogue 6345.0
#> Attempting to download files from catalogue 6345.0, Wage Price Index, Australia
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634501.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634502a.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634502b.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634503a.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634503b.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634504a.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634504b.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634505a.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634505b.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634507a.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634507b.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634508a.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634508b.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634509a.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/634509b.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/63450table2ato9a.xlsx
#> Downloading https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/wage-price-index-australia/latest-release/63450table2bto9b.xlsx
#> Extracting data from downloaded spreadsheets
#> Tidying data from imported ABS spreadsheets
#> # A tibble: 63,950 × 12
#>    table_no sheet_no table_title date       series value serie…¹ data_…² colle…³
#>    <chr>    <chr>    <chr>       <date>     <chr>  <dbl> <chr>   <chr>   <chr>  
#>  1 634501   Data1    Table 1. T… 1997-09-01 Quart…  67.4 Origin… INDEX   3      
#>  2 634501   Data1    Table 1. T… 1997-09-01 Quart…  64.7 Origin… INDEX   3      
#>  3 634501   Data1    Table 1. T… 1997-09-01 Quart…  66.7 Origin… INDEX   3      
#>  4 634501   Data1    Table 1. T… 1997-09-01 Quart…  67.3 Season… INDEX   3      
#>  5 634501   Data1    Table 1. T… 1997-09-01 Quart…  64.8 Season… INDEX   3      
#>  6 634501   Data1    Table 1. T… 1997-09-01 Quart…  66.6 Season… INDEX   3      
#>  7 634501   Data1    Table 1. T… 1997-09-01 Quart…  67.3 Trend   INDEX   3      
#>  8 634501   Data1    Table 1. T… 1997-09-01 Quart…  64.8 Trend   INDEX   3      
#>  9 634501   Data1    Table 1. T… 1997-09-01 Quart…  66.7 Trend   INDEX   3      
#> 10 634501   Data1    Table 1. T… 1997-09-01 Perce…  NA   Origin… PERCENT 3      
#> # … with 63,940 more rows, 3 more variables: frequency <chr>, series_id <chr>,
#> #   unit <chr>, and abbreviated variable names ¹​series_type, ²​data_type,
#> #   ³​collection_month

# Read local

read_abs_local("6345.0")
#> # A tibble: 63,950 × 12
#>    table_no sheet_no table_title date       series value serie…¹ data_…² colle…³
#>    <chr>    <chr>    <chr>       <date>     <chr>  <dbl> <chr>   <chr>   <chr>  
#>  1 634501   Data1    Table 1. T… 1997-09-01 Quart…  67.4 Origin… INDEX   3      
#>  2 634501   Data1    Table 1. T… 1997-09-01 Quart…  64.7 Origin… INDEX   3      
#>  3 634501   Data1    Table 1. T… 1997-09-01 Quart…  66.7 Origin… INDEX   3      
#>  4 634501   Data1    Table 1. T… 1997-09-01 Quart…  67.3 Season… INDEX   3      
#>  5 634501   Data1    Table 1. T… 1997-09-01 Quart…  64.8 Season… INDEX   3      
#>  6 634501   Data1    Table 1. T… 1997-09-01 Quart…  66.6 Season… INDEX   3      
#>  7 634501   Data1    Table 1. T… 1997-09-01 Quart…  67.3 Trend   INDEX   3      
#>  8 634501   Data1    Table 1. T… 1997-09-01 Quart…  64.8 Trend   INDEX   3      
#>  9 634501   Data1    Table 1. T… 1997-09-01 Quart…  66.7 Trend   INDEX   3      
#> 10 634501   Data1    Table 1. T… 1997-09-01 Perce…  NA   Origin… PERCENT 3      
#> # … with 63,940 more rows, 3 more variables: frequency <chr>, series_id <chr>,
#> #   unit <chr>, and abbreviated variable names ¹​series_type, ²​data_type,
#> #   ³​collection_month

Created on 2022-10-11 by the reprex package (v2.0.1)

xIncubakerx commented 2 years ago

No worries @MattCowgill, thanks for getting back to me!

Yes, the code that you have provided there works just fine for me. I believe the issue is arising in instances where I have tried to import ABS spreadsheets downloaded directly from the ABS website (without using the readabs command that you ran in the middle). Perhaps this is due to the fact that the .fst file is not generated when I'm getting the spreadsheets from another source?

I am running the following commands with the correct data in the abs folder and I'm still getting the "Error: Sheet 'Index' not found" error. Sys.setenv(R_READABS_PATH = file.path("data", "abs")) read_abs_local("6202.0")

I've also tried specifiying use_fst = F but that also produces the same error.

It's all very strange...

MattCowgill commented 2 years ago

Hi @xIncubakerx, to be clear: are the files you've downloaded manually in the data/abs subdirectory of your working directory?

xIncubakerx commented 2 years ago

@MattCowgill yep the files are definitely there and it's looking in the right place for them. If I incorrectly specify the location, I get a "Could not find any .xls or .xlsx files in path: xxxx" error.

Since I'm having issues with cat_no at the moment, I'm currently getting around it by running this (using regex from stringr), although it seems that when you specify filenames that you need to add the catalogue number onto the R_READABS_PATH Sys.setenv(R_READABS_PATH = file.path("data", "abs","6202.0")) read_abs_local(filenames = list.files(path = "data/abs/6202.0", pattern = regex("^6202")))

MattCowgill commented 1 year ago

@xIncubakerx I'm afraid I still don't understand this issue and how the problem could be arising.

rowanad commented 1 year ago

@xIncubakerx What operating system are you using? It might help narrow down the issue.

Also, can you try renaming the "6202.0" directory to "6202" or similar? I am wondering if this could be an issue related to there being a period in the folder name "6202.0". I haven't looked super thoroughly at the source code, but I notice that the method you used when the error arose had the R_READABS_PATH as data/abs/ while it worked for you when R_READABS_PATH was data/abs/6202.0/. Somewhere in the chain of "readabs > readxl > underlying c/c++ binary" I think that maybe a library is being passed a pathname with this period in it and assuming that everything after the period is the file extension?

xIncubakerx commented 1 year ago

@MattCowgill and @rowanad Apologies it has taken me so long to get back to you, I was away and then this dropped off of my radar when I got back. I appreciate your efforts to assist me with this!

I believe that I have figured out the source of the problem. The folder that I was attempting to read the 6202 files from also contained the GM1 pivot table (which is also a part of the 6202 release), as I was reading this in separately as a part of my analysis. When read_abs_local tries to read the spreadsheets from the 6202 folder, it tries to read all of the .xlsx files within that folder. This causes a problem when it gets to GM1 because it doesn't have an Index tab and doesn't know how to handle this, hence the "Error: Sheet 'Index' not found" error. I was mistakenly thinking that read_abs_local would only read the spreadsheets in the folder that have the cat_no prefix (6202001.xlsx etc). On reflection, this was a pretty silly one for me to have missed, I'm sorry for taking up your time with this!

As soon as I removed GM1 from that folder, the code worked exactly as expected. That said, I'll probably just keep using the work around that I posted above so that I can leave GM1 where it is.

MattCowgill commented 1 year ago

Ah excellent @xIncubakerx, glad to know the problem is resolved. On reflection, read_abs_local() should be smart enough to just skip files like GM1.