MattCowgill / readabs

Download and tidy time series data from the Australian Bureau of Statistics in R
https://mattcowgill.github.io/readabs/

Erroneous conversion to numeric of ANZSIC codes from read_api #249

Closed: kletts closed this issue 4 months ago

kletts commented 4 months ago

Thanks for a great package, in combination with the Data Explorer it has massively improved how I work with ABS data.

The labelling function for read_api coerces number-like codes to numeric, and this appears to result in labels not being applied. This is a particular problem for ABS classification structures such as ANZSIC, where a code with a leading zero identifies a specific category: here, 01 is the agriculture subdivision of agriculture, forestry and fishing.

The following extract is a reproducible example. Labels are missing for industries 01 and 02 but provided for 12, and the code values have been converted to 1, 2 and 12:

readabs::read_api(
  id="AUSTRALIAN_INDUSTRY", 
  datakey=list(measure="INDUSTRY", 
              industry=c("01", "02", "12"), 
              basis="1", 
              region="AUS", 
              freq="A"), 
  start_period="2023")

The question worth discussing is whether number-like values in code lists should ever be coerced to numeric. It seems reasonable that, in general, codes are not numbers, but the ABS does use some codes as numbers, for example unit_mult:

subset(readabs::read_api_datastructure(id="AUSTRALIAN_INDUSTRY"), var=="unit_mult")

My personal opinion is that the safer practice is to keep all codes as character: coercion by the user, where required, is easy to perform, whereas reversing an erroneous conversion is a big and messy job.
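To illustrate why the conversion is hard to reverse, here is a minimal base R sketch. The code vector is hypothetical (it mixes two-digit and three-digit codes purely for illustration) and is not taken from the ABS data above.

```r
# Hypothetical codes of mixed width, stored as character
codes <- c("01", "02", "12", "011")

# Going from character to numeric is a single call
as_numbers <- as.integer(codes)

# Going back requires guessing the original width. Padding to two
# digits restores "01" and "02" but turns the original "011" into "11".
restored <- sprintf("%02d", as_numbers)

# The round trip does not recover the original codes
identical(restored, codes)
```

Once the leading zeros are gone, nothing in the numeric values says how wide each code originally was, so any padding rule is a guess.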

Christian

MattCowgill commented 4 months ago

Hi Christian @kletts, Thanks very much for your kind comments and for opening this issue. I agree that the problem you've identified needs to be rectified. I also agree with your opinion at the end - better to not coerce to numeric, even if this may create extra work for the user in some circumstances, than to coerce to numeric when that's not appropriate.

I have limited scope to fix this issue right now. I'd be happy to review a PR; otherwise I'll get to this when I can.

kletts commented 4 months ago

Cool, I've raised a PR for you with a proposed change. I had thought the coercion was in the abs_api_label_data function, but it turns out to happen upstream, when read.csv parses the raw download.
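As an illustration of the read.csv behaviour described here, the sketch below shows the default type guessing and one way to switch it off. The CSV text and column names are made up for the example, and this is not necessarily the change made in the PR.

```r
# Hypothetical CSV text standing in for a raw API download
csv_text <- "INDUSTRY,OBS_VALUE\n01,10.5\n02,20.1\n12,30.7"

# By default read.csv guesses column types, so the codes are read as
# integers and the leading zeros are dropped
guessed <- read.csv(text = csv_text)
class(guessed$INDUSTRY)

# Reading every column as character keeps the codes exactly as supplied
kept <- read.csv(text = csv_text, colClasses = "character")
kept$INDUSTRY

# Columns that really are numeric can then be converted explicitly
kept$OBS_VALUE <- as.numeric(kept$OBS_VALUE)
```

With this approach the user decides which columns are numbers, rather than the CSV parser.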

MattCowgill commented 4 months ago

Thanks so much @kletts !

MattCowgill commented 4 months ago

This is in master now, @kletts. Thanks again

kletts commented 4 months ago

Thanks Matt for the prompt update