Col select for `get_dataset`

Moohan commented 2 years ago

Could get_dataset be amended to use the new col_select arguments?

In a lot of datasets, all the resources have basically the same structure, so they would lend themselves to column selection really nicely, e.g. gp-practice-contact-details-and-list-sizes.

There would be some issues with datasets whose resources differ from one another, e.g. geography-codes-and-labels, but I think it could still be useful if implemented well.

A separate but related issue is that get_dataset("gp-practice-contact-details-and-list-sizes") currently doesn't work, because some resources store HB as a string and others store it as a number.
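To make the request concrete, here is a rough sketch of the idea in base R. It is not the package's code: the two data frames are made-up stand-ins for resources of one dataset, `select_and_bind()` is a hypothetical helper, and every column name other than HB is an assumption for illustration.

```r
# Stand-ins for two resources of one dataset (made-up values, same layout).
resources <- list(
  data.frame(PracticeCode = c(10002, 10017),
             HB = c("A", "A"),
             PracticeListSize = c(7500, 3200),
             stringsAsFactors = FALSE),
  data.frame(PracticeCode = c(10002, 10017),
             HB = c("A", "A"),
             PracticeListSize = c(7600, 3150),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: keep only the requested columns from each resource,
# then stack the results into a single data frame.
select_and_bind <- function(resources, col_select) {
  selected <- lapply(resources, function(res) {
    res[, intersect(col_select, names(res)), drop = FALSE]
  })
  do.call(rbind, selected)
}

combined <- select_and_bind(resources, c("PracticeCode", "PracticeListSize"))
```

If one resource lacked a requested column, the final `rbind()` would fail because the column sets no longer match, which is the kind of problem expected for datasets like geography-codes-and-labels.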

daikman commented 2 years ago

@Moohan Good idea to add col_select!

And thanks for spotting this problem with the column data types. In this case it is actually a problem with the data rather than the function, as they should be the same type. However, just to be safe I've now added a fix that coerces columns with inconsistent data types to character, and warns the user about it.
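For illustration only, the coercion could look something like the base R sketch below. This is an assumption about the general approach, not the code on the branch: `coerce_mismatched()` is a hypothetical helper, and the two data frames are made-up stand-ins for resources where HB is numeric in one and character in the other.

```r
# Two stand-in resources: HB is numeric in one and character in the other.
res_a <- data.frame(HB = c(1, 2), ListSize = c(100, 200))
res_b <- data.frame(HB = c("A", "B"), ListSize = c(150, 250),
                    stringsAsFactors = FALSE)

# Hypothetical helper: find shared columns whose class differs between
# resources, coerce those columns to character in every resource, and warn.
coerce_mismatched <- function(resources) {
  cols <- Reduce(intersect, lapply(resources, names))
  mismatched <- Filter(function(col) {
    classes <- vapply(resources, function(res) class(res[[col]])[1],
                      character(1))
    length(unique(classes)) > 1
  }, cols)

  # Nothing to fix: return the resources unchanged.
  if (length(mismatched) == 0) {
    return(resources)
  }

  warning("Coercing columns with inconsistent types to character: ",
          paste(mismatched, collapse = ", "))

  lapply(resources, function(res) {
    res[mismatched] <- lapply(res[mismatched], as.character)
    res
  })
}

# Emits a warning naming HB, after which the resources can be combined.
fixed <- coerce_mismatched(list(res_a, res_b))
combined <- do.call(rbind, fixed)
```

Coercing to character loses the numeric type, but no values are dropped and the warning tells the user which columns to check.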

There's now a branch for this issue

Moohan commented 2 years ago

Thanks, nice work with the column fix. You could probably use it to loop through all the datasets and spit out a list of those where some resources have inconsistent column types, which could possibly be fixed!

I couldn't resist doing that so I ran some code and pulled a list of the datasets with column mismatch issues:

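library(phsopendata)

# List every dataset name (uses an internal, unexported helper)
content <- phsopendata:::phs_GET("package_list", "")
dataset_names <- unlist(content$result)

# Return NA instead of stopping when get_dataset() throws an error
check_data_set <- purrr::possibly(get_dataset, NA)

# Try each dataset with rows = 1 to keep the downloads small;
# naming the input first avoids needing the %>% pipe to be loaded
check <- purrr::map(
  purrr::set_names(dataset_names),
  ~ check_data_set(.x, rows = 1)
)

# Names of the datasets that failed
names(check[is.na(check)])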

Ones with issues:

 [1] "annual-cancer-incidence"
 [2] "births-in-scottish-hospitals"
 [3] "cancer-mortality"
 [4] "care-home-census"
 [5] "covid-19-positive-cases-in-pregnancy-in-scotland"
 [6] "dental-practices-and-patient-registrations"
 [7] "geography-codes-and-labels"
 [8] "gp-practice-contact-details-and-list-sizes"
 [9] "learning-disability-inpatient-activity"
[10] "long-acting-reversible-methods-of-contraception-larc-in-scotland"
[11] "mental-health-inpatient-activity"
[12] "nhsscotland-payments-to-general-practice"
[13] "primary-1-body-mass-index-bmi-statistics"
[14] "scottish-suicide-information-database-contact-with-unscheduled-care-services-prior-to-death"
[15] "teenage-pregnancy"
[16] "termination-of-pregnancy-in-scotland"
[17] "weekly-covid-19-statistical-data-in-scotland"

Moohan commented 2 years ago

> There's now a branch for this issue

@daikman This looks OK to me. Could you open a PR, or were there other changes you wanted to make?

daikman commented 2 years ago

@Moohan I've opened a pull request for this now. Happy to merge whenever.

Pull request: Public-Health-Scotland/phsopendata, Col select for `get_dataset` #8