DOI-USGS / ds-pipelines-targets-example-wqp

An example targets pipeline for pulling data from the Water Quality Portal (WQP)

Convert missing value strings to NA #90

Closed lekoenig closed 2 years ago

lekoenig commented 2 years ago

This PR adds a step to format_columns() that converts missing value strings (e.g., "", " ", and "<Blank>") to NA. This change should not affect the number of records in the harmonized data set or the contents of p3_wqp_records_summary_csv.
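
For readers who want to see the idea in isolation, here is a minimal base R sketch of this kind of conversion. It is not the code added to format_columns() in this PR; the helper name convert_missing_to_na(), its missing_strings argument, and the example data frame are hypothetical and used only for illustration.

```r
# Hypothetical helper (not the PR's actual code): replace sentinel strings
# with NA in every character column, leaving other columns untouched.
convert_missing_to_na <- function(data, missing_strings = c("", " ", "<Blank>")) {
  is_chr <- vapply(data, is.character, logical(1))
  if (!any(is_chr)) {
    return(data)
  }
  data[is_chr] <- lapply(data[is_chr], function(col) {
    # %in% returns FALSE for values that are already NA, so they stay NA
    col[col %in% missing_strings] <- NA_character_
    col
  })
  data
}

# Made-up example data; only the column names echo WQP-style fields
example <- data.frame(
  ActivityCommentText = c("sampled at dawn", " ", "<Blank>", ""),
  ResultMeasureValue = c(1.2, 3.4, 5.6, 7.8),
  stringsAsFactors = FALSE
)

cleaned <- convert_missing_to_na(example)
```

Because values are replaced in place and no rows are filtered, the row count is unchanged, which is consistent with the expectation that the number of records should not change.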

Here is the output I see before (top) and after (bottom) making these changes:

> # Before making changes in this PR:
> tar_load(p3_wqp_data_aoi_formatted)
> which(p3_wqp_data_aoi_formatted == " ", arr.ind = TRUE) %>% head(3) # col 28 is "ActivityCommentText"
       row col
[1,] 12104  28
[2,] 12105  28
[3,] 12106  28
> which(p3_wqp_data_aoi_formatted == " ", arr.ind = TRUE) %>% dim()
[1] 634   2
> which(p3_wqp_data_aoi_formatted == "", arr.ind = TRUE) %>% dim()
[1] 0 2
> which(p3_wqp_data_aoi_formatted == "<Blank>", arr.ind = TRUE) %>% dim()
[1] 0 2
> 
> # After running tar_make() with changes in this PR:
> tar_load(p3_wqp_data_aoi_formatted)
> which(p3_wqp_data_aoi_formatted == " ", arr.ind = TRUE) %>% head(3)
     row col
> which(p3_wqp_data_aoi_formatted == " ", arr.ind = TRUE) %>% dim()
[1] 0 2
> which(p3_wqp_data_aoi_formatted == "", arr.ind = TRUE) %>% dim()
[1] 0 2
> which(p3_wqp_data_aoi_formatted == "<Blank>", arr.ind = TRUE) %>% dim()
[1] 0 2
>

lekoenig commented 2 years ago

I really wish na_if() would accept multiple potential values, but see https://github.com/tidyverse/dplyr/issues/1972.
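
For context, na_if(x, y) compares x with y position by position, so y has to be a single value or a vector the same size as x. It is not treated as a set of candidate values. One possible workaround is a small set-based helper built on %in%. The sketch below is an assumption for illustration only: na_if_in() is a hypothetical name and not part of dplyr.

```r
# Hypothetical helper (not a dplyr function): set any element of x that
# matches one of `values` to NA, keeping the type of x.
na_if_in <- function(x, values) {
  x[x %in% values] <- NA
  x
}

missing_strings <- c("", " ", "<Blank>")
comments <- c("ok", "", " ", "<Blank>", NA)

result <- na_if_in(comments, missing_strings)
# result is c("ok", NA, NA, NA, NA)
```

A helper like this could also be applied column-wise, for example inside mutate(across(where(is.character), ~ na_if_in(.x, missing_strings))).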

Agree, thanks for linking to that issue!