DOI-USGS / lake-temperature-model-prep

Draft for case_when/purrr-based multi-file workflow #318

Closed jordansread closed 2 years ago

jordansread commented 2 years ago

Example multi-file parse with purrr and case_when.

I made a few changes here to show an alternative workflow that uses a tibble instead of separate vectors of file names. I think this is easier to understand, because you can look at files_tbl and see which function each file will be dispatched to:

files_tbl
# A tibble: 105 × 2
   filepath                                                                                                                                                                    func_name             
   <chr>                                                                                                                                                                       <chr>                 
 1 /var/folders/b1/j14cjj994hxbp82qv23tj27jc644jk/T//Rtmp2wpnBh/UniversityofMissouri_LimnoProfiles_2017-2020/UniversityofMissouri_2017_Profiles/030_05_31_2017 cleaned.csv     read_files_2017       
 2 /var/folders/b1/j14cjj994hxbp82qv23tj27jc644jk/T//Rtmp2wpnBh/UniversityofMissouri_LimnoProfiles_2017-2020/UniversityofMissouri_2017_Profiles/030_06_20_2017 cleaned.csv     read_files_2017       
 3 /var/folders/b1/j14cjj994hxbp82qv23tj27jc644jk/T//Rtmp2wpnBh/UniversityofMissouri_LimnoProfiles_2017-2020/UniversityofMissouri_2017_Profiles/030_08_01_2017 FNU cleaned.csv read_files_2017       
 4 /var/folders/b1/j14cjj994hxbp82qv23tj27jc644jk/T//Rtmp2wpnBh/UniversityofMissouri_LimnoProfiles_2017-2020/UniversityofMissouri_2017_Profiles/030_09_13_2017 HW.csv          read_files_2017_hw    
 5 /var/folders/b1/j14cjj994hxbp82qv23tj27jc644jk/T//Rtmp2wpnBh/UniversityofMissouri_LimnoProfiles_2017-2020/UniversityofMissouri_2017_Profiles/089_08_04_2017 HW.csv          read_files_2017_hw    
 6 /var/folders/b1/j14cjj994hxbp82qv23tj27jc644jk/T//Rtmp2wpnBh/UniversityofMissouri_LimnoProfiles_2017-2020/UniversityofMissouri_2017_Profiles/089_08_17_2017 HW.csv          read_files_2017_hw    
 7 /var/folders/b1/j14cjj994hxbp82qv23tj27jc644jk/T//Rtmp2wpnBh/UniversityofMissouri_LimnoProfiles_2017-2020/UniversityofMissouri_2017_Profiles/089_08_23_2017 HW.csv          read_files_2017_hw    
 8 /var/folders/b1/j14cjj994hxbp82qv23tj27jc644jk/T//Rtmp2wpnBh/UniversityofMissouri_LimnoProfiles_2017-2020/UniversityofMissouri_2017_Profiles/089_09_27_2017 HW.csv          read_files_2017_hw    
 9 /var/folders/b1/j14cjj994hxbp82qv23tj27jc644jk/T//Rtmp2wpnBh/UniversityofMissouri_LimnoProfiles_2017-2020/UniversityofMissouri_2017_Profiles/092_08_02_2017 HW.csv          read_files_2017_092_hw
10 /var/folders/b1/j14cjj994hxbp82qv23tj27jc644jk/T//Rtmp2wpnBh/UniversityofMissouri_LimnoProfiles_2017-2020/UniversityofMissouri_2017_Profiles/092_08_17_2017 HW.csv          read_files_2017_hw    
# … with 95 more rows

I also modified the functions so they all return data.frames/tibbles with the same shape. The prior ones had some differences, which is why all_hw_files needed a bind_rows of dat_2017_hw, dat_2017_hw_092, and dat_2018_hw followed by a mutate. I also collapsed a few things with shared arguments to make some of the processing functions a little more generic (for example, using get() to pull a column name from the function argument and use it in the mutate call).
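As a rough illustration of the get() pattern mentioned above, here is a minimal sketch. The function `standardize_temp`, its `temp_col` argument, and the column names are hypothetical and are not the ones used in this PR.

```r
library(dplyr)

# Hypothetical helper: `temp_col` is the name (as a string) of the column
# that holds temperature in this particular file layout.
standardize_temp <- function(data, temp_col) {
  data %>%
    # get() looks the string up as a column inside mutate()
    mutate(temp = get(temp_col)) %>%
    select(depth, temp)
}

# Two made-up inputs whose temperature columns are named differently
a <- tibble(depth = c(0, 1), Temp_C = c(20.1, 19.8))
b <- tibble(depth = c(0, 1), `Temp (C)` = c(21.0, 20.5))

# Both come back with the same shape, so they bind without extra cleanup
result <- bind_rows(
  standardize_temp(a, "Temp_C"),
  standardize_temp(b, "Temp (C)")
)
```

Because every reader returns the same columns, the bind_rows step needs no follow-up mutate.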

I used case_when so that the logic for which file gets which handling function is clear and lives in one place.
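A minimal sketch of that kind of dispatch table is below. The shortened file paths and the matching patterns are assumptions made up for illustration; the actual rules in this PR may differ.

```r
library(dplyr)

# Shortened, hypothetical paths loosely modeled on the listing above
files_tbl <- tibble(
  filepath = c(
    "2017_Profiles/030_05_31_2017 cleaned.csv",
    "2017_Profiles/030_09_13_2017 HW.csv",
    "2017_Profiles/092_08_02_2017 HW.csv"
  )
) %>%
  mutate(func_name = case_when(
    # case_when() uses the first condition that matches, so the most
    # specific rule goes first
    grepl("092_08_02_2017 HW", filepath, fixed = TRUE) ~ "read_files_2017_092_hw",
    grepl("2017 HW.csv$", filepath) ~ "read_files_2017_hw",
    grepl("2017", filepath) ~ "read_files_2017",
    # anything unmatched is left as NA so it is easy to spot
    TRUE ~ NA_character_
  ))
```

Leaving unmatched files as NA is one possible design choice: a quick filter on `is.na(func_name)` would then show any file that has no handler yet.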

Lastly, I used purrr::pmap to map over the rows of the data.frame and expose both the file name and the function handler. Inside that, the function is called with exec, which is basically a way to call a function when it is given as a string instead of an object.
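The row-wise call can be sketched roughly as follows. The two handlers are hypothetical stand-ins (the real ones read and clean CSV files), and exec is written as `rlang::exec` here only to make its origin explicit.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in handlers that just record what they were given
read_files_a <- function(filepath) tibble(file = filepath, handler = "a")
read_files_b <- function(filepath) tibble(file = filepath, handler = "b")

files_tbl <- tibble(
  filepath = c("one.csv", "two.csv"),
  func_name = c("read_files_a", "read_files_b")
)

# pmap() passes each row's columns as named arguments to the function;
# exec() then finds the handler by its name and calls it with the file path
all_dat <- pmap(files_tbl, function(filepath, func_name) {
  rlang::exec(func_name, filepath)
}) %>%
  bind_rows()
```

Since pmap matches columns to arguments by name, the anonymous function's arguments need to match the column names of files_tbl.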

jordansread commented 2 years ago

Oh yeah, and I checked that the old and the new file are identical.

They actually aren't, but that is because the order of the rows is a little different (probably because of the tibble order). After accounting for that, they are the same:

all.equal(
  readRDS('~/Downloads/UniversityofMissouri_LimnoProfiles_2017_2020_OLD.rds') %>%
    arrange(DateTime, time, depth, Missouri_ID),
  readRDS('7a_temp_coop_munge/tmp/UniversityofMissouri_LimnoProfiles_2017_2020.rds') %>%
    arrange(DateTime, time, depth, Missouri_ID)
)
[1] TRUE
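
For anyone who wants to see why the sort matters, here is a small self-contained sketch with made-up data (not the Missouri files). It assumes the sort key orders the rows uniquely, which is why the check above arranges on several columns.

```r
library(dplyr)

# Made-up data: `new` has the same rows as `old` in a different order
old <- tibble(depth = c(0, 1, 2), temp = c(20, 19, 18))
new <- old[c(3, 1, 2), ]

# A strict comparison fails because the row order differs
same_before <- identical(old, new)

# Sorting both on the same key removes the ordering difference
same_after <- isTRUE(all.equal(arrange(old, depth), arrange(new, depth)))
```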