Closed AndyMcAliley closed 3 years ago
I refactored the pipeline to make it more robust and straightforward. The major changes are:

- `tar_make()` often failed because the NWIS site downloads often failed. Creating a separate target for each site download helped ensure success more often.
- There are five sites downloaded. I hardcoded each site's ID into its corresponding `tar_target()` call. There's probably a more scalable way to do that...
- I wrote a new function to read the data from all downloaded csvs into one `data.frame`. I copied code out of `download_data()` to do this. Again, hardcoded site IDs make me think there's a better way.
- The `site_info_csv` target is untouched. I considered wrapping this into `site_data_annotated`, but it seems like a useful file to create all on its own.
- I combined `process_data`, `annotate_data`, and `style_data` into a single function, `process_data`.
- I'm not sure why `style_data` worked in the first place; it didn't call `return()`!
- I set `show_col_types = FALSE` when reading the `site_info_csv` to clean up the output during execution of the pipeline.
- `tar_make()` often failed because the NWIS site downloads often failed. Creating a separate target for each site download helped ensure success more often.
👍 - good thinking
- There are five sites downloaded. I hardcoded each site's ID into its corresponding `tar_target()` call. There's probably a more scalable way to do that...
I'll make a note in your code directly, but you could pull those site ids out as an argument, rather than extracting them from the filename, though your code will function fine as is. As for scaling... I like how you're thinking! There is a more scalable way to download this data, which we'll cover in pipelines-III 😊
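A minimal sketch of that suggestion, passing the site number into each target as an argument instead of parsing it from the file name. The target names, the `download_site_data()` function, and the site numbers here are illustrative, not the actual pipeline code:

```r
# In _targets.R: one tar_target() per site (names and function are
# illustrative sketches, not the real pipeline code).
library(targets)

list(
  tar_target(
    site_01427207_csv,
    download_site_data(site_no = "01427207",
                       out_file = "1_fetch/out/site_01427207.csv"),
    format = "file"
  ),
  tar_target(
    site_01435000_csv,
    download_site_data(site_no = "01435000",
                       out_file = "1_fetch/out/site_01435000.csv"),
    format = "file"
  )
  # ...one tar_target() per remaining site
)
```

Because the site number is an explicit argument, the download function never has to reverse-engineer it from the output path.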
- I wrote a new function to read the data from all downloaded csvs into one `data.frame`. I copied code out of `download_data()` to do this.
- Again, hardcoded site IDs make me think there's a better way.
Nice - this works well. You'll have to hold out until pipelines-III to learn about scaling pipelines! For now, listing all of these file targets as inputs to the downstream combined dataframe target is the best approach.
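One hedged sketch of what such a combiner function could look like, assuming all of the downloaded csvs share the same columns (the function name is illustrative):

```r
# Sketch: read every downloaded site csv and stack the results into
# one data.frame. Assumes all files share the same column layout.
combine_site_files <- function(site_files) {
  # show_col_types = FALSE quiets readr's column-type message per file
  data_list <- lapply(site_files, readr::read_csv, show_col_types = FALSE)
  do.call(rbind, data_list)
}
```

In the pipeline, the vector of per-site file targets would be passed in as `site_files`, so the combined target rebuilds whenever any single download changes.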
- The `site_info_csv` target is untouched. I considered wrapping this into `site_data_annotated`, but it seems like a useful file to create all on its own.
Agreed. I think this is good to have as a separate target.
- I combined `process_data`, `annotate_data`, and `style_data` into a single function, `process_data`.
Nice! Much more efficient this way.
- I'm not sure why `style_data` worked in the first place; it didn't call `return()`!
Ooh, another R vs. Python difference! In R, the return value of a function is either the value of the last expression evaluated in the function (if no call to `return()` is present) or the argument to a call to `return()`. So `style_data()` automatically returned the dataframe resulting from the `mutate()` call. Different people on our team have different styles, so it's up to you whether you want to explicitly state what the function returns. I typically do, since it's a nice confirmation for myself of what I want the function to return.
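For example, a minimal sketch of the two equivalent styles (the function bodies and column name are illustrative, not the actual `style_data()`):

```r
library(dplyr)

# Implicit return: the value of the last expression (the mutate() call)
# is what the function returns, even without return().
style_implicit <- function(site_data) {
  mutate(site_data, site_name = toupper(site_name))
}

# Explicit return: behaves identically, but states the intent.
style_explicit <- function(site_data) {
  styled <- mutate(site_data, site_name = toupper(site_name))
  return(styled)
}
```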
- I set `show_col_types = FALSE` when reading the `site_info_csv` to clean up the output during execution of the pipeline.
👍 Sounds good - that output can get annoying. You can also either specify the col_types if you know them, e.g., `col_types = 'Dcdllc'` for date, character, double, logical, logical, character, or set `col_types = cols()` to accept the column type assignments that R thinks are correct for each column. Sometimes for site numbers we do need to specify the column type as text, so that the leading zero isn't dropped.
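For instance, a sketch of both options (the file path and column names here are illustrative):

```r
# Option 1: pin down the column that matters and let readr guess the rest.
site_info <- readr::read_csv(
  "site_info.csv",
  col_types = readr::cols(
    site_no  = readr::col_character(),  # keep leading zeros in site numbers
    .default = readr::col_guess()
  )
)

# Option 2: accept readr's guesses for every column, without the message.
site_info <- readr::read_csv("site_info.csv", col_types = readr::cols())
```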
I re-requested a review - hopefully that's the right way to proceed. I wonder if it's common to re-request reviews after making changes in everyday practice?
I think it's nice, b/c then the reviewer knows when you're ready for them to review any changes you've made.