AndyMcAliley / ds-pipelines-targets-2

https://lab.github.com/USGS-R/usgs-targets-tips-and-tricks
0 stars 0 forks source link

Refactor targets #6

Closed AndyMcAliley closed 3 years ago

AndyMcAliley commented 3 years ago

I refactored the pipeline to make it more robust and straightforward. The major changes are

  1. Download each NWIS site as a separate target.
  2. Simplify processing by combining smaller steps into one. See #5
AndyMcAliley commented 3 years ago
hcorson-dosch-usgs commented 3 years ago
  • tar_make() often failed because the NWIS site downloads often failed. Creating a separate target for each site download helped ensure success more often.

👍 - good thinking

  • There are five sites downloaded. I hardcoded each site's ID into its corresponding tar_target() call. There's probably a more scalable way to do that...

I'll make a note in your code directly, but you could pull those site ids out as an argument, rather than extracting them from the filename, though your code will function fine as is. As for scaling... I like how you're thinking! There is a more scalable way to download this data, which we'll cover in pipelines-III 😊

  • I wrote a new function to read the data from all downloaded csvs into one data.frame. I copied code out of download_data() to do this.

    • Again, hardcoded site IDs make me think there's a better way.

Nice - this works well. You'll have to hold out until pipelines-III to learn about scaling pipelines! For now, listing all of this file targets as input to the downstream combined dataframe target is the best approach.

  • The site_info_csv target is untouched. I considered wrapping this into site_data_annotated, but it seems like a useful file to create all on its own.

Agreed. I think this is good to have as a separate target.

  • I combined process_data, annotate_data, and style_data into a single function, process_data.

Nice! Much more efficient this way.

  • I'm not sure why style_data worked in the first place; it didn't call return()!

Ooh another R vs. Python difference! In R, the return value of a function is either the value of the last expression in the function, if no call to return() is present or the argument to a call to return(). So style_data() automatically returned the dataframe resulting from the mutate() call. Different people on our team have different styles, so it's up to you if you want to explicitly state what the function is returning or not. I typically do, since it's a nice confirmation for myself of what I want the function to return.

  • I set show_col_types = FALSE when reading the site_info_csv to clean up the output during execution of the pipeline.

👍 Sounds good - that output can get annoying. You can also either specify the col_types, if you know them, e.g., `col_types = 'Dcdllc' for date, character, double, logical, logical, character, or set col_types = cols() to accept the column type assignments that R thinks are correct for each column. Sometimes for site numbers we do need to specify the column type as text, so that that leading zero isn't dropped.

AndyMcAliley commented 3 years ago

I re-requested a review - hopefully that's the right way to proceed. I wonder if it's common to re-request reviews after making changes in everyday practice?

hcorson-dosch-usgs commented 3 years ago

I think it's nice, b/c then the reviewer knows when you're ready for them to review any changes you've made.

github-learning-lab[bot] commented 3 years ago


When you are done poking around, check out the next issue.