Refactor the existing pipeline to use more effective targets

github-learning-lab[bot] commented 3 years ago

:keyboard: Activity: Make modifications to the working, but less than ideal, pipeline that exists within your course repository

Within the course repo you should see only a _targets.R and directories with code or placeholder files for each phase. You should be able to run tar_make() and build the pipeline, although it may take numerous tries, since some parts of this new workflow are brittle. Some hints to get you started: the site_data target is too big, and you should consider splitting it into a target for each site, perhaps using the download_nwis_site_data() function directly to write a file. Several of the site_data_ targets are too small and it might make sense to combine them.

When you are happy with your newer, better workflow, create a pull request with your changes and assign your designated course instructor as a reviewer. Add a comment to your own PR with thoughts on how you approached the task, as well as key decisions you made.

Recall that you should not be committing any build artifacts of the pipeline to GitHub, so make sure that your */out/* folders are included in your .gitignore file.

You should create a local branch called "refactor-targets" and push that branch up to the "remote" location (which is the github host of your repository). We're naming this branch "refactor-targets" to represent concepts in this section of the lab. In the future you'll probably choose branch names according to the type of work they contain - for example, "pull-oxygen-data" or "fix-issue-17".

git checkout -b refactor-targets
git push -u origin refactor-targets

A human will interact with your pull request once you assign them as a reviewer

elliewhite-usgs commented 3 years ago

@lindsayplatt, if you don't get to this, no worries. I'll tag Sam tomorrow.

I'm trying to avoid doing something like this: site_nums <- c("01427207", "01432160", "01435000", "01436690", "01466500") p1_targets_list <- list( tar_target(site_data_1, download_nwis_data(site_nums[1])), tar_target(site_data_2, download_nwis_data(site_nums[2])), tar_target(site_data_3, download_nwis_data(site_nums[3])), tar_target(site_data_4, download_nwis_data(site_nums[4])), tar_target(site_data_5, download_nwis_data(site_nums[5])), tar_target(site_info_csv_1, nwis_site_info(fileout = "1_fetch/out/site_info_1.csv", site_data_1), format = "file"), tar_target(site_info_csv_2, nwis_site_info(fileout = "1_fetch/out/site_info_2.csv", site_data_2), format = "file"), tar_target(site_info_csv_3, nwis_site_info(fileout = "1_fetch/out/site_info_3.csv", site_data_3), format = "file"), tar_target(site_info_csv_4, nwis_site_info(fileout = "1_fetch/out/site_info_4.csv", site_data_4), format = "file"), tar_target(site_info_csv_5, nwis_site_info(fileout = "1_fetch/out/site_info_5.csv", site_data_5), format = "file") )

It's kinda hurting my brain, cause I don't want to repeat myself when the same thing is happening to the same kind of object, so should I define a function that takes in the sites and makes the targets? that's sort of meta. I'm also trying to think about this when it's many more sites than the 5 we have here, and I realize you all put in the timeouts artificially here, so, yeah, not sure how this "solution" would scale. (It's the end of the day again and my brain is fried. I'll try again tomorrow)

lindsayplatt commented 3 years ago

Hi @whiteellie! I love that you are thinking of scaling BUT I would just say, don't think too hard about it at this point. All of Pipelines III is about answering that question - how can we write a pipeline such that we don't need to copy/paste targets code for each site that we want to use. As an example, here is what my targets for this step looked like:

  tar_target(site_data_01427207_csv, download_nwis_site_data('1_fetch/tmp/site_data_01427207.csv'), format = "file"),
  tar_target(site_data_01432160_csv, download_nwis_site_data('1_fetch/tmp/site_data_01432160.csv'), format = "file"),
  tar_target(site_data_01435000_csv, download_nwis_site_data('1_fetch/tmp/site_data_01435000.csv'), format = "file"),
  tar_target(site_data_01436690_csv, download_nwis_site_data('1_fetch/tmp/site_data_01436690.csv'), format = "file"),
  tar_target(site_data_01466500_csv, download_nwis_site_data('1_fetch/tmp/site_data_01466500.csv'), format = "file"),

Very repetitive, which is typically something we think "gah! cannot repeat code. there must be a better way" and there totally is. But for now, roll with this because that scaling topic is the entirety of Pipelines III 😄

elliewhite-usgs commented 3 years ago

@limnoliver I'm not sure why this is happening, when I run tar_make() it gives me an error saying I can't rename a column that doesn't exist... So, I put a browser() inside the function where this is happening, and when I step through the function (I've done this with 2 iterations of the loop) it works out fine. The column exists and gets renamed just fine. Any ideas why target is acting weird?

Here's the function:

process_data <- function(filepath){
  browser()
  nwis_data <- read_csv(file = filepath)
  nwis_data_clean <- nwis_data %>% 
    rename(water_temperature = X_00010_00000) %>% 
    select(-agency_cd, -X_00010_00000_cd, -tz_cd)

  return(nwis_data_clean)
}

limnoliver commented 3 years ago

There's nothing sticking out to me in that chunk. Maybe push your code and create a PR, and I can help you debug?

limnoliver commented 3 years ago

Well - welcome to using NWIS! Take a look at the csv for site 01435000. What's going on here?!

Looks like there are a lot of temperature columns for this site - which come from different sensors but that all of the 00010 parameter code. There has been recent work at this site, which probably recently forced further naming of those columns which is tripping up the pipeline. For now, why don't you just exclude that site and carry on? I created an issue in this course's repo so we can permanently change that.

elliewhite-usgs commented 3 years ago

Hahaha. This is why I always write a try/catch with NWIS.

limnoliver commented 3 years ago

You'll get to do all of these things in Piplines III :)

elliewhite-usgs / ds-pipelines-targets-2

Refactor the existing pipeline to use more effective targets #5

A human will interact with your pull request once you assign them as a reviewer