Splitters

github-learning-lab[bot] commented 2 years ago

In the last issue you noted a lingering inefficiency: When you added Illinois to the states vector, your branching pipeline built nwis_data_WI, nwis_data_MN, and nwis_data_MI again even though there was no need to download those files again. This happened because those three targets each depend on oldest_active_sites, the inventory target, and that target changed to include information about a gage in Illinois. As I noted in that issue, it would be ideal if each task branch only depended on exactly the values that determine whether the data need to be downloaded again. But we need a new tool to get there: a splitter.

The splitter we're going to create in this issue will split oldest_active_sites into a separate table for each state. In this case each table will be just one row long, but there are plenty of situations where the inputs to a set of tasks will be larger, even after splitting into task-size chunks. Some splitters will be quick to run and others will take a while, but either way, we'll be saving ourselves time in the overall pipeline!

Background

The object way to split

So far in our pipeline, we already have an object that contains the inventory information for all of the states, oldest_active_sites. Now, we can write a splitter to take the full inventory and one state name and return a one-row table.

get_state_inventory <- function(sites_info, state) {
  # Keep only the inventory rows for the requested state and return them
  dplyr::filter(sites_info, state_cd == state)
}
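
To see why this helps, here is a minimal sketch using a made-up inventory. The site numbers and the two-column layout are placeholders for illustration, not real NWIS output. The WI subset comes out the same before and after Illinois is added, even though the full inventory has changed, so a target that depends only on that subset has no reason to rebuild.

```r
library(dplyr)

# Splitter from above: keep only the rows for one state
get_state_inventory <- function(sites_info, state) {
  dplyr::filter(sites_info, state_cd == state)
}

# Hypothetical inventory; the site numbers are placeholders, not real gages
inventory_3 <- tibble::tibble(
  state_cd = c("WI", "MN", "MI"),
  site_no  = c("00000001", "00000002", "00000003")
)

# The same inventory after a fourth state is appended
inventory_4 <- dplyr::bind_rows(
  inventory_3,
  tibble::tibble(state_cd = "IL", site_no = "00000004")
)

wi_before <- get_state_inventory(inventory_3, "WI")
wi_after  <- get_state_inventory(inventory_4, "WI")

# The full table differs between the two versions...
identical(inventory_3, inventory_4)
# ...but the WI piece is unchanged
isTRUE(all.equal(wi_before, wi_after))
```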

And then we could insert an initial branching step where we pulled out that state's information before passing it to the next step, such that our tar_map() call would look like:

tar_map(
  values = tibble(state_abb = states),
  tar_target(nwis_inventory, get_state_inventory(sites_info = oldest_active_sites, state_abb)),
  tar_target(nwis_data, get_site_data(nwis_inventory, state_abb, parameter))
)

The file way to split

The "object way to split" described above works well in many cases, but note that get_state_inventory() is called once for each of our task targets (so once per state). Suppose that oldest_active_sites were a file that took a long time to read in (we've encountered cases like this for large spatial data files, for example). You'd have to re-open the file for each and every call to get_state_inventory(), which would be excruciatingly slow for a many-state pipeline. If you find yourself in that situation, you can approach "splitting" with files rather than objects.

Instead of calling get_state_inventory() once for each state, we could write a single splitter function that accepts oldest_active_sites and writes a single-row table to a file for each state. It would run faster because the data being split are loaded only once rather than once per state. This type of splitter would sit outside your branching code and would return a single summary table describing the state-specific files that were just created.
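
A file-based splitter along those lines might look roughly like the sketch below. The function name split_inventory(), the output directory, the CSV format, and the file naming pattern are all assumptions made for illustration; they are not part of the course code.

```r
# Hypothetical file-based splitter: one pass over the inventory,
# one CSV per state, and a summary table of the files written.
split_inventory <- function(sites_info, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  states <- unique(sites_info$state_cd)

  files <- vapply(states, function(state) {
    out_file <- file.path(out_dir, sprintf("inventory_%s.csv", state))
    # Subset the rows for this state and write them to their own file
    state_rows <- sites_info[sites_info$state_cd == state, , drop = FALSE]
    utils::write.csv(state_rows, out_file, row.names = FALSE)
    out_file
  }, character(1), USE.NAMES = FALSE)

  # Summary table describing the state-specific files just created
  data.frame(state_cd = states, inventory_file = files, stringsAsFactors = FALSE)
}

# Usage with a made-up two-state inventory (placeholder site numbers)
inventory <- data.frame(
  state_cd = c("WI", "MN"),
  site_no  = c("00000001", "00000002"),
  stringsAsFactors = FALSE
)
summary_tbl <- split_inventory(inventory, file.path(tempdir(), "inventory_split"))
```

The inventory is passed in once and every state file is written in the same call, so the expensive read happens a single time no matter how many states there are.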

For this next exercise, the object method for splitting described above will suit our needs just fine. There is no need to create a single splitter function that saves state-specific files for now. We mention it here so that you are aware of the limitations of object splitters and know that other options exist.

Your mission

In this issue you'll create a splitter to make your task table more efficient in the face of a changing inventory in oldest_active_sites. Your splitter function will generate separate one-row inventory data for each state.

Ready?

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Switch to a new branch

Before you edit any code, create a local branch called "splitter" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).

git checkout main
git pull origin main
git checkout -b splitter
git push -u origin splitter

The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to main and sync with "origin" whenever you're transitioning between branches and/or PRs.

Comment on this issue once you've created and pushed the "splitter" branch.

elmeraa-usgs commented 2 years ago

Commenting to continue on

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Create a separate inventory for each state

[x] Add a new target to your tar_map() call just before the nwis_data target but below the values input using this boilerplate:
```
tar_target(nwis_inventory, ),
```
[x] Add code to subset the rows in oldest_active_sites based on the branching variable, state_abb. Remember that oldest_active_sites has a column called state_cd containing the state abbreviations. Hint: go peek at the first line of the function get_site_data() in 1_fetch/src/get_site_data.R.
[x] Edit your call for the nwis_data target to use nwis_inventory instead of oldest_active_sites to take advantage of your newly split data.
[x] Lastly, the first step in get_site_data() that filters the input data is no longer needed (because that is taken care of in your new splitter step!). But be careful: the incoming data is an argument called sites_info, while the rest of the function relies on site_info (singular site, not sites). So, delete that first line and then update the argument name to the singular site_info. Now you are good :)

Test

When you think you've got it right, run your pipeline again!

tar_make()

You should now see targets being built called nwis_inventory_WI, nwis_inventory_IL, etc. It should redownload all of the data for WI, MN, MI, and IL (so rebuild nwis_data_WI, nwis_data_MI, etc) because we changed the inputs and the function for those targets. The real magic comes next.

If you're not quite getting the build to work, keep editing until you have it (but remember that there may still be "internet transfer failures" which require you to run tar_make() a few times). When you've got it, copy and paste the console output of tar_make() and tar_visnetwork() into a comment on this issue.

I'll respond when I see your comment.

elmeraa-usgs commented 2 years ago

[Two screenshots attached: Capture, Capture2]

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Test your splitter's power

You have a fancy new splitter, but you still need to see the benefits in action.

Test

[x] Call tar_make() one more time. Nothing should rebuild.
[x] Add Indiana (IN) and Iowa (IA) to the vector of states in _targets.R. Rebuild. Did you see the rebuilds and non-rebuilds that you expected?

(If you're not sure what you should have expected, check with your course contact, or another teammate.)
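
One way to check your expectations before rebuilding is to ask targets which targets it currently considers out of date. This is a minimal sketch, assuming you run it from the project root after editing the states vector in _targets.R. The "nwis_" prefix simply matches the target names used in this pipeline.

```r
library(targets)

# List the targets that would rebuild on the next tar_make(), without building them
outdated <- tar_outdated()

# Narrow the list to the per-state inventory and data targets
grep("^nwis_", outdated, value = TRUE)
```

Comparing this list against what tar_make() actually builds is a quick check that the splitter is isolating each state's inputs.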

Commit and PR

Comfortable with your pipeline's behavior? Time for a PR!

[x] Commit your changes to 1_fetch/src/get_site_data.R and _targets.R. Use git push to push your changes up to the "splitter" branch on GitHub.

When everything is committed and pushed, create a pull request on GitHub. In your PR description, note which targets got built when you added IN and IA to states.

elmeraa-usgs / ds-pipelines-targets-3

Splitters #6

I'll respond on your new PR once I spot it.