Before you edit any code, create a local branch called "static-branching" and push that branch up to the remote location "origin" (which is the github host of your repository).
git checkout main
git pull origin main
git checkout -b static-branching
git push -u origin static-branching
The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to main
and sync with "origin" whenever you're transitioning between branches and/or PRs.
Before we get to editing, let's briefly discuss the differences between static and dynamic branching, and how you choose between them.
In dynamic branching, your tasks are defined by another target in your pipeline. They are dynamic because the task targets can change while the pipeline is being built. This is particularly useful when the tasks depend on some number of files that may change through time as you run your pipeline. Read more about the key parts of dynamic branching below.
- Use tar_target() as you usually would (where the command you pass in represents one step), but add the argument pattern to define how to split a previous target into tasks. Typically, you will see map() used to define the pattern for splitting up a target into tasks, but you can find details about the other options in the targets documentation.
- Dynamic branches are simply tar_target() calls with pattern specified.
- Use tar_pattern() to preview and iterate on your branching structure before actually executing the pipeline.

Building on the example introduced previously, where we need to download, tally, and then plot data for multiple states, here is what dynamic branching could look like:
library(targets)
library(tidyverse)
library(tarchetypes)
# Add source calls to files containing `get_nwis_data()`, `tally()`, and `plot()` here
list(
  tar_target(states, c('WI', 'MN', 'MI')),
  tar_target(data, get_nwis_data(states), pattern = map(states)),
  tar_target(count, tally(data), pattern = map(data)),
  tar_target(fig, plot(count), pattern = map(count))
)
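Since dynamic branches don't exist until the pipeline runs, you can preview how map() will slice an upstream target into branches with tar_pattern(). A minimal sketch, assuming the three-state example above (so the upstream target has length 3):

```r
library(targets)

# Preview the branch slices that pattern = map(states) would create,
# without building anything. Each row represents one branch; the value
# is the slice index into the upstream target. The length (states = 3)
# must be supplied by hand, since tar_pattern() runs outside a pipeline.
tar_pattern(map(states), states = 3)
```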
In static branching, your tasks are defined by a named list or data.frame passed into your branching command. This is static because the tasks won't update as the pipeline runs; to change them, you need to update the list or data.frame yourself. Read more about the key parts of static branching below.
- Static branching uses the tar_map() function from the package tarchetypes. First, you pass in your tasks as a named list or data.frame via the values argument. Then, you set up a step by adding a call to tar_target() and using the column or list element name containing the unique tasks as an argument to your command function.
- Steps are defined by passing tar_target() calls as arguments to tar_map().
- To combine results across tasks, add a target for tar_combine(), where you pass in the output from tar_map() and then specify the command used to combine the results into one object.
- You can visualize static branches with tar_visnetwork(), as we have done to inspect our pipelines without branching. Once you are happy with your branching setup, you can execute the pipeline.

Going back once again to the pipeline where we need to download, tally, and then plot data for multiple states, here is what the static branching version would look like:
library(targets)
library(tidyverse)
library(tarchetypes)
# Add source calls to files containing `get_nwis_data()`, `tally()`, and `plot()` here
tasks <- tibble(states = c('WI', 'MN', 'MI'))
list(
  tar_map(
    values = tasks,
    tar_target(data, get_nwis_data(states)),
    tar_target(count, tally(data)),
    tar_target(fig, plot(count))
  )
)
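The static example above stops at per-state figures. If you also wanted a single object across states, tar_combine() pairs with tar_map(). A hedged sketch, assuming (as above) that get_nwis_data() and tally() are defined in sourced files and return data frames that can be row-bound:

```r
library(targets)
library(tarchetypes)
library(dplyr)

# Keep the tar_map() output in a variable so we can refer to its
# per-task targets by name below.
mapped <- tar_map(
  values = tibble::tibble(states = c('WI', 'MN', 'MI')),
  tar_target(data, get_nwis_data(states)),
  tar_target(count, tally(data))
)

list(
  mapped,
  # Combine the three per-state `count` branches into one target.
  tar_combine(all_counts, mapped[["count"]], command = bind_rows(!!!.x))
)
```

The !!!.x splicing is how tar_combine() passes all of the branch outputs into the combining command.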
How do you know when to use dynamic or static branching? This is tricky because both will work in many scenarios (as we saw above), but it ultimately comes down to how your tasks are defined.
When your tasks are predefined (e.g. states, a few basins, a specific set of user-defined sites), it makes sense to use static branching (though dynamic branching also works, as illustrated in our examples above). This doesn't mean you can't manually add a few additional tasks (e.g. include more states), but adding more is a manual step that the user needs to remember to do. One pro of static branching in these cases is that you can visualize your branches with the rest of your pipeline using tar_visnetwork(), whereas you cannot visualize the branches when using the dynamic branching approach.

When your tasks could change based on previous parts of your pipeline, you should choose dynamic branching. Examples include iterating over files in a directory (the files could change!) or using an inventory of sites to then pull data (when the inventory reruns, it may return different sites). A con is not being able to visualize your branches ahead of time, but you can still inspect them by running tar_pattern(). A pro of dynamic branching is that it follows the same pattern as all of your other targets, using tar_target() with just one additional argument specified. Another pro is that the output from each branch is automatically combined into one target.
With that intro out of the way, let's get going on implementing code for branching already!
Now that you have learned about branching, let's add it to our code. Currently, you have 3 individual targets that will download site data from our 3 Midwest states and store in a target named with the state name. Those targets look something like this:
tar_target(wi_data, get_site_data(oldest_active_sites, states[1], parameter)),
tar_target(mn_data, get_site_data(oldest_active_sites, states[2], parameter)),
tar_target(mi_data, get_site_data(oldest_active_sites, states[3], parameter)),
We are going to convert the code for those targets into static branching. We are going to make these changes on the "static-branching" branch that we created earlier. Let's get started!
Now that we are using static branching, our main pipeline makefile will need the tarchetypes package. In addition, we will use tibble::tibble() to define our task data.frame. Make these two packages available to the targets pipeline by adding the following to the other library calls in _targets.R:
library(tarchetypes)
library(tibble)
To get started, copy the code below and replace your 3 individual state targets (shown above) with it.
tar_map(
  values = tibble(state_abb = states),
  tar_target(data, get_site_data(oldest_active_sites, state_abb, parameter))
  # Insert step for tallying data here
  # Insert step for plotting data here
),
This is already set up for you, but is worth going over. Your task names are passed into tar_map() using the argument values. This argument accepts a list or a data.frame/tibble, and the names of the list elements or columns are used as arguments to the functions in your steps. Rather than change the states object outside of tar_map() (because that would require us to also update oldest_active_sites, which uses states), we are using that vector to create a column called state_abb in the tibble passed to values. That means, when we need to pass in the task names as an argument to a function, we use state_abb, the column name containing those task names.
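To see how the column name in values becomes a variable inside tar_map(), here is the task table on its own (plain R, no pipeline needed):

```r
library(tibble)

states <- c('WI', 'MN', 'MI')

# One row per task; the column name (state_abb) is the name you use
# inside tar_map() steps, and its values become the branch suffixes.
tasks <- tibble(state_abb = states)
tasks
```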
You have already learned about tar_visnetwork() as a way to visualize your pipeline before running it. By default, it will show targets and functions. We would just like to check that our branch targets are set up appropriately, so try running tar_visnetwork(targets_only = TRUE) to get a visual with just targets. You should see something similar to the image below, where there are three targets prefixed with data_.
Those targets prefixed with data_ are the branches (targets per task-step) for the get_site_data() step. They are automatically named using the target name you pass to tar_target() + an _ + the task identifier. You can test this by changing that target name from data to nwis_data and re-running tar_visnetwork(targets_only = TRUE). You should now see the branches nwis_data_MI, nwis_data_MN, and nwis_data_WI in the visual.
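Concretely, that rename looks like this in _targets.R (the same tar_map() call as before, with only the target name changed):

```r
# Renaming the target from `data` to `nwis_data` changes the generated
# branch names to nwis_data_WI, nwis_data_MN, and nwis_data_MI
# (target name + "_" + the task identifier from state_abb).
tar_map(
  values = tibble(state_abb = states),
  tar_target(nwis_data, get_site_data(oldest_active_sites, state_abb, parameter))
  # Insert step for tallying data here
  # Insert step for plotting data here
),
```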
You can also use a function called tar_manifest() to check your pipeline before running it. It will return a table of information about each target and the function call that will be used to create it. Try running tar_manifest(). You should see
# A tibble: 5 x 3
name command pattern
<chr> <chr> <chr>
1 oldest_active_sites "find_oldest_sites(states, parameter)" NA
2 nwis_data_MI "get_site_data(oldest_active_sites, \"MI\", parameter)" NA
3 nwis_data_MN "get_site_data(oldest_active_sites, \"MN\", parameter)" NA
4 nwis_data_WI "get_site_data(oldest_active_sites, \"WI\", parameter)" NA
5 site_map_png "map_sites(\"3_visualize/out/site_map.png\", oldest_active_sites)" NA
If your pipeline doesn't look as you expect it should, keep iterating on your code in the _targets.R file. When you're happy with your pipeline, run tar_manifest(starts_with('nwis_data')) to see the details for just the branches. Copy and paste the output into a new comment on this issue.
# A tibble: 3 × 3
name command pattern
<chr> <chr> <chr>
1 nwis_data_MI "get_site_data(oldest_active_sites, \"MI\", parameter)" NA
2 nwis_data_MN "get_site_data(oldest_active_sites, \"MN\", parameter)" NA
3 nwis_data_WI "get_site_data(oldest_active_sites, \"WI\", parameter)" NA
Now that you have branching set up for downloading data from NWIS, it is time to run the pipeline!
Run tar_make() to execute your pipeline and build it with static branching. You may have to call tar_make() a few times to get through any [pretend] failures in the data pulls (I had to run it 5 times), but ultimately you should see something like this output:
> tar_make()
v skip target oldest_active_sites
v skip target nwis_data_MI
v skip target nwis_data_MN
* run target nwis_data_WI
Retrieving data for WI-04073500
* run target site_map_png
* end pipeline
If you're not there yet, keep trying until your output has only * or v next to each target. Then proceed:
1. Call tar_make() one more time. You should see a green "v" next to each target.
2. Add 'IL' to the states target. Then call tar_make() again (you may have to run it multiple times to get past [pretend] failures). It builds data_IL for you, right? Cool! But there's something inefficient happening here, too - what is it? Can you guess why this is happening?
3. Make a small change to the get_site_data() function: change Sys.sleep(2) to Sys.sleep(0.5). Then call tar_make() again (and again and again if you get [pretend] internet failures). What happened?
4. Answer the questions from 2 and 3 above in a new comment on this issue.
It took multiple tar_make() runs to execute the pipeline, since we kept hitting the connection error that was set up to suspend execution for a time interval in the get_site_data.R function. Here are my answers to the above questions:
_Q: 2. Add 'IL' to the states target. Then call tar_make() again (you may have to run it multiple times to get past [pretend] failures). It builds data_IL for you, right? Cool! But there's something inefficient happening here, too - what is it? Can you guess why this is happening?_

A: It built WI_data, MN_data, and MI_data again even though there was no need to download those files again. This happened because those three targets each depend on oldest_active_sites, the inventory object, and that object changed to include information about a gage in Illinois. It would be ideal if each branch depended only on exactly the values that determine whether the data need to be downloaded again.
_Q: 3. Make a small change to the get_site_data() function: change Sys.sleep(2) to Sys.sleep(0.5). Then call tar_make() again (and again and again if you get [pretend] internet failures). What happened?_

A: It skipped oldest_active_sites and then rebuilt each of the branches: nwis_data_MI, nwis_data_MN, nwis_data_WI, and nwis_data_IL. targets knows that the function updated and that these targets depend on that function. So cool! But the change we made doesn't actually alter the output of this function; targets doesn't know that, so it noticed a change in the function and rebuilt all of the targets that used it. The good thing is that any targets depending on these nwis_data_ targets would not rebuild, because their inputs wouldn't have changed since the last build. This is also a reminder of why it is a good idea to keep functions focused on smaller, specific activities: the more a function does, the more opportunities there are for updates/fixes/improvements, and the more you may end up rebuilding.
We'll deal with (2) in the next issue.
You now have a functioning pipeline that uses branching to download data for the oldest USGS streamgage in 4 different states! Go ahead and commit these changes to _targets.R on your "static-branching" branch and then open a Pull Request.
In the last issue you noted some inefficiencies with writing out many nearly-identical targets within a remake.yml:
In this issue we'll fix those inefficiencies by adopting the branching approach supported by targets and the support package tarchetypes.
Definitions
Branching in targets refers to the approach of scaling up a pipeline to accommodate many tasks. It is the targets implementation of the split-apply-combine operation. In essence, we split a dataset into some number of tasks, then to each task we apply one or more analysis steps. Branches are the resulting targets, one for each unique task-and-step combination.
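For intuition, here is split-apply-combine in plain base R, outside of any pipeline (hypothetical toy data, not part of the course repo):

```r
# Toy data: one row per observation, tagged by state.
df <- data.frame(state = c('WI', 'WI', 'MN', 'MI'),
                 n     = c(1, 2, 3, 4))

pieces <- split(df, df$state)                    # split: one piece per task
sums   <- lapply(pieces, function(d) sum(d$n))   # apply: one step per task
combined <- data.frame(state = names(sums),      # combine: back to one object
                       total = unlist(sums))
```

In targets, branching plays the role of split() and lapply(), and the combine step happens automatically (dynamic branching) or via tar_combine() (static branching).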
In the example analysis for this course, each task is a state and the first step is a call to get_site_data() for that state's oldest monitoring site. Later we'll create additional steps for tallying and plotting observations for each state's site. See the image below for a conceptual model of branching for this course analysis.

We implement branching in two ways: static branching, where the task targets are predefined before the pipeline runs, and dynamic branching, where task targets are defined while the pipeline runs.

In this issue you'll adjust the existing pipeline to use branching for this analysis of USGS's oldest gages.