Before you edit any code, create a local branch called "static-branching" and push that branch up to the remote location "origin" (which is the github host of your repository).
git checkout main
git pull origin main
git checkout -b static-branching
git push -u origin static-branching
The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to main
and sync with "origin" whenever you're transitioning between branches and/or PRs.
Before we get to editing, let's briefly discuss the differences between static and dynamic branching, and how you choose between them.
In dynamic branching, your tasks are defined by another target in your pipeline. They are dynamic because the task targets can change while the pipeline is being built. This is particularly useful when the tasks depend on some number of files that may change through time as you run your pipeline. Read more about the key parts of dynamic branching below.
- Use tar_target() as you usually would (where the command you pass in represents one step), but add the argument pattern to define how to split a previous target into tasks. Typically, you will see map() used to define the pattern for splitting up a target into tasks, but you can find details about the other options in the targets documentation.
- Dynamic branches are simply tar_target() calls with pattern specified.
- Use tar_pattern() to preview and iterate on your branching structure before actually executing the pipeline.

Building on the example introduced previously, where we need to download, tally, and then plot data for multiple states, here is what dynamic branching could look like:
library(targets)
library(tidyverse)
library(tarchetypes)
# Add source calls to files containing `get_nwis_data()`, `tally()`, and `plot()` here
list(
  tar_target(states, c('WI', 'MN', 'MI')),
  tar_target(data, get_nwis_data(states), pattern = map(states)),
  tar_target(count, tally(data), pattern = map(data)),
  tar_target(fig, plot(count), pattern = map(count))
)
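Since dynamic branches don't exist until the pipeline runs, you can preview how map() will slice an upstream target into branches with tar_pattern(). A minimal sketch, assuming the three-state example above (so the upstream target has length 3):

```r
library(targets)

# Preview the branch slices that pattern = map(states) would create,
# without building anything. Each row represents one branch; the value
# is the slice index into the upstream target. The length (states = 3)
# must be supplied by hand, since tar_pattern() runs outside a pipeline.
tar_pattern(map(states), states = 3)
```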
In static branching, your tasks are defined by a named list or data.frame passed into your branching command. This is static because the tasks won't update as the pipeline runs; to change them, you need to update the list or data.frame yourself. Read more about the key parts of static branching below.
- Static branching uses the tar_map() function from the package tarchetypes. First, you pass in your tasks as a named list or data.frame via the values argument. Then, you set up a step by adding a call to tar_target() and using the column or list element name containing the unique tasks as an argument to your command function.
- Steps are defined by passing tar_target() calls as arguments to tar_map().
- To combine results across tasks, add a target for tar_combine(), where you pass in the output from tar_map() and then specify the command used to combine the results into one object.
- You can visualize static branches with tar_visnetwork(), as we have done to inspect our pipelines without branching. Once you are happy with your branching setup, you can execute the pipeline.

Going back once again to the pipeline where we need to download, tally, and then plot data for multiple states, here is what the static branching version would look like:
library(targets)
library(tidyverse)
library(tarchetypes)
# Add source calls to files containing `get_nwis_data()`, `tally()`, and `plot()` here
tasks <- tibble(states = c('WI', 'MN', 'MI'))
list(
  tar_map(
    values = tasks,
    tar_target(data, get_nwis_data(states)),
    tar_target(count, tally(data)),
    tar_target(fig, plot(count))
  )
)
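The static example above stops at per-state figures. If you also wanted a single object across states, tar_combine() pairs with tar_map(). A hedged sketch, assuming (as above) that get_nwis_data() and tally() are defined in sourced files and return data frames that can be row-bound:

```r
library(targets)
library(tarchetypes)
library(dplyr)

# Keep the tar_map() output in a variable so we can refer to its
# per-task targets by name below.
mapped <- tar_map(
  values = tibble::tibble(states = c('WI', 'MN', 'MI')),
  tar_target(data, get_nwis_data(states)),
  tar_target(count, tally(data))
)

list(
  mapped,
  # Combine the three per-state `count` branches into one target.
  tar_combine(all_counts, mapped[["count"]], command = bind_rows(!!!.x))
)
```

The !!!.x splicing is how tar_combine() passes all of the branch outputs into the combining command.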
How do you know when to use dynamic or static branching? This is tricky because both will work in many scenarios (as we saw above), but it ultimately comes down to how your tasks are defined.
When your tasks are predefined (e.g. states, a few basins, a specific set of user-defined sites), it makes sense to use static branching (though dynamic branching also works, as illustrated in our examples above). This doesn't mean you can't manually add a few additional tasks (e.g. include more states), but adding more is a manual step that the user needs to remember to do. One pro of static branching in these cases is that you can visualize your branches with the rest of your pipeline using tar_visnetwork(), whereas you cannot visualize the branches when using the dynamic branching approach.

When your tasks could change based on previous parts of your pipeline, you should choose dynamic branching. Examples include iterating over files in a directory (the files could change!) or using an inventory of sites to then pull data (when the inventory reruns, it may return different sites). A con is not being able to visualize your branches ahead of time, but you can still inspect them by running tar_pattern(). A pro of dynamic branching is that it follows the same pattern as all of your other targets, using tar_target() with just one additional argument specified. Another pro is that the output from each branch is automatically combined into one target.
With that intro out of the way, let's get going on implementing code for branching already!
Now that you have learned about branching, let's add it to our code. Currently, you have 3 individual targets that will download site data from our 3 Midwest states and store in a target named with the state name. Those targets look something like this:
tar_target(wi_data, get_site_data(oldest_active_sites, states[1], parameter)),
tar_target(mn_data, get_site_data(oldest_active_sites, states[2], parameter)),
tar_target(mi_data, get_site_data(oldest_active_sites, states[3], parameter)),
We are going to convert the code for those targets into static branching. We are going to make these changes on the "static-branching" branch that we created earlier. Let's get started!
Now that we are using static branching, our main pipeline makefile will need the tarchetypes package. In addition, we will use tibble::tibble() to define our task data.frame. Make these two packages available to the targets pipeline by adding the following to the other library calls in _targets.R:
library(tarchetypes)
library(tibble)
To get started, copy the code below and replace your 3 individual state targets (shown above) with it.
tar_map(
  values = tibble(state_abb = states),
  tar_target(data, get_site_data(oldest_active_sites, state_abb, parameter))
  # Insert step for tallying data here
  # Insert step for plotting data here
),
This is already set up for you, but is worth going over. Your task names are passed into tar_map() using the argument values. This argument accepts a list or a data.frame/tibble, and the names of the list elements or columns are used as arguments to the functions in your steps. Rather than change the states object outside of tar_map() (because that would require us to also update oldest_active_sites, which uses states), we are using that vector to create a column called state_abb in the tibble passed to values. That means, when we need to pass in the task names as an argument to a function, we use state_abb, the column name containing those task names.
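To see how the column name in values becomes a variable inside tar_map(), here is the task table on its own (plain R, no pipeline needed):

```r
library(tibble)

states <- c('WI', 'MN', 'MI')

# One row per task; the column name (state_abb) is the name you use
# inside tar_map() steps, and its values become the branch suffixes.
tasks <- tibble(state_abb = states)
tasks
```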
You have already learned about tar_visnetwork() as a way to visualize your pipeline before running it. By default, it will show targets and functions. We would just like to check that our branch targets are set up appropriately, so try running tar_visnetwork(targets_only = TRUE) to get a visual with just targets. You should see something similar to the image below, where there are three targets prefixed with data_.
Those targets prefixed with data_ are the branches (targets per task-step) for the get_site_data() step. They are automatically named using the target name you pass to tar_target() + an _ + the task identifier. You can test this by changing that target name from data to nwis_data and re-running tar_visnetwork(targets_only = TRUE). You should now see the branches nwis_data_MI, nwis_data_MN, and nwis_data_WI in the visual.
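Concretely, that rename looks like this in _targets.R (the same tar_map() call as before, with only the target name changed):

```r
# Renaming the target from `data` to `nwis_data` changes the generated
# branch names to nwis_data_WI, nwis_data_MN, and nwis_data_MI
# (target name + "_" + the task identifier from state_abb).
tar_map(
  values = tibble(state_abb = states),
  tar_target(nwis_data, get_site_data(oldest_active_sites, state_abb, parameter))
  # Insert step for tallying data here
  # Insert step for plotting data here
),
```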
You can also use a function called tar_manifest() to check your pipeline before running it. It will return a table of information about each target and the function call that will be used to create it. Try running tar_manifest(). You should see
# A tibble: 5 x 3
name command pattern
<chr> <chr> <chr>
1 oldest_active_sites "find_oldest_sites(states, parameter)" NA
2 nwis_data_MI "get_site_data(oldest_active_sites, \"MI\", parameter)" NA
3 nwis_data_MN "get_site_data(oldest_active_sites, \"MN\", parameter)" NA
4 nwis_data_WI "get_site_data(oldest_active_sites, \"WI\", parameter)" NA
5 site_map_png "map_sites(\"3_visualize/out/site_map.png\", oldest_active_sites)" NA
If your pipeline doesn't look as you expect it should, keep iterating on your code in the _targets.R file. When you're happy with your pipeline, run tar_manifest(starts_with('nwis_data')) to see the details for just the branches. Copy and paste the output into a new comment on this issue.
# A tibble: 3 × 3
name command pattern
<chr> <chr> <chr>
1 nwis_data_MI "get_site_data(oldest_active_sites, \"MI\", parameter)" NA
2 nwis_data_MN "get_site_data(oldest_active_sites, \"MN\", parameter)" NA
3 nwis_data_WI "get_site_data(oldest_active_sites, \"WI\", parameter)" NA
Now that you have branching set up for downloading data from NWIS, it is time to run the pipeline!
Run tar_make() to execute your pipeline and build it with static branching. You may have to call tar_make() a few times to get through any [pretend] failures in the data pulls (I had to run it 5 times), but ultimately you should see something like this output:
> tar_make()
v skip target oldest_active_sites
v skip target nwis_data_MI
v skip target nwis_data_MN
* run target nwis_data_WI
Retrieving data for WI-04073500
* run target site_map_png
* end pipeline
If you're not there yet, keep trying until your output has only * or v next to each target. Then proceed:
1. Call tar_make() one more time. You should see a green "v" next to each target.
2. Add 'IL' to the states target. Then call tar_make() again (you may have to run it multiple times to get past [pretend] failures). It builds data_IL for you, right? Cool! But there's something inefficient happening here, too - what is it? Can you guess why this is happening?
3. Make a small change to the get_site_data() function: change Sys.sleep(2) to Sys.sleep(0.5). Then call tar_make() again (and again and again if you get [pretend] internet failures). What happened?
4. Answer the questions from 2 and 3 above in a new comment on this issue.
It took multiple tar_make() runs to execute the pipeline, since we kept hitting the connection error that was set up to suspend execution for a time interval in the get_site_data.R function. Here are my answers to the above questions:
_Q: 2. Add 'IL' to the states target. Then call tar_make() again (you may have to run it multiple times to get past [pretend] failures). It builds data_IL for you, right? Cool! But there's something inefficient happening here, too - what is it? Can you guess why this is happening?_

A: It built WI_data, MN_data, and MI_data again even though there was no need to download those files again. This happened because those three targets each depend on oldest_active_sites, the inventory object, and that object changed to include information about a gage in Illinois. It would be ideal if each branch depended only on exactly the values that determine whether the data need to be downloaded again.
_Q: 3. Make a small change to the get_site_data() function: change Sys.sleep(2) to Sys.sleep(0.5). Then call tar_make() again (and again and again if you get [pretend] internet failures). What happened?_

A: It skipped oldest_active_sites and then rebuilt each of the branches: nwis_data_MI, nwis_data_MN, nwis_data_WI, and nwis_data_IL. targets knows that the function updated and that these targets depend on that function. So cool! But the change we made doesn't actually alter the output of this function; targets doesn't know that, so it noticed a change in the function and rebuilt all of the targets that used it. The good thing is that any targets depending on these nwis_data_ targets would not rebuild, because their inputs wouldn't have changed since the last build. This is also a reminder of why it is a good idea to keep functions focused on smaller, specific activities: the more a function does, the more opportunities there are for updates/fixes/improvements, and the more you may end up rebuilding.
We'll deal with (2) in the next issue.
You now have a functioning pipeline that uses branching to download data for the oldest USGS streamgage in 4 different states! Go ahead and commit these changes to _targets.R on your "static-branching" branch and then open a Pull Request.
In the last issue you noted some inefficiencies with writing out many nearly-identical targets within a remake.yml:
In this issue we'll fix those inefficiencies by adopting the branching approach supported by targets and the support package tarchetypes.
Definitions
Branching in targets refers to the approach of scaling up a pipeline to accommodate many tasks. It is the targets implementation of the split-apply-combine operation. In essence, we split a dataset into some number of tasks, then to each task we apply one or more analysis steps. Branches are the resulting targets, one for each unique task-and-step combination.
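For intuition, here is split-apply-combine in plain base R, outside of any pipeline (hypothetical toy data, not part of the course repo):

```r
# Toy data: one row per observation, tagged by state.
df <- data.frame(state = c('WI', 'WI', 'MN', 'MI'),
                 n     = c(1, 2, 3, 4))

pieces <- split(df, df$state)                    # split: one piece per task
sums   <- lapply(pieces, function(d) sum(d$n))   # apply: one step per task
combined <- data.frame(state = names(sums),      # combine: back to one object
                       total = unlist(sums))
```

In targets, branching plays the role of split() and lapply(), and the combine step happens automatically (dynamic branching) or via tar_combine() (static branching).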
In the example analysis for this course, each task is a state and the first step is a call to get_site_data() for that state's oldest monitoring site. Later we'll create additional steps for tallying and plotting observations for each state's site. See the image below for a conceptual model of branching for this course analysis.

We implement branching in two ways: static branching, where the task targets are predefined before the pipeline runs, and dynamic branching, where task targets are defined while the pipeline runs.

In this issue you'll adjust the existing pipeline to use branching for this analysis of USGS's oldest gages.