Before you edit any code, create a local branch called "splitter" and push that branch up to the remote location "origin" (the GitHub host of your repository).
```shell
git checkout main
git pull origin main
git checkout -b splitter
git push -u origin splitter
```
The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to `main` and sync with "origin" whenever you're transitioning between branches and/or PRs.
- [x] Add a new target to your `tar_map()` call just before the `nwis_data` target but below the `values` input, using this boilerplate: `tar_target(nwis_inventory, ),`
- [x] Add code to subset the rows in `oldest_active_sites` based on the branching variable, `state_abb`. Remember that `oldest_active_sites` has a column called `state_cd` containing the state abbreviations. Hint: go peek at the first line of the function `get_site_data()` in `1_fetch/src/get_site_data.R`.
- [x] Edit your call for the `nwis_data` target to use `nwis_inventory` instead of `oldest_active_sites` to take advantage of your newly split data.
- [x] Lastly, the first step in `get_site_data()` that filters the input data is no longer needed (because that is taken care of in your new splitter step!). But be careful: the incoming data is an argument called `sites_info`, while the rest of the function relies on `site_info` (singular `site`, not `sites`). So, delete that first line, then update the argument name to be singular, `site_info`. Now you are good :) A sketch of the combined edits follows this list.
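Taken together, the edited section of `_targets.R` might look something like this sketch. Treat it as illustrative, not definitive: the `tibble` of `values`, the `parameter` object, and the exact `get_site_data()` arguments are assumptions based on the steps above, so match them to your own file.

```r
tar_map(
  values = tibble::tibble(state_abb = states),
  # New splitter target: subset the full inventory to this branch's state
  tar_target(
    nwis_inventory,
    dplyr::filter(oldest_active_sites, state_cd == state_abb)
  ),
  # The download step now depends only on the one-row state inventory
  tar_target(
    nwis_data,
    get_site_data(site_info = nwis_inventory, state = state_abb, parameter = parameter)
  )
)
```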
When you think you've got it right, run your pipeline again!

```r
tar_make()
```
You should now see targets being built called `nwis_inventory_WI`, `nwis_inventory_IL`, etc. It should redownload all of the data for WI, MN, MI, and IL (so rebuild `nwis_data_WI`, `nwis_data_MI`, etc.) because we changed the inputs and the function for those targets. The real magic comes next.
If you're not quite getting the build to work, keep editing until you have it (but remember that there may still be "internet transfer failures" which require you to run `tar_make()` a few times). When you've got it, copy and paste the console output of `tar_make()` and `tar_visnetwork()` into a comment on this issue.
You have a fancy new splitter, but you still need to see the benefits in action.
- [x] Call `tar_make()` one more time. Nothing should rebuild.
- [x] Add Indiana (`IN`) and Iowa (`IA`) to the vector of `states` in `_targets.R` (see the sketch after this list). Rebuild. Did you see the rebuilds and non-rebuilds that you expected? (If you're not sure what you should have expected, check with your course contact or another teammate.)
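Assuming your `states` vector currently holds the four states used so far, that edit would look like:

```r
states <- c("WI", "MN", "MI", "IL", "IN", "IA")
```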
Comfortable with your pipeline's behavior? Time for a PR!

Run `git push` to push your changes up to the "splitter" branch on GitHub. When everything is committed and pushed, create a pull request on GitHub. In your PR description, note which targets got built when you added `IN` and `IA` to `states`.
In the last issue you noted a lingering inefficiency: when you added Illinois to the `states` vector, your branching pipeline built `nwis_data_WI`, `nwis_data_MN`, and `nwis_data_MI` again even though there was no need to download those files again. This happened because those three targets each depend on `oldest_active_sites`, the inventory target, and that target changed to include information about a gage in Illinois. As I noted in that issue, it would be ideal if each task branch only depended on exactly the values that determine whether the data need to be downloaded again. But we need a new tool to get there: a splitter.

The splitter we're going to create in this issue will split `oldest_active_sites` into a separate table for each state. In this case each table will be just one row long, but there are plenty of situations where the inputs to a set of tasks will be larger, even after splitting into task-size chunks. Some splitters will be quick to run and others will take a while, but either way, we'll be saving ourselves time in the overall pipeline!

## Background
### The object way to split
So far in our pipeline, we already have an object that contains the inventory information for all of the states: `oldest_active_sites`. We can write a splitter that takes the full inventory and one state name and returns a one-row table. Then we could insert an initial branching step that pulls out each state's information before passing it to the next step, such that our `tar_map()` call would look like the sketch below.
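A minimal sketch of that pattern, assuming the `get_state_inventory()` name used later in this issue and omitting any other targets already in your `tar_map()`:

```r
# Splitter: take the full inventory and one state name, return a one-row table
get_state_inventory <- function(sites_info, state) {
  dplyr::filter(sites_info, state_cd == state)
}
```

And then the `tar_map()` call could begin with that splitter as its first target:

```r
tar_map(
  values = tibble::tibble(state_abb = states),
  # Initial branching step: pull out this state's information first...
  tar_target(nwis_inventory, get_state_inventory(oldest_active_sites, state_abb)),
  # ...then pass the one-row inventory along to the next step
  tar_target(nwis_data, get_site_data(nwis_inventory, state_abb, parameter))
)
```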
The "object way to split" described above works well in many cases, but note that
get_state_inventory()
is called for each of our task targets (so each state). Suppose thatoldest_active_sites
was a file that took a long time to read in - we've encountered cases like this for large spatial data files, for example - you'd have to re-open the file for each and every call toget_state_inventory()
, which would be excruciatingly slow for a many-state pipeline. If you find yourself in that situation, you can approach "splitting" with files rather than objects.Instead of calling
get_state_inventory()
once for each state, we could and write a single splitter function that acceptsoldest_active_sites
and writes a single-row table for each state. It will be faster to run because there will not be redundant reloading of the data that is needing to be split. This type of splitter would not be within your branching code and instead return a single summary table describing the state-specific files that were just created.For this next exercise, the object method for splitting described before will suit our needs just fine. There is no need to create a single splitter function that saves state-specific files for now. We are mentioning it here so that you can be aware of the limitations of splitters and be aware that other options exist.
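Purely for illustration - again, you are not asked to build this - a file-based splitter could look like the following sketch, where every name (`split_inventory()`, the output directory, the summary table's columns) is hypothetical:

```r
# Hypothetical file-based splitter: runs once, writes one CSV per state,
# and returns a summary table describing the files it just created
split_inventory <- function(sites_info, states, out_dir = "1_fetch/tmp") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  out_files <- file.path(out_dir, sprintf("inventory_%s.csv", states))
  for (i in seq_along(states)) {
    one_state <- dplyr::filter(sites_info, state_cd == states[i])
    readr::write_csv(one_state, out_files[i])
  }
  # One row per state-specific file, for downstream targets to track
  tibble::tibble(state_abb = states, inventory_file = out_files)
}
```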
## Your mission
In this issue you'll create a splitter to make your task table more efficient in the face of a changing inventory in `oldest_active_sites`. Your splitter function will generate a separate one-row inventory table for each state.

Ready?