algorsky / ds-pipelines-targets-3

https://lab.github.com/USGS-R/many-task-pipelines-using-targets

Combiners #10

Closed: github-learning-lab[bot] closed this issue 2 years ago

github-learning-lab[bot] commented 2 years ago

So far we've implemented split and apply operations; now it's time to explore combine operations in targets pipelines.

In this issue you'll add two combiners to serve different purposes - the first will combine all of the annual observation tallies into one giant table, and the second will summarize the set of state-specific timeseries plots generated by the task table.

Background

Approach

Given your current level of knowledge, if you were asked to add a target combining the tally outputs, you would likely add a call to tar_target and use the branches as input to a command that aggregates the data. While this would certainly work, the number of inputs to a combiner needs to change whenever the number of tasks changes. If we hand-coded a combiner target with tar_target that accepted a fixed set of inputs (e.g., tar_target(combined_tallies, combine_tallies(tally_WI, tally_MI, [etc]))), we'd need to manually edit the inputs to that function anytime we changed the states vector. That would be a pain, and it would make our pipeline susceptible to human error if we forgot the edit or made a mistake while making it.

Implementation

The targets way to use combiners with static branching is the tar_combine() function (recall that in dynamic branching, combining is applied to the output automatically). tar_combine() is set up similarly to tar_target(): you supply the target name and a function call as the command. The difference is that the inputs to the command are multiple targets, passed in through the ... argument. The output from a tar_combine() can be an R object or a file, but file targets need format = "file" passed to tar_combine(), and the function used as the command must return the filepath.
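To make this concrete, here's a minimal sketch of the pattern, not the exercise solution. It assumes the tar_map() call that creates the per-state tally branches has been assigned to an object (called mapped_targets here purely for illustration), and that combine_obs_tallies() (the function you'll write in the next activity) row-binds any number of tally tables:

# Sketch only: `mapped_targets` holds the output of the earlier tar_map() call.
combined <- tar_combine(
  obs_tallies,
  mapped_targets$tally,                 # pass just the tally branches to `...`
  command = combine_obs_tallies(!!!.x)  # `.x` stands in for the incoming targets
)

Selecting mapped_targets$tally means the combiner automatically tracks however many tally branches tar_map() creates, so editing the states vector never requires editing the combiner.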

Some additional implementation considerations:

Don't worry if not all of this has clicked yet. We are about to see it all in action!

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Switch to a new branch

Before you edit any code, create a local branch called "combiners" and push that branch up to the remote location "origin" (the GitHub host of your repository).

git checkout main
git pull origin main
git checkout -b combiners
git push -u origin combiners

The first two lines aren't strictly necessary if your local main branch is already in sync with "origin", but it's a good habit to head back to main and sync whenever you're transitioning between branches and/or PRs.


Comment on this issue once you've created and pushed the "combiners" branch.

algorsky commented 2 years ago

I've created and pushed the "combiners" branch.

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Add a data combiner

Write combine_obs_tallies()

Prepare the makefile to use combine_obs_tallies()

Add your combiner target

Test

Run tar_make() and then tar_load(obs_tallies). Inspect the value of obs_tallies. Is it what you expected?
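As a reference point for the "Write combine_obs_tallies()" step above, here's one possible sketch (your version may differ): it accepts any number of tally tables through ... and row-binds them, defensively keeping only the tibble arguments in case tar_combine() passes along anything else.

combine_obs_tallies <- function(...) {
  # Each tally branch arrives as a separate named argument via `...`
  dots <- list(...)
  tally_dots <- dots[purrr::map_lgl(dots, tibble::is_tibble)]
  dplyr::bind_rows(tally_dots)
}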

When you're feeling confident, add a comment to this issue with your answer to the question above.


I'll respond when I see your comment.

algorsky commented 2 years ago

It's a dataframe with 744 observations of four variables, which is what I expected.

github-learning-lab[bot] commented 2 years ago

Check your progress

_Inspect the value of obs_tallies. Is it what you expected?_

Here's what my obs_tallies looks like. Your number of rows might differ slightly if you build this at a time when the available data have changed, but the column structure and approximate number of rows ought to be about the same. If it looks like this, then it meets my expectations and hopefully also yours.

> obs_tallies
# A tibble: 738 x 4
# Groups:   Site, State [6]
   Site     State  Year NumObs
   <chr>    <chr> <dbl>  <int>
 1 04073500 WI     1898    365
 2 04073500 WI     1899    365
 3 04073500 WI     1900    365
 4 04073500 WI     1901    365
 5 04073500 WI     1902    365
 6 04073500 WI     1903    365
 7 04073500 WI     1904    366
 8 04073500 WI     1905    365
 9 04073500 WI     1906    365
10 04073500 WI     1907    365
# … with 728 more rows

Comment on this issue when you're ready to proceed.


I'll respond when I see your comment.

algorsky commented 2 years ago

Ready to proceed

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Use the combiner target downstream

It's time to reap the rewards from your first combiner.
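In case it helps orient you, the downstream target might look something like this sketch, where plot_data_coverage() is a hypothetical plotting function that writes the PNG and returns its path:

tar_target(
  data_coverage_png,
  plot_data_coverage(obs_tallies, out_file = "3_visualize/out/data_coverage.png"),
  format = "file"  # file targets: the command must return the filepath
)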

When you've got it, share the image in 3_visualize/out/data_coverage.png as a comment.


I'll respond when I see your comment.

algorsky commented 2 years ago

(screenshot: 3_visualize/out/data_coverage.png)

github-learning-lab[bot] commented 2 years ago

Great, you have a combiner hooked up from start to finish, and you probably learned some things along the way! It's time to add a second combiner that serves a different purpose - here, rather than produce a target that contains the data of interest, we'll produce a combiner target that summarizes the outputs of interest (in this case the state-specific .png files we've already created).

Why do we need a summary target of outputs?

While this isn't necessary for the pipeline to operate, summarizing file outputs in large pipelines can be advantageous in some circumstances, mainly when we want to version-control information about which parts of the pipeline were updated, for ourselves or for collaborators. We can't check R object targets into GitHub, and we usually avoid checking in data files (e.g., PNGs, CSVs) because of their size. Instead, we can combine some metadata about the file targets generated in the pipeline into a small text file, save it in a log/ folder, and commit that to GitHub. Any future run of the pipeline that changes any of the metadata included in the summary file will then be tracked as a change to that file.

The first step is to write a custom function that takes a number of target names and generates a summary file using output from tar_meta(). We will refer to this file as an indicator file, because each of its lines indicates the hash of one target's output. We will save it as a CSV so that individual lines can be tracked as changed or unchanged. See below for a function that does exactly this!

summarize_targets <- function(ind_file, ...) {
  # Pull build metadata for the targets passed via `...`; assumes
  # library(targets) and library(dplyr) are loaded (for tar_meta() and %>%).
  ind_tbl <- tar_meta(c(...)) %>%
    select(tar_name = name, filepath = path, hash = data) %>%
    mutate(filepath = unlist(filepath))  # `path` is a list-column; flatten it

  readr::write_csv(ind_tbl, ind_file)
  return(ind_file)
}

:keyboard: Activity: Add a summary combiner

Try this summary function

Prepare the makefile to use summarize_targets()

Test and revise summary_state_timeseries_csv

Hmm, you probably just discovered that 3_visualize/log/summary_state_timeseries.csv summarized targets from the download, tally, AND plot steps of the static branching. We could leave it that way, but what we really wanted was the metadata status for the plot file outputs only.
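One way to narrow the summary to just the plots is sketched below, under the assumption that the tar_map() output is stored in mapped_targets and that the plot step's target is named timeseries_png (both names are illustrative):

tar_combine(
  summary_state_timeseries_csv,
  mapped_targets$timeseries_png,  # only the plot branches, not download/tally
  command = summarize_targets(
    "3_visualize/log/summary_state_timeseries.csv", !!!.x),
  format = "file"  # summarize_targets() returns the CSV filepath
)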

When you're feeling confident, add a comment to this issue with the contents of 3_visualize/out/data_coverage.png, 3_visualize/log/summary_state_timeseries.csv, and the figure generated by tar_visnetwork().


I'll respond when I see your comment.

algorsky commented 2 years ago

(screenshots: 3_visualize/out/data_coverage.png, 3_visualize/log/summary_state_timeseries.csv, and the tar_visnetwork() figure)

github-learning-lab[bot] commented 2 years ago

You're down to the last task for this issue! I hope you'll find this one rewarding. After all your hard work, you're now in a position to create a leaflet map that will give you interactive access to the locations, identities, and timeseries plots of the Upper Midwest's oldest gages, all in one .html map. Ready?
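As a rough sketch of where this is headed (the table and column names below are assumptions, not the course solution, and the real exercise also links the state plots into the popups), a minimal leaflet map saved to html might look like:

# Sketch only: assumes a site table with NWIS-style coordinate columns.
library(leaflet)
site_map <- leaflet(oldest_active_sites) %>%
  addTiles() %>%
  addMarkers(lng = ~dec_long_va, lat = ~dec_lat_va, popup = ~site_no)
htmlwidgets::saveWidget(site_map, "3_visualize/out/timeseries_map.html")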

Use the plots downstream

Test

Make a pull request

It's finally time to submit your work.


I'll respond when I see your PR.