algorsky / ds-pipelines-targets-3

https://lab.github.com/USGS-R/many-task-pipelines-using-targets

Combiners #10

Closed: github-learning-lab[bot] closed this issue 2 years ago

github-learning-lab[bot] commented 2 years ago

So far we've implemented split and apply operations; now it's time to explore combine operations in targets pipelines.

In this issue you'll add two combiners to serve different purposes - the first will combine all of the annual observation tallies into one giant table, and the second will summarize the set of state-specific timeseries plots generated by the task table.

Background

Approach

Given your current level of knowledge, if you were asked to add a target combining the tally outputs, you would likely add a call to tar_target and use the branches as input to a command that aggregates the data. While this would certainly work, the number of inputs to a combiner needs to change whenever the number of tasks changes. If we hand-coded a combiner target with tar_target that accepted a fixed set of inputs (e.g., tar_target(combined_tallies, combine_tallies(tally_WI, tally_MI, [etc]))), we'd need to manually edit the inputs to that function anytime we changed the states vector. That would be a pain, and it would make our pipeline susceptible to human error if we forgot the edit or made a mistake while making it.

Implementation

The targets way to use combiners with static branching is the tar_combine() function (recall that in dynamic branching, combining is applied to the output automatically). tar_combine() is set up similarly to tar_target(): you supply the target name and a function call as the command. The difference is that the inputs to the command are multiple targets, passed in through the ... argument. The output from a tar_combine() can be an R object or a file, but file targets need format = "file" passed to tar_combine(), and the function used as the command must return the filepath.
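To make this concrete, here's a minimal sketch of the pattern, not the exercise solution. It assumes the tar_map() call that creates the per-state tally branches has been assigned to an object (called mapped_targets here purely for illustration), and that combine_obs_tallies() (the function you'll write in the next activity) row-binds any number of tally tables:

# Sketch only: `mapped_targets` holds the output of the earlier tar_map() call.
combined <- tar_combine(
  obs_tallies,
  mapped_targets$tally,                 # pass just the tally branches to `...`
  command = combine_obs_tallies(!!!.x)  # `.x` stands in for the incoming targets
)

Selecting mapped_targets$tally means the combiner automatically tracks however many tally branches tar_map() creates, so editing the states vector never requires editing the combiner.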

Some additional implementation considerations:

Don't worry if not all of this has clicked yet. We are about to see it all in action!

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Switch to a new branch

Before you edit any code, create a local branch called "combiners" and push that branch up to the remote location "origin" (the GitHub host of your repository).

git checkout main
git pull origin main
git checkout -b combiners
git push -u origin combiners

The first two lines aren't strictly necessary if your local main branch is already in sync with "origin", but it's a good habit to head back to main and sync whenever you're transitioning between branches and/or PRs.


Comment on this issue once you've created and pushed the "combiners" branch.

algorsky commented 2 years ago

I've created and pushed the "combiners" branch.

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Add a data combiner

Write combine_obs_tallies()

Prepare the makefile to use combine_obs_tallies()

Add your combiner target

Test

Run tar_make() and then tar_load(obs_tallies). Inspect the value of obs_tallies. Is it what you expected?
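As a reference point for the "Write combine_obs_tallies()" step above, here's one possible sketch (your version may differ): it accepts any number of tally tables through ... and row-binds them, defensively keeping only the tibble arguments in case tar_combine() passes along anything else.

combine_obs_tallies <- function(...) {
  # Each tally branch arrives as a separate named argument via `...`
  dots <- list(...)
  tally_dots <- dots[purrr::map_lgl(dots, tibble::is_tibble)]
  dplyr::bind_rows(tally_dots)
}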

When you're feeling confident, add a comment to this issue with your answer to the question above.


I'll respond when I see your comment.

algorsky commented 2 years ago

It's a dataframe with 744 observations of four variables, which is what I expected.

github-learning-lab[bot] commented 2 years ago

Check your progress

_Inspect the value of obs_tallies. Is it what you expected?_

Here's what my obs_tallies looks like. Your number of rows might differ slightly if you build this at a time when the available data have changed, but the column structure and approximate number of rows ought to be about the same. If it looks like this, then it meets my expectations and hopefully also yours.

> obs_tallies
# A tibble: 738 x 4
# Groups:   Site, State [6]
   Site     State  Year NumObs
   <chr>    <chr> <dbl>  <int>
 1 04073500 WI     1898    365
 2 04073500 WI     1899    365
 3 04073500 WI     1900    365
 4 04073500 WI     1901    365
 5 04073500 WI     1902    365
 6 04073500 WI     1903    365
 7 04073500 WI     1904    366
 8 04073500 WI     1905    365
 9 04073500 WI     1906    365
10 04073500 WI     1907    365
# … with 728 more rows

Comment on this issue when you're ready to proceed.


I'll respond when I see your comment.

algorsky commented 2 years ago

Ready to proceed

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Use the combiner target downstream

It's time to reap the rewards from your first combiner.
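In case it helps orient you, the downstream target might look something like this sketch, where plot_data_coverage() is a hypothetical plotting function that writes the PNG and returns its path:

tar_target(
  data_coverage_png,
  plot_data_coverage(obs_tallies, out_file = "3_visualize/out/data_coverage.png"),
  format = "file"  # file targets: the command must return the filepath
)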

When you've got it, share the image in 3_visualize/out/data_coverage.png as a comment.


I'll respond when I see your comment.

algorsky commented 2 years ago

(screenshot: 3_visualize/out/data_coverage.png)

github-learning-lab[bot] commented 2 years ago

Great, you have a combiner hooked up from start to finish, and you probably learned some things along the way! It's time to add a second combiner that serves a different purpose - here, rather than produce a target that contains the data of interest, we'll produce a combiner target that summarizes the outputs of interest (in this case the state-specific .png files we've already created).

Why do we need a summary target of outputs?

While this isn't necessary for the pipeline to operate, summarizing file outputs in large pipelines can be advantageous in some circumstances, mainly when we want to version-control information about which parts of the pipeline were updated, for ourselves or for collaborators. We can't check R object targets into GitHub, and we usually avoid checking in data files (e.g., PNGs, CSVs) because of their size. Instead, we can combine some metadata about the file targets generated in the pipeline into a small text file, save it in a log/ folder, and commit that to GitHub. Any future run of the pipeline that changes any of the metadata included in the summary file will then be tracked as a change to that file.

The first step is to write a custom function that takes a number of target names and generates a summary file using output from tar_meta(). We will refer to this file as an indicator file, because each of its lines indicates the hash of one target's output. We will save it as a CSV so that individual lines can be tracked as changed or unchanged. See below for a function that does exactly this!

summarize_targets <- function(ind_file, ...) {
  # Pull build metadata for the targets passed via `...`; assumes
  # library(targets) and library(dplyr) are loaded (for tar_meta() and %>%).
  ind_tbl <- tar_meta(c(...)) %>%
    select(tar_name = name, filepath = path, hash = data) %>%
    mutate(filepath = unlist(filepath))  # `path` is a list-column; flatten it

  readr::write_csv(ind_tbl, ind_file)
  return(ind_file)
}

:keyboard: Activity: Add a summary combiner

Try this summary function

Prepare the makefile to use summarize_targets()

Test and revise summary_state_timeseries_csv

Hmm, you probably just discovered that 3_visualize/log/summary_state_timeseries.csv summarized targets from the download, tally, AND plot steps of the static branching. We could leave it that way, but what we really wanted was the metadata status for the plot file outputs only.
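One way to narrow the summary to just the plots is sketched below, under the assumption that the tar_map() output is stored in mapped_targets and that the plot step's target is named timeseries_png (both names are illustrative):

tar_combine(
  summary_state_timeseries_csv,
  mapped_targets$timeseries_png,  # only the plot branches, not download/tally
  command = summarize_targets(
    "3_visualize/log/summary_state_timeseries.csv", !!!.x),
  format = "file"  # summarize_targets() returns the CSV filepath
)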

When you're feeling confident, add a comment to this issue with the contents of 3_visualize/out/data_coverage.png, 3_visualize/log/summary_state_timeseries.csv, and the figure generated by tar_visnetwork().


I'll respond when I see your comment.

algorsky commented 2 years ago

(screenshots: 3_visualize/out/data_coverage.png, 3_visualize/log/summary_state_timeseries.csv, and the tar_visnetwork() figure)

github-learning-lab[bot] commented 2 years ago

You're down to the last task for this issue! I hope you'll find this one rewarding. After all your hard work, you're now in a position to create a leaflet map that will give you interactive access to the locations, identities, and timeseries plots of the Upper Midwest's oldest gages, all in one .html map. Ready?
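As a rough sketch of where this is headed (the table and column names below are assumptions, not the course solution, and the real exercise also links the state plots into the popups), a minimal leaflet map saved to html might look like:

# Sketch only: assumes a site table with NWIS-style coordinate columns.
library(leaflet)
site_map <- leaflet(oldest_active_sites) %>%
  addTiles() %>%
  addMarkers(lng = ~dec_long_va, lat = ~dec_lat_va, popup = ~site_no)
htmlwidgets::saveWidget(site_map, "3_visualize/out/timeseries_map.html")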

Use the plots downstream

Test

Make a pull request

It's finally time to submit your work.


I'll respond when I see your PR.