DOI-USGS / ds-pipelines-targets-3-course

Many-task pipelines using targets

fast vs slow way to split in targets? #8

Closed · lindsayplatt closed this 3 years ago

lindsayplatt commented 3 years ago

I was re-reading the course one last time to make sure everything worked and made sense. I walked through the splitters exercise and realized that the "slow way to split" described here is exactly the way I ended up doing it (see here).

So, I am wondering if we should teach that saving individual files for each state's split inventory from a single function call, as we would do in scipiper, is the better approach (ignoring dynamic branching, which comes later)? Or does it not matter as much in targets, meaning a lot of this text should be deleted instead?

jzwart commented 3 years ago

At least for this case, the speed doesn't really matter, but it would be good to have an example of reading in the data file only once and still creating site-specific targets. If we were to write out one file for every state, then we'd also need a summary file that describes whether the state-specific files have changed, right? I haven't done this with targets, but it should be doable.

To only read in the sites_info once, could we first create an object that is the nwis_inventory for all states and then split that object by state to create state-specific targets? I think this might work and still be consistent with the targets rules.

lindsayplatt commented 3 years ago

For the first part, I haven't written out multiple files per target before but I do have an example later in this course about saving those types of summary files that I can reuse as part of this example.

For your second point, I think that is already what happens? That is the equivalent of what the nwis_inventory target does.

limnoliver commented 3 years ago

I don't have a strong preference for the "slow" vs. "fast" approach, but I do think this example is still relevant in targets. So I don't think we need to present it as the "better" approach, but as one approach that can solve this issue if the speed is a deal breaker for you.

lindsayplatt commented 3 years ago

OK, so I thought through this all a bit more this morning. What do people think of this?

The object way to split

In general, targets best practices require that each command creates exactly one output, either an R object or a file. To follow this policy, we could write a function that would take the full inventory and one state name and return a one-row table.

get_state_inventory <- function(sites_info, state) {
  # Return just the row(s) of the inventory belonging to this state
  dplyr::filter(sites_info, state_cd == state)
}
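
For example, assuming the inventory's state_cd column holds two-letter state codes like "WI" (an illustrative call, not code from the course), you could pull out one state's row like this:

wi_inventory <- get_state_inventory(sites_info = oldest_active_sites, state = "WI")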

And then we could insert an initial branching step that pulls out that state's information before passing it to the next step, such that our tar_map() call would look like:

# tar_map() comes from the tarchetypes package and creates one copy of
# each inner target per row of `values`
tar_map(
  values = tibble::tibble(state_abb = states),
  # Step 1: subset the shared inventory to this state's row
  tar_target(nwis_inventory, get_state_inventory(sites_info = oldest_active_sites, state_abb)),
  # Step 2: pull the data for this state's site
  tar_target(nwis_data, get_site_data(nwis_inventory, state_abb, parameter))
)

The file way to split

The "object way to split" described above works well in many cases, but note that get_state_inventory() is called for each of our task targets (so each state). Suppose that oldest_active_sites was a file that took a long time to read in - we've encountered cases like this for large spatial data files, for example - you'd have to re-open the file for each and every call to get_state_inventory(), which would be excruciatingly slow for a many-state pipeline. If you find yourself in that situation, you can approach "splitting" with files rather than objects.

Instead of calling get_state_inventory() once for each state, we could write a single splitter function that accepts oldest_active_sites and writes a one-row table to a file for each state. It would be faster to run because there would be no redundant reloading of the data that needs to be split. This type of splitter would sit outside your branching code and would instead return a single summary table describing the state-specific files that were just created.
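
To make that concrete, a splitter along these lines might look something like the sketch below. The function name, output directory, and summary columns are illustrative assumptions, not the pattern this course will use:

split_inventory_by_state <- function(sites_info, states, out_dir = "1_fetch/tmp") {
  # Write one single-state inventory file per state, touching the full
  # inventory object only once
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- sapply(states, function(state) {
    out_file <- file.path(out_dir, sprintf("inventory_%s.csv", state))
    readr::write_csv(dplyr::filter(sites_info, state_cd == state), out_file)
    out_file
  })
  # Return a single summary table; the file hashes let downstream targets
  # detect when any state-specific file has changed
  tibble::tibble(
    state = states,
    file = unname(files),
    hash = unname(tools::md5sum(files))
  )
}

A downstream target could then depend on one row of this summary table, rather than on the full inventory, to decide whether its state needs to be rebuilt.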

For this next exercise, the object method for splitting described above will suit our needs just fine; there is no need to create a single splitter function that saves state-specific files. We mention the file-based approach here so that you are aware of the limitations of object-based splitting and know that other options exist.

aappling-usgs commented 3 years ago

In general, targets best practices require that each command creates exactly one output, either an R object or a file.

Are we firm on this? I thought that our 1:1 guideline was more of a practical nod to remake's limitations. Is it still impractical to, say, use tar_group() when creating the inventory and then define tar_target(nwis_data, get_site_data(inventory, parameter), pattern=map(inventory))? Or is this 1:1 guideline meant to apply only to static mapping? What am I missing?
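
Roughly, I'm imagining something like this (a sketch only, assuming the inventory can be grouped by state_cd):

list(
  tar_target(
    inventory,
    # Attach group structure so downstream targets can branch over it
    tar_group(dplyr::group_by(oldest_active_sites, state_cd)),
    iteration = "group"
  ),
  tar_target(
    nwis_data,
    get_site_data(inventory, parameter),
    pattern = map(inventory)  # one dynamic branch per state group
  )
)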

lindsayplatt commented 3 years ago

Oh you are right. That sentence is wrong. What you mention above is exactly what this course walks a user through when we get to dynamic branching, but we wanted to talk about static branching first as we introduce the idea because the tar_visnetwork() output and console messages are easier to understand. Either way, I don't think the 1:1 guideline is appropriate. That intro paragraph should look more like this:

We already have an object that contains the information for all of the states. Now, we can write a splitter to take the full inventory and one state name and return a one-row table.

aappling-usgs commented 3 years ago

Phew! The new text works for me. Is there text above it that says why you'd want a splitter? (Oh, yes, found it here. Everything looks great.)

lindsayplatt commented 3 years ago

Oh yes! I was just including the part that is being changed related to fast vs slow. See https://github.com/USGS-R/ds-pipelines-targets-3-course/blob/main/responses/30-about-splitters.md