
How to get past the gotchas without getting gotten again #8

Closed: github-learning-lab[bot] closed this issue 4 years ago

github-learning-lab[bot] commented 4 years ago

In the sections that remain, we're going to go one by one through a series of tips that will help you avoid common pitfalls (or gotchas!) in pipelines. A quick list of what's to come:

- How to inspect parts of the pipeline and variables within functions
- What cyclical dependencies are and how to avoid them
- Creating side-effect targets or undocumented inputs
- How (not to) depend on a directory for changes
- What to do when you want to specify a non-build-object input to a function

:keyboard: add a comment to this issue and the bot will respond with the next topic


I'll sit patiently until you comment

cnell-usgs commented 4 years ago

I'm ready

github-learning-lab[bot] commented 4 years ago

How to inspect parts of the pipeline and variables within functions

If you've written your own functions or scripts before, you may have run into the red breakpoint dot :red_circle: on the left side of your script window:

(screenshot: a red breakpoint dot next to a line number in the RStudio script editor)

Breakpoints allow you to run a function (or script) up until the line of the breakpoint, and then the evaluation pauses. You are able to inspect all variables available at that point in the evaluation, and even step carefully forward one line at a time. It is out of scope to go through exactly how to use debuggers, but they are powerful and helpful tools. It would be a good idea to read up on them if you haven't run into breakpoints yet.

In scipiper, you can't set a breakpoint in the "normal" way, which would be clicking next to the line number after you've sourced the script. Instead, you need to use the other common debugging method, which is adding a call to the function browser() at the line where you'd like evaluation to stop.

So, if you wanted to look at what download_files were created within the download_nwis_data function, you could set a breakpoint by adding browser():

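Here's a hedged sketch of what that might look like; the body of download_nwis_data below is illustrative (the site numbers, file paths, and dataRetrieval call are placeholders), not the course's exact code:

download_nwis_data <- function(sites = c('01427207', '01432160')) {
  download_files <- c()
  for (site in sites) {
    file_out <- file.path('1_fetch/out', paste0(site, '_dv.csv'))
    site_data <- dataRetrieval::readNWISdv(siteNumbers = site, parameterCd = '00060')
    readr::write_csv(site_data, file_out)
    browser() # evaluation pauses here on each loop iteration
    download_files <- c(download_files, file_out)
  }
  return(download_files)
}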

Then, running scmake() will land you right at that browser() call, paused mid-function with all of the local variables available to inspect.


:keyboard: Comment on where you might have used browser() in the previous assignment to reconstruct targets.


I'll sit patiently until you comment

cnell-usgs commented 4 years ago

this may have been useful in my combine_nwis_data() function to make sure the new targets were working and pulled into the combined data frame

github-learning-lab[bot] commented 4 years ago

What are cyclical dependencies and how to avoid them?

The remake package has a nice function called diagram() that we haven't covered yet. It produces a dependency diagram for the target(s) you specify (remember that all is the default target). For the initial example in your repo, it looks like this:

remake::diagram()

(diagram: the dependency graph for the all target, with site_data at the top and arrows flowing down through the downstream targets)

Note the arrows capturing the dependency flow, and how site_data sits at the top, since it is the first target that needs to be built.

Also note that there are no backward-looking arrows. What if site_data relied on site_data_styled? To satisfy that relationship, an arrow would need to swing back up from site_data_styled and connect to site_data. Unfortunately, this creates a cyclical dependency: changes to one target change the other target, changes to that target feed back and change the original target, and so on, and so forth...

This potentially infinite loop is confusing to think about, and it is also something that dependency managers can't support. We won't say much more about this issue here, but if you run into a cyclical dependency error in your early days of building pipelines, this is what's going on.
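For concreteness, here is a deliberately broken sketch of what such a cycle would look like in a remakefile (the fetch_data and style_data commands are hypothetical):

targets:
  site_data:
    command: fetch_data(site_data_styled)

  site_data_styled:
    command: style_data(site_data)

Neither target can be built first, so scmake() will stop with a cyclical dependency error rather than loop forever.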


:keyboard: Add a comment when you are ready to move on.


I'll sit patiently until you comment

cnell-usgs commented 4 years ago

keep going

github-learning-lab[bot] commented 4 years ago

Creating side-effect targets or undocumented inputs

Moving into a pipeline way of thinking can reveal some surprising habits you formed when working under a different paradigm. Moving the work of scripts into functions is one thing that helps compartmentalize thinking and organize data and code relationships, but going further with how these functions are designed is also important.

side-effect targets

It is tempting to build functions that do several things; perhaps a plotting function also writes a table, or a data munging function returns a data.frame but also writes a log file. You may have noticed that there is no easy way to specify two outputs from a single function in a pipeline recipe. We can have multiple files as inputs to a function/command, but only one output/target. If a function writes a file that is not explicitly connected in the recipe (i.e., it is a "side-effect" output), the file is untracked by the dependency manager and treated like an implicit file target (see #5 for more details on this type of target). If the side-effect file is relied upon by a later target, changes to the side-effect target will indeed trigger a rebuild of the downstream target, but the dependency manager will have no way of knowing when the side-effect target itself should be rebuilt. :no_mobile_phones:
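A quick sketch of that pattern (the function, column, and file names here are hypothetical):

munge_data <- function(raw_data) {
  munged <- dplyr::filter(raw_data, !is.na(value))
  # side effect: nothing in the remakefile tells the dependency
  # manager that this file gets written here
  readr::write_csv(munged, '2_process/out/munge_log.csv')
  return(munged)
}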

Maybe the above doesn't sound like a real issue, since the side-effect target would be updated every time the other explicit target it is paired with is rebuilt. But this becomes a scary problem (and our first real gotcha!) if the explicit target is not connected to the critical path of the final sets of targets you want to build, but the implicit side-effect target is. What this means is that even if the explicit target is out of date, it will not be rebuilt because building this target is unnecessary to completing the final targets (remember "skip the work you don't need" :arrow_right_hook:). The dependency manager doesn't know that there is a hidden rule for updating the side-effect target and that this update is necessary for assuring the final targets are up-to-date and correct. :twisted_rightwards_arrows:

Side-effect targets can be used effectively, but doing so requires a good understanding of implications for tracking them and advanced strategies on how to specify rules and dependencies in a way that carries them along. :ballot_box_with_check:

undocumented inputs


Additionally, it is tempting to hard-code a filepath within a function that the function needs to access in order to run. This seems harmless, since functions are tracked by the dependency manager and any changes to them will trigger rebuilds, right? Not quite. If a file like "1_fetch/in/my_metadata.csv" is specified inside a function, but is not also an argument to the command in the remakefile recipe or listed as a depends of the resulting target, any changes to "1_fetch/in/my_metadata.csv" will go unnoticed by the dependency manager, since the string that specifies the file name remains unchanged. The system isn't smart enough to know that it needs to check whether that file has changed.

As a rule, unless you are purposefully trying to hide changes in a file from the dependency manager, do not read non-argument files in the body of a function. :end:
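As a sketch of the fix (the function and target names here are made up), instead of this:

process_data <- function(site_data) {
  # hidden input: the dependency manager can't see this file
  metadata <- readr::read_csv('1_fetch/in/my_metadata.csv')
  dplyr::left_join(site_data, metadata, by = 'site_no')
}

pass the file in as an argument and declare it in the recipe:

process_data <- function(site_data, metadata_file) {
  metadata <- readr::read_csv(metadata_file)
  dplyr::left_join(site_data, metadata, by = 'site_no')
}

  processed_data:
    command: process_data(site_data, metadata_file = '1_fetch/in/my_metadata.csv')

Now any edit to my_metadata.csv changes that file's hash, and the dependency manager knows processed_data is out of date.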


:keyboard: Add a comment when you are ready to move on.


I'll sit patiently until you comment

cnell-usgs commented 4 years ago

great, there are important points for me to understand

github-learning-lab[bot] commented 4 years ago

How (not to) depend on a directory for changes

You might have a project where there is a directory :file_folder: with a collection of files. To simplify the example, assume all of the files are .csv and have the same format. As part of the hypothetical project goals, these files need to be combined and formatted into a single plottable data.frame.

In a data pipeline, we'd want assurance that any time the number of files changes, we'd rebuild the resulting data.frame. Likewise, if at any point the contents of any one of the files changes, we'd also want to re-build the data.frame.

This hypothetical example could be coded as:

sources:
  - combine_files.R

targets:
  all:
    depends: figure_1.png

  plot_data:
    command: combine_into_df('1_fetch/in/file1.csv','1_fetch/in/file2.csv','1_fetch/in/file3.csv')

  figure_1.png:
    command: my_plot(plot_data)

:point_up: This coding would work, as it tells the dependency manager which files to track for changes and where the files in the directory are. But this solution is less than ideal, both because it doesn't scale well to many files, and because it doesn't adapt to new files arriving in the 1_fetch/in directory :file_folder: (the pipeline coder needs to manually add each new file name).


Alternatively, what about adding the directory as an input to the recipe, like this (you'd also need to modify your combine_into_df function to use dir(work_dir) to generate the file names)?

sources:
  - combine_files.R

targets:
  all:
    depends: figure_1.png

  plot_data:
    command: combine_into_df(work_dir = '1_fetch/in')

  figure_1.png:
    command: my_plot(plot_data)

After running scmake(), :point_up: this looks like it works, but it results in a rather cryptic warning message that might end up being ignored:

Warning messages:
1: In structure(.Call(C_Rmd5, files), names = files) :
  md5 failed on file '1_fetch/in'

An explanation of that warning message: in order to determine whether file contents have changed, remake uses md5sum() from the tools package to create a unique hash for each file target. You can take a look at what this does:

tools::md5sum('3_visualize/out/figure_1.png')
      3_visualize/out/figure_1.png 
"500a71be79d1e45457cdb2c31a03be46" 

md5sum() fails when you point it toward a directory, since it doesn't know what to do with that kind of input.


A third strategy might be to create a target that lists the contents of the directory, and then uses that list as an input:

sources:
  - combine_files.R

targets:
  all:
    depends: figure_1.png

  work_files:
    command: dir('1_fetch/in')

  plot_data:
    command: combine_into_df(work_files)

  figure_1.png:
    command: my_plot(plot_data)

:point_up: This approach is close, but has a few flaws: 1) because '1_fetch/in' can't be tracked as a real target, changes within that directory aren't tracked (the same issue as the previous example), and 2) if the contents of the files change but the number and names of the files don't, the work_files target won't appear changed, and therefore the necessary rebuild of plot_data would be skipped (another example of "skip the work you don't need" in action, this time working against us).

Instead, we would advocate for a modified approach that combines the work of md5sum() with the collection of files returned by dir(). What if, instead of work_files being a vector of file names (in this case, c('1_fetch/in/file1.csv', '1_fetch/in/file2.csv', '1_fetch/in/file3.csv')), we made it a data.frame of file names and their associated hashes?

# A tibble: 3 x 2
  filepath             hash                            
  <chr>                <chr>                           
1 1_fetch/in/file1.csv 530886e2b604d6d32df9ce2785c39672
2 1_fetch/in/file2.csv dc36b6dea28abc177b719015a18bccde
3 1_fetch/in/file3.csv c04f87fb9af74c4e2e2e77eed4ec80f3
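One way to build such a table is with a small helper like this (a minimal sketch; the name hash_dir_files anticipates the function we'll use in the next topic):

hash_dir_files <- function(directory) {
  filepaths <- dir(directory, full.names = TRUE)
  tibble::tibble(
    filepath = filepaths,
    hash = unname(tools::md5sum(filepaths)))
}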

Now, if combine_into_df() can be modified to accept this type of input (i.e., simply grab the filepaths with $filepath), then we'll have an input that both reflects the total number of files in the directory and captures a reference hash that will reveal any future changes. Yay! :star2: This works because a change to any one of those hashes will change the hash of the modified work_files data.frame, which in turn triggers a rebuild of plot_data.
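That modification might look like this (a sketch, leaning on the stated assumption that all of the files share the same format):

combine_into_df <- function(work_files) {
  # only the filepath column is needed to read the data; the hash
  # column exists to make this input sensitive to content changes
  purrr::map_df(work_files$filepath, readr::read_csv)
}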

But unfortunately, because work_files still relies on a directory and therefore can't be fully tracked, we're still missing a critical part of the solution. In the next topic, we'll reveal a more elegant way to trigger rebuilds, but for now, note that you'd need to manually run scmake('work_files', force = TRUE) to force scipiper to dig into that directory and hash all of the files. If the work_files result that comes back is the same as before, running scmake() won't build anything new, and we can be confident that figure_1.png is up to date.


:keyboard: Add a comment when you are ready to move on.


I'll sit patiently until you comment

cnell-usgs commented 4 years ago

haha I think I tried almost all of these approaches in the previous issue

github-learning-lab[bot] commented 4 years ago

What to do when you want to specify a non-build-object input to a function?

Wow, we've gotten this far and haven't written a function that accepts anything other than an object target or a file target. I feel so constrained!

In reality, R functions accept all kinds of other arguments, from logicals (TRUE/FALSE) to characters that specify which configuration to use.

The example in your working pipeline creates a figure, called 3_visualize/out/figure_1.png. Unless you've made a lot of modifications to the plot_nwis_timeseries() function, it has a few arguments that have default values, but aren't being used in the pipeline recipe, namely width = 12, height = 7, and units = 'in'. Nice, you can control your output plot size!
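For reference, a hedged sketch of what that function's signature might look like (the plotting body and column names are placeholders, not the course's exact code):

plot_nwis_timeseries <- function(fileout, site_data, width = 12, height = 7, units = 'in') {
  g <- ggplot2::ggplot(site_data, ggplot2::aes(x = dateTime, y = value, color = site_no)) +
    ggplot2::geom_line()
  ggplot2::ggsave(fileout, plot = g, width = width, height = height, units = units)
}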

But adding those to the remakefile like so


  3_visualize/out/figure_1.png:
    command: plot_nwis_timeseries(target_name, site_data_styled, width = 12, height = 7, units = 'in')

causes an immediate error

scmake()
 Error in .remake_add_targets_implied(obj) : 
  Implicitly created targets must all be files:
 - in: (in 3_visualize/out/figure_1.png) -- did you mean: site_data, 1_fetch/out/site_info.csv, 

Since 'in' is not an object target with a recipe in the pipeline, remake assumes it must be a file target and tries (and fails) to find a file named "in".

We know "in" is not a file, it is instead a simple argument we want to expose in the recipe, so we can make modification. To do this, wrapping the argument in the I() function tells remake to "treat this argument 'as is'", meaning don't try to infer anything fancy, just pass it along to the function.


  3_visualize/out/figure_1.png:
    command: plot_nwis_timeseries(target_name, site_data_styled, width = 12, height = 7, units = I('in'))

works! :star2:


Going back to our previous example, where we wanted to build a data.frame of file hashes from the '1_fetch/in' directory, I() comes in handy in two spots: 1) we can avoid the md5sum warning by wrapping the directory path in I(), so that '1_fetch/in' is used as a plain string instead of being treated as a file target, and 2) we can add a new argument that helps us force a refresh of the target, along with some information about when we last did it.

sources:
  - combine_files.R

targets:
  all:
    depends: figure_1.png

  work_files:
    command: hash_dir_files(directory = I('1_fetch/in'), dummy = I('2020-05-18'))

By adding this dummy argument to our own custom hash_dir_files() function, we can modify the dummy contents any time we want to force an update of work_files. Making a change to the text in the dummy argument has the same outcome as scmake('work_files', force = TRUE), but this way we keep an easy-to-read record of when we last manually refreshed the pipeline's view of what exists within '1_fetch/in'. You may see these dummy arguments in spots where there is no other trigger that would cause a rebuild, such as these directory challenges, or when pulling data from a remote webservice or website (scipiper has no way of knowing that new data are available at the same service URL).
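Extending the earlier hash_dir_files() sketch, the dummy argument is deliberately unused in the function body; its only job is to change the command's fingerprint:

hash_dir_files <- function(directory, dummy) {
  # `dummy` is ignored here; edit its value in the recipe to force a rebuild
  filepaths <- dir(directory, full.names = TRUE)
  tibble::tibble(
    filepath = filepaths,
    hash = unname(tools::md5sum(filepaths)))
}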


:keyboard: Add a comment when you are ready to move on.


I'll sit patiently until you comment

cnell-usgs commented 4 years ago

amazing

github-learning-lab[bot] commented 4 years ago


When you are done poking around, check out the next issue.