elmeraa-usgs / ds-pipelines-targets-3

https://lab.github.com/USGS-R/many-task-pipelines-using-targets

Scale up #12

Closed · github-learning-lab[bot] closed this issue 2 years ago

github-learning-lab[bot] commented 2 years ago

Your pipeline is looking great, @elmeraa! It's time to put it through its paces and experience the benefits of a well-plumbed pipeline. The larger your pipeline becomes, the more useful the tools you've learned in this course become.

In this issue you will:

:keyboard: Activity: Check for targets updates

Before you get started, make sure you have the most up-to-date version of targets:

packageVersion('targets')
## [1] ‘0.5.0.9002’

You should have package version >= 0.5.0.9002. If you don't, reinstall with:

remotes::install_github('ropensci/targets')
github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Switch to a new branch

Before you edit any code, create a local branch called "scale-up" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).

git checkout main
git pull origin main
git checkout -b scale-up
git push -u origin scale-up

The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to main and sync with "origin" whenever you're transitioning between branches and/or PRs.


Comment on this issue once you've created and pushed the "scale-up" branch.

elmeraa-usgs commented 2 years ago

Commenting to continue on

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Include all the states

Expand states
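
As a sketch, the end state of this activity is a states vector in _targets.R that covers every state and territory; the same 53-element vector appears again later in this issue:

states <- c('AL','AZ','AR','CA','CO','CT','DE','DC','FL','GA','ID','IL','IN','IA',
            'KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH',
            'NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX',
            'UT','VT','VA','WA','WV','WI','WY','AK','HI','GU','PR')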

Test

Comment on what you're seeing.


I'll respond when I see your comment.

elmeraa-usgs commented 2 years ago

i) 25% chance? ii) lots lol

github-learning-lab[bot] commented 2 years ago

:keyboard: Activity: Use fault tolerant approaches to running tar_make()

Rather than babysitting repeated tar_make() calls until all the states build, it's time to adapt our approach to running tar_make() when some steps are plagued by network failures. Often a failed download/upload will succeed if you simply retry it; other times the failure is real and you need to address it. The targets package does not currently offer this kind of fault tolerance, so the approaches discussed here were designed by our group to provide fault tolerance for tasks such as this data pull (including cases where the "failures" are all real rather than largely synthetic, as in this project :wink:).

Understand your options

There are a few choices to consider when thinking about fault tolerance in pipelines, and they fall into two categories: how you want the full pipeline to behave and how you want to handle individual targets.

Choices for handling errors in the full pipeline:

  1. You want the pipeline build to come to a grinding halt if any of the targets throws an error.
  2. You want to come back and rebuild the target that is failing but not let that stop other targets from building.

If you want the first approach, congrats! That's how the pipeline behaves by default and there is no need for you to change anything. If you want the pipeline to keep going but return to build that target later, you should add error = 'continue' to your tar_option_set() call.
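
As an illustration, here is a minimal sketch of that option in _targets.R (the packages argument stands in for whatever options your pipeline already sets):

library(targets)

tar_option_set(
  packages = c('tidyverse'),  # keep whatever options you already set here
  error = 'continue'          # log the error, skip its downstream targets, keep building the rest
)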

Now let's talk about handling errors for individual targets. There are also a few ideas to consider here.

  1. If the target fails, you want that target to return no data and keep going.
  2. If the target fails, you want to retry building that target n times (in case of internet flakiness) before ultimately considering it a failed target.

If you want a failure to still be considered a completed build, then consider implementing tryCatch in your download/upload function to gracefully handle errors, return something (e.g. data.frame()) from the function, and allow the code to continue. If you want to retry a target before moving on in the pipeline, then we can use the function retry::retry(). This is a function from the retry package, which you may or may not have installed. Go ahead and check that you have this package before continuing.
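
Here's a sketch of the first approach, assuming a get_site_data() downloader like the one used in this course (the wrapper name and argument names are illustrative):

get_site_data_safely <- function(site_info, state, parameter) {
  tryCatch(
    get_site_data(site_info, state, parameter),
    error = function(e) {
      message('Data pull failed for ', state, ': ', conditionMessage(e))
      data.frame()  # return an empty result so the pipeline can keep going
    }
  )
}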

Wrapping a target command with retry() will keep building that target until there are no errors OR until it runs out of max_tries. You can also set the when argument of retry() to declare what error message should initiate a rebuild (the input to when can be a regular expression).
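
A sketch of what that could look like wrapped around a single target's command (the target name echoes ones used elsewhere in this issue; the error-message pattern and max_tries value are illustrative). Run install.packages('retry') first if you don't have the package:

tar_target(
  wi_data,
  retry::retry(
    get_site_data(oldest_active_sites, 'WI', parameter),
    when = 'data transfer failed',  # a regular expression matched against the error message
    max_tries = 30
  )
)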

Test

Commit

Comment once you've committed and pushed your changes to GitHub.


I'll respond when I see your comment.

elmeraa-usgs commented 2 years ago

Commenting to continue

github-learning-lab[bot] commented 2 years ago

You've just run a fully functioning pipeline with 212 unique branches (53 states x 4 steps)! Imagine if you had coded that with a for loop and just one of those states threw an error? :grimacing:

Now that you have your pipeline running with static branching, let's try to convert it into the other type of branching, dynamic.

:keyboard: Activity: Switch to dynamic branching

In our scenario, it doesn't matter too much whether we pick static or dynamic branching. Both can work for us. I chose to show you static first because inspecting and building pipelines with dynamic branching can be more mysterious. In dynamic branching, the naming given to each branch target is assigned randomly and dynamic branch targets do not appear in the diagram produced by tar_visnetwork(). But despite those inconveniences, dynamic branching is needed in many situations in order to build truly robust pipelines, so here we go ...

Convert to dynamic branching
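
In broad strokes, the conversion swaps the static branching definitions for a pattern = map() argument on the target itself. A minimal sketch, with names echoing the ones used elsewhere in this issue:

tar_target(
  nwis_data,
  get_site_data(oldest_active_sites, states, parameter),
  pattern = map(states)  # one dynamic branch per element of the states vector
)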

[Screenshot: tar_visnetwork() diagram of the converted pipeline, with the combine_obs_tallies function disconnected in the bottom left]

Do you see the function combine_obs_tallies in the bottom left that is disconnected from the pipeline? There are a few ways to move forward knowing that something is disconnected: 1) fixing it because it should be connected, 2) leaving it knowing that you will need it in the future, or 3) deleting it because it is no longer needed. We will do the third - go ahead and delete that function. It exists in 2_process/src/tally_site_obs.R. Re-run tar_visnetwork(). It should no longer appear.

Test

states <- c('AL','AZ','AR','CA','CO','CT','DE','DC','FL','GA','ID','IL','IN','IA',
            'KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH',
            'NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX',
            'UT','VT','VA','WA','WV','WI','WY','AK','HI','GU','PR')

Commit

Once you've committed and pushed your changes to GitHub, comment about some of the differences you notice when running this pipeline using the dynamic branching approach vs our original static branching approach. Include a screenshot of the result in your viewer from your last tar_visnetwork() showing your dynamic branching pipeline.


I'll respond when I see your comment.

elmeraa-usgs commented 2 years ago

[Screenshot: tar_visnetwork() of the dynamic-branching pipeline]

Noticed that timeseries_png and the other targets' branch names now include the site number.

elmeraa-usgs commented 2 years ago

Commenting again to continue on....

cnell-usgs commented 2 years ago

This is fun, right? A strong test of code is whether it can be transferred to solve a slightly different problem, so now let's try applying this pipeline to water temperature data instead of discharge data.

Background: Multiple git branches

I'm about to ask you to do a few tricky/interesting things here with respect to git. Let's acknowledge them even though we won't explore them fully in this course:

The above notes are really just intended to raise your awareness about complicated things you can do with git and GitHub. When you encounter needs or situations like these in real projects, just remember to think before acting, and feel free to ask questions of your teammates to make sure you get the results you intend.

:keyboard: Activity: Repurpose the pipeline
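
The heart of this activity is pointing the same pipeline at a different NWIS parameter code. A sketch, assuming your pipeline stores the code as a parameter target in _targets.R:

tar_target(parameter, '00010')  # water temperature; '00060' was discharge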

Test

When everything has run successfully, use a comment to share the images from timeseries_KY.png and data_coverage.png. Take a second and peruse the other timeseries_*.png files. Did you find anything surprising? Include any observations you want to share about the build.


I'll respond when I see your comment.

elmeraa-usgs commented 2 years ago

[Images: timeseries_KY.png and data_coverage.png]

DE had a faulty gage and I'm assuming a new one was installed around 2007/2008. FL also has a gap in measurements around 2010. Some states also show a lot more interdecadal variability like NY.

cnell-usgs commented 2 years ago

That temperature data build should have worked pretty darn smoothly, with fault tolerance for those data pulls with simulated unreliability, a rebuild of everything that needed a rebuild, and console messages to inform you of progress through the pipeline. Yay!

I'll just share a few additional thoughts and then we'll wrap up.

Ruminations and tricks

Orphaned task-step files: You might have been disappointed to note that there's still a timeseries_VT.png hanging out in the repository even though VT is now excluded from the inventory and the summary maps. Worse still, that file shows discharge data! There's no way to use targets to discover and remove such orphaned artifacts of previous builds, because the file is not connected to the new pipeline, so targets doesn't even know it exists. It's a weakness of these pipelines that you may want to bear in mind, though this particular weakness has never caused big problems for us. Still, to avoid or fix it, you could:

  1. After building everything, sort the contents of 3_visualize/out by datestamp and manually remove the files older than your switch to parameter='00010'.
  2. Before you ever went to build the temperature version, you could have deleted all the files in the out folders (sketched below). Then the new output files would get written into fresh, empty folders.
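
A minimal sketch of option 2, assuming this course's directory layout; run it before kicking off the new build:

# clear stale artifacts out of the figure folder before rebuilding
file.remove(list.files('3_visualize/out', full.names = TRUE))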

Forcing rebuilds of targets: One trick I thought I'd be sharing more of in this course is the use of tar_invalidate(), which forces a rebuild of specified targets. This can seem necessary when you know that there has been a change but the pipeline is not detecting it. We've used this forced-rebuild approach a lot in the past, but I can no longer think of a situation where it's truly necessary. The best use of tar_invalidate() is as a temporary bandaid or diagnostic tool rather than as a permanent part of your pipeline. Instead of forcing something to rebuild, you should determine the root cause of it being skipped and make sure the pipeline is appropriately set up.

The one example where it may really feel necessary is when you want to force a complete redo of downloads from a web service. You could use tar_invalidate() for this, but in these pipelines courses we've also introduced you to the idea of a dummy argument to data-downloading functions that you can manually edit to trigger a rebuild. This is especially handy if you use the current datetime as the contents of that dummy variable, because then you have a git-committed record of when you last pulled the data. In our example project you might want a dummy variable that's shared between the inventory data pull and the calls to get_site_data().
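
As a sketch, here's one way the dummy-argument trick could look; the dummy_date target and the extra dummy argument are hypothetical, and get_site_data() would simply ignore the value:

# edit this string whenever you want to force a fresh data pull
tar_target(dummy_date, '2023-01-01')

tar_target(
  nwis_data,
  get_site_data(oldest_active_sites, states, parameter, dummy = dummy_date),
  pattern = map(states)
)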

Fetching results from the targets database: We've already used these functions in this course, but I want to share them again here to help you remember. They're just so handy! To access the current value of a target from your pipeline, just call

tar_load('oldest_active_sites')

or for fetching a file target,

summary_state_timeseries <- readr::read_csv(tar_read('summary_state_timeseries_csv'))

The nice thing about these functions is that they don't take time to rebuild or even check the currentness of a target; they just load or pass the object to you.

Make a pull request

Phew, what a lot you've learned in this course! Let's get your work onto GitHub.


I'll respond when I see your PR.

cnell-usgs commented 2 years ago

Great, moving on!