DOI-USGS / scipiper

https://usgs-r.github.io/scipiper/index.html
Creative Commons Zero v1.0 Universal

how about a "what the heck, combine this unfinished stuff anyways" arg to loop_tasks? #139

Open jordansread opened 4 years ago

jordansread commented 4 years ago

Sometimes we use loop_tasks for the fault tolerance of brittle function calls, such as web services that may blip temporarily or downloads that get interrupted. But other times we use it because we aren't confident that all of the calls are valid. In the case of the latter, sometimes we want to complete the final combiner target(s) anyhow. Here is an example:

I may be running thousands of models, which I tie together in the end with combine_to_ind. This final target (a hash table yaml, in this case) is my way of tracking completion, and is much more reliable than calls to dir() and extract() to try to match file names or "completed within x days of" to grab a bunch of results.

So what do I do when one of these models fails because it can't complete and will never complete (e.g., a lake that dries up or has incomplete params)? A couple of options: (1) manually delete the target from the combiner step in the auto-created task_remakefile (bad, because this isn't reproducible); (2) pre-filter the target(s) that would fail out of the task_plan creation, or out of the args going into it (works in most cases, but requires stopping the pipeline and re-coding afterward).
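For concreteness, the pre-filtering workaround in option (2) might look something like the sketch below. All object names (`all_model_ids`, `known_bad_ids`, `model_step`) are illustrative, not from an actual pipeline, and the `create_task_plan()` arguments shown are assumptions about its interface:

```r
library(scipiper)

# Workaround (2), sketched: drop known-bad models before the task plan
# is created, so the combiner only ever sees targets that can complete.
good_ids <- setdiff(all_model_ids, known_bad_ids)

task_plan <- create_task_plan(
  task_names = good_ids,
  task_steps = list(model_step),  # assumed task step definition(s)
  final_steps = 'export',
  ind_dir = 'log'
)
```

The downside, as noted above, is that you have to stop the pipeline, learn which models fail, and then re-code this filter.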

Alternatively, sometimes we know some analyses aren't going to work or some models aren't going to run and we could use the task failures to filter them out so they wouldn't be included later (I think this happened with a small subset of metab models?). Then exported_model_ids is something that you snag with the combiner indicator file instead of something that goes into the task_plan...

Perhaps finish_greedy or something similar as a logical argument to loop_tasks? If using this, you'd probably set a small number of retries (maybe only 1).
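A sketch of how that proposed argument might be called (`finish_greedy` does not exist; this is only the suggested interface, and the surrounding object names are hypothetical):

```r
# Proposed (not implemented): finish_greedy = TRUE would let loop_tasks
# build the combiner target(s) from whichever tasks completed, even if
# some tasks failed on every try.
loop_tasks(
  task_plan = model_task_plan,     # hypothetical task plan object
  task_makefile = '123_models_tasks.yml',
  num_tries = 1,                   # few retries, as suggested above
  finish_greedy = TRUE             # proposed new argument
)
```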

jordansread commented 4 years ago

And to amend what I said above ☝️: I can't just edit the auto-created makefile to remove the target from the combiner; I also have to remove that target from the task_plan. (The manual edit works with scmake(), but not with loop_tasks, since loop_tasks uses the task_plan until all tasks are complete.)

aappling-usgs commented 4 years ago

That's a compelling use case - I've hit that too. I do have some questions about how to best satisfy this need.

  1. Do we want scipiper to think of the combiner targets as complete if they only include a subset of the tasks? If so, this would bring a new twist to the idea of reproducibility - some targets are expected to reproducibly fail, and downstream targets are "up to date" in the sense that some of their upstream targets have been intentionally bypassed. (I am open to considering this twist but want to think carefully about the consequences, because it feels like a big philosophical leap.)

  2. Do we only want this option to work once loop_tasks has fully completed, or would we like to be able to run the combiner even when we've only managed to make it through a first pass on some of the targets?

  3. Should the combiners be passed all of the targets (with either NA or NULL for the incomplete ones), or should they only be passed those targets that are complete? If we passed in all the targets, that would put the onus on the combiner functions to be able to handle incomplete targets gracefully...but in the case of wanting to identify which models failed, passing in the failed targets might be quite useful.

  4. When the use case is for figuring out which models have run and which failed, does scipiper::get_remake_status('my_combiner_target', 'xx_tasks.yml') already meet this need?
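For reference, a sketch of the approach in (4). The call itself is quoted from above; the column names used to pick out incomplete tasks (`target`, `is_current`) are assumptions about the returned data frame, not a documented contract:

```r
# Sketch: use the task makefile's status table to see which tasks
# completed and which did not.
status <- scipiper::get_remake_status('my_combiner_target', 'xx_tasks.yml')

# Assumed columns; inspect the actual data frame before relying on these.
failed_or_pending <- status$target[!status$is_current]
```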

  5. Would there be any value in implementing this as a separate function, named something like combine_complete or combine_tasks, that constructs the arguments for the [completed] targets only and passes them to the combiner[s]? I think making it a separate function would be most appropriate if our answer to (1) is "no, keep this outside the pipeline", our answer to (2) is "yes, even after a first pass on just some of the targets", and/or our answer to (4) is "no, we want to be able to combine completed targets in customized ways even when building outside the pipeline". But making it an option to loop_tasks is most appropriate if our main goal is to combine subsets of the tasks within the pipeline (i.e., our answer to (1) is "yes, the combination of an incomplete task set should be 'complete' in scipiper's eyes").

jordansread commented 4 years ago

Agreed. When I was thinking this through, I thought briefly about these issues, including the need for some kind of summary target that includes details on the skipped targets...

  1. *Do we want scipiper to think of the combiner targets as complete if they only include a subset of the tasks?* Yes, but see my comment above on the likely need to create some kind of "failed targets" branch in the workflow that would be retained even if ignored...

  2. *Do we only want this option to work once loop_tasks has fully completed?* I thought this would be done after loop_tasks has failed on all of its tries.

3-5. These questions need more thought and discussion than I can provide now, but I wanted to share my initial musings on the above since this issue isn't a simple one.

aappling-usgs commented 4 years ago
> 1. Do we want scipiper to think of the combiner targets as complete if they only include a subset of the tasks?
>
> yes, but see my comment above on the likely need to create some kind of "failed targets" branch in the workflow that would be retained even if ignored...

Sorry, I've just reread your text above a couple of times and still can't spot your comment on the likely need for a separate branch. Could you copy-paste it here?

> 2. Do we only want this option to work once loop_tasks has fully completed?
>
> I thought this would be done after loop_tasks has failed on all of its tries.

OK, good to know. I do see a use case for summarizing targets even before loop_tasks has attempted all of them - specifically when it's a struggle to get through the whole loop b/c the tasks are so numerous or so riddled with issues - but this use case is more along the lines of a get_remake_status need than a "create combiner to move on through the pipeline" need.

> 3-5. More questions need more thought and discussion than I can provide now, but wanted to share my initial musings on the above since this issue isn't a simple one.

Thanks for your initial thoughts.

jordansread commented 4 years ago

Whelp. I don't think I actually sent my comment on what to do with the failed targets, so there isn't anything to point up to. I didn't mean "branch" in the git sense, just in the target sense (e.g., you'd have one combined set of targets for "success" and another for "fails", so there is a bit of a "branch" there).
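One way to picture that target-level "branch" in a task remakefile. This is illustrative only: `combine_to_ind` is the usual success-side combiner mentioned above, while `summarize_failures` and all target/file names here are hypothetical:

```yml
# Illustrative sketch: a 'success' combiner alongside a 'fails' combiner
targets:
  successful_models.ind:
    command: combine_to_ind(target_name, 'model_1.ind', 'model_3.ind')
  failed_models.ind:
    # hypothetical function summarizing the targets that never completed
    command: summarize_failures(target_name, 'model_2.ind')
```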

aappling-usgs commented 4 years ago

Oh, interesting idea. Would you use the same combiner twice, or would there be a special combiner for failures?