dib-lab / 2020-workflows-paper

Strategies for leveraging workflow systems to streamline large-scale biological analyses
https://dib-lab.github.io/2020-workflows-paper

Address ctb comments in #25 #34

Closed bluegenes closed 4 years ago

bluegenes commented 4 years ago

Attempts to address the following, from #25:

  • [x] one sentence: highlight value of workflows supporting cloud execution now and in the future
  • [x] Integrate Titus's ggg298 "Why use a workflow manager", particularly the worry about steps invisibly failing
  • [x] re: Ryan convo: "your local HPC might be unprepared for sophisticated massive users (the Lisa effect)"
  • [x] a lot of bioinformatics is converting between formats and conveying info between tools. Maybe mention that "decisions are made at every level of compute, and even if workflows “just” cobble together other software in a simple way, there are lots of implicit assumptions made there."
  • [x] known knowns and known unknowns are possible to evaluate fairly rigorously, I think, because you have a good idea of what to look for. there’s still some guesswork involved because you have to focus in on a sensible range of parameters, and who knows what “sensible” means?
  • [x] unknown unknowns, like the unintended consequences of filtering for metadata and joins across different programs, are MUCH harder to track down and evaluate. well, and also the impact of bugs from the software you’re using as well as your own pipeline.
  • [x] "There’s a massive difference between production workflows that can be run at scale and that almost never fail without a useful error message, vs research workflows that are run on a dozen samples and can have edge cases. Often the edge cases in research pipelines clue you into where interesting stuff is, either technically tricky OR biologically weird."
  • [x] Conclusion, maybe discuss:
    • "I think there is a new breed of biologist/bioinformatician coming along (echoing something Mick Watson said on Twitter). These workflow-enabled biologists will become increasingly valuable as data set size and complexity increases, along with the associated tool chain. Very few people are training them (waves hi! :)"