Open charlesreid1 opened 6 years ago
dahak would accept URLs for read files, but if the files were already present locally on the machine, it would not use the URL or try to download them. This needs improvement.
Simply requiring the user to specify a URL would exclude users who want to run taco on local data, which is not sensible. So perhaps we could have an option to accept either a URL or an absolute path; if the data is not available at the absolute path, the workflow fails (i.e., a user who provides data locally must ensure it is available on the local machine where the Snakemake task will run).
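The URL-or-absolute-path rule above could look something like this minimal sketch. The function name `resolve_input` and its behavior are hypothetical, not part of dahak-taco; it only illustrates the proposed resolution order (existing local copy > download from URL > fail on a missing absolute path).

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

def resolve_input(spec, download_dir="."):
    """Resolve a read-file spec that may be a URL or an absolute path.

    Hypothetical helper sketching the proposed behavior; names are
    illustrative only.
    """
    parsed = urlparse(spec)
    if parsed.scheme in ("http", "https", "ftp"):
        dest = os.path.join(download_dir, os.path.basename(parsed.path))
        # Prefer an existing local copy over re-downloading.
        if not os.path.exists(dest):
            urlretrieve(spec, dest)
        return dest
    if os.path.isabs(spec):
        # Local data: the user must ensure it exists on the machine
        # where the Snakemake task runs, or the workflow fails.
        if not os.path.exists(spec):
            raise FileNotFoundError(
                "Local input not found: %s" % spec)
        return spec
    raise ValueError("Input must be a URL or an absolute path: %s" % spec)
```

Failing loudly on a relative path (rather than guessing a working directory) keeps behavior predictable when the task runs on a remote worker node.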
http://nih-data-commons.us/use-case-library/glossary/
Scientific objective: a description of a scientific process, told without reference to a specific technology solution. Focuses on resources required, method(s) followed, and outcome(s) targeted.
User epic: a story, told from the user's perspective, that captures a valuable set of interactions with tools (e.g. software solutions) in the service of achieving the scientific objective.
User story: a specific piece of an epic that translates to a well-scoped and defined piece of software or data engineering work.
Scientific objective:
User epic:
(already somewhat naturally broken up into user stories)
(Background) The researcher does not have access to institutional HPC or cluster resources, but they do have a credit card. Therefore they would like to use cloud computing platforms to trim their reads. The researcher can estimate roughly how much time, memory, or compute resources should be used, but is not entirely certain. Similarly, the researcher knows that they need multiple compute nodes but does not know about networks or cloud architecture.
(Background) The researcher has a complex workflow with multiple programs with complex language and library dependencies, so they would like to use conda, and would also prefer to use docker or singularity if possible.
(Once) The researcher downloads and installs dahak-taco on their local machine. dahak-taco will allow them to run workflows that live in repositories that have dahak-taco wrappers.
(Once) The researcher downloads and installs dahak-bespin on their local machine. dahak-bespin will allow them to use the AWS API to automatically allocate cloud cluster resources, which will allow them to run dahak workflows.
(Once per workflow) The researcher downloads the workflow repository. This workflow repository will contain a dahak-taco wrapper and will define a set of workflow preferences and configuration options. The user can set these or leave them blank to use default values. The user does this using a JSON or YAML file.
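The "set these or leave them blank to use defaults" behavior could be sketched as below. The parameter names (`quality_threshold`, `adapter_file`, `threads`) are invented for illustration; real keys would be defined by each workflow repository. JSON is used here since the stdlib supports it, but the same logic applies to YAML.

```python
import json

# Hypothetical default workflow parameters; real keys are defined by
# each dahak workflow repository. These names are illustrative only.
DEFAULTS = {
    "quality_threshold": 30,
    "adapter_file": None,
    "threads": 4,
}

def load_params(path=None):
    """Merge a user's JSON params file over the workflow defaults.

    Keys the user omits (or sets to null) fall back to the defaults.
    """
    params = dict(DEFAULTS)
    if path is not None:
        with open(path) as f:
            user = json.load(f)
        params.update({k: v for k, v in user.items() if v is not None})
    return params
```

Treating null as "use the default" lets the workflow repo ship a complete template file that users can fill in selectively.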
(Once per workflow) The researcher uses dahak-bespin to start up, manage, and shut down a workflow cluster in the cloud. (Think of this like a background service that you start/stop/restart, where the service is the cloud cluster.) dahak-bespin does not arbitrarily create cluster after cluster and leave them unmanaged and undestroyed; it creates, manages, or destroys only one cluster at a time. All nodes in the cluster are set up to accept workflow tasks from dahak-taco.
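The one-cluster-at-a-time invariant described above can be sketched as a small state machine. This `ClusterManager` class and its `start`/`stop` API are hypothetical, not bespin's actual interface; the AWS calls are stubbed out with comments.

```python
class ClusterManager:
    """Sketch of bespin's one-cluster-at-a-time behavior (hypothetical API).

    start/stop mirror a background service: a second start without an
    intervening stop is an error, so clusters are never left unmanaged.
    """

    def __init__(self):
        self.cluster = None

    def start(self, n_nodes):
        if self.cluster is not None:
            raise RuntimeError("A cluster is already running; stop it first.")
        # Real code would call the AWS API here to allocate nodes and
        # configure them to accept workflow tasks from dahak-taco.
        self.cluster = {"nodes": ["node%d" % i for i in range(n_nodes)]}
        return self.cluster

    def stop(self):
        if self.cluster is None:
            raise RuntimeError("No cluster to stop.")
        # Real code would tear down the AWS resources here.
        self.cluster = None
```

Refusing a second `start` (rather than silently spawning another cluster) is what prevents orphaned, billable cloud resources.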
POINT OF CLARIFICATION: running a workflow using multiple machines can take one of two forms:
(Once per workflow) When the user has configured their workflow and is ready to run it, they start up a workflow cluster in the cloud using dahak-bespin (e.g., edit a bespin config file, then run bespin start). This gives the user a cluster in the cloud that is ready to accept workflows from dahak-taco. (Note that this is set up to run exactly one dahak-taco workflow per machine. There is no further distribution of tasks across machines.)
(Many times) The user runs a dahak-taco workflow. Each workflow is given a particular machine/worker node to utilize. The workflow itself contains instructions for how to gather results at the end; Snakemake handles this when the workflow is run.
POINT OF CLARIFICATION: currently not sure of the mechanism for passing info about nodes from dahak-bespin to dahak-taco.
Relates to test-driven development: develop a user narrative for taco, and build the command line interface around that.
For example: I want to trim some reads. How do I create/use taco-read-filtering? Can I use a local config/params file? Do I need the config/params file in the repo?
What about data - what if my reads live locally? What if they're available in scratch space on the cluster? (If we target AWS as a platform, that simplifies things somewhat, since we're always assuming a green-field deployment and thus that read data is remote/in a bucket. But we're trying to keep taco platform agnostic.)