bioboxes / rfc

Request for comments on interchangeable bioinformatics containers
http://bioboxes.org
MIT License

Bioboxes should provide a simplified user interface for using containers #152

Closed pbelmann closed 9 years ago

pbelmann commented 9 years ago

We could make the usage of bioboxes easier by providing a wrapper script for every interface.

The basic workflow would be:

  1. The user provides parameters to the script.
  2. The script creates a YAML file and mounts the files and folders to the correct locations.
  3. Run the docker command.

The benefit of such a script is that the user would not have to create a YAML file or care about mounting files to the correct locations.
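A minimal sketch of the three steps above, in Python. The YAML layout, the /bbx/input and /bbx/output mount points, the /bbx/reads path, and the "default" task name are assumptions for illustration, not something agreed in this thread; the helper names are hypothetical.

```python
from pathlib import Path


def render_biobox_yaml(fastq_in_container):
    """Render a minimal biobox.yaml. The layout here is an assumption."""
    return (
        'version: "0.9.0"\n'
        "arguments:\n"
        "  - fastq:\n"
        '      - id: "reads"\n'
        f'        value: "{fastq_in_container}"\n'
        '        type: "paired"\n'
    )


def build_docker_command(image, fastq, out_dir, work_dir, task="default"):
    """Write biobox.yaml into work_dir and return the docker argv to run."""
    fastq = Path(fastq).resolve()
    out_dir = Path(out_dir).resolve()
    work_dir = Path(work_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    work_dir.mkdir(parents=True, exist_ok=True)

    # Step 2a: create the YAML, pointing at the path the container will see.
    container_fastq = f"/bbx/reads/{fastq.name}"  # hypothetical mount point
    (work_dir / "biobox.yaml").write_text(render_biobox_yaml(container_fastq))

    # Step 2b and 3: mount everything and assemble the docker command.
    return [
        "docker", "run", "--rm",
        "--volume", f"{work_dir}:/bbx/input:ro",
        "--volume", f"{fastq}:{container_fastq}:ro",
        "--volume", f"{out_dir}:/bbx/output:rw",
        image, task,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; keeping command construction separate from execution makes the wrapper testable without Docker installed.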

What do you think?

michaelbarton commented 9 years ago

I think this is a good idea. In many cases a user will have a single standard file type which they wish to pass to a tool. For the assembler this could be as simple as:

bbx-assemble --in FASTQ --out FASTA

michaelbarton commented 9 years ago

I've updated the title

michaelbarton commented 9 years ago

I have implemented a prototype build for the biobox command line interface. This is written in python as we discussed. The bioboxes/command-line-interface github repo contains the source code. The basic interface is:

biobox <biobox_type> <container> [options]

The biobox type determines what the remaining command line arguments should be. For example the short read assembler biobox uses the -i and -o arguments.

biobox short_read_assembler bioboxes/megahit -i reads.fq -o contigs.fa
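
One way the "type decides the remaining arguments" idea could be sketched with argparse is a two-stage parse: first read the type and container, then build a second parser from a per-type table. The `TYPE_OPTIONS` table, the long option names, and the `parse` function are hypothetical and not taken from the biobox_cli source.

```python
import argparse

# Hypothetical table: biobox type -> (short flag, long flag, help text).
TYPE_OPTIONS = {
    "short_read_assembler": [
        ("-i", "--input", "FASTQ file of reads"),
        ("-o", "--output", "FASTA file to write contigs to"),
    ],
}


def parse(argv):
    """Return (biobox_type, container, options) for a biobox command line."""
    # Stage 1: only the two positionals; everything else is left in `rest`.
    top = argparse.ArgumentParser(prog="biobox")
    top.add_argument("biobox_type", choices=sorted(TYPE_OPTIONS))
    top.add_argument("container")
    args, rest = top.parse_known_args(argv)

    # Stage 2: the biobox type determines which options are valid.
    sub = argparse.ArgumentParser(prog=f"biobox {args.biobox_type}")
    for short, long_name, help_text in TYPE_OPTIONS[args.biobox_type]:
        sub.add_argument(short, long_name, required=True, help=help_text)
    options = vars(sub.parse_args(rest))
    return args.biobox_type, args.container, options
```

Adding a new biobox type under this design means adding one entry to the table rather than writing a new tool.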
michaelbarton commented 9 years ago

@pbelmann I believe we could consolidate the existing validation tools into this command line interface. We provide something along the lines of:

biobox short_read_assembler bioboxes/megahit --validate

This would run all the feature tests we have for a given biobox type. This would simplify validating new container types as we would not need to create a new tool for each type. I believe it would make creating biobox images easier for developers too. What do you think?

michaelbarton commented 9 years ago

In addition, we reviewers could use this tool when reviewing biobox images to make sure they are implemented as expected.

michaelbarton commented 9 years ago

This is available on pypi now.

@pbelmann could you give this a try:

pip install --user biobox_cli

fungs commented 9 years ago

I would again like to propose a syntax which is similar to docker, git etc.

biobox run --engine docker --container bioboxes/velvet --specification biobox.yaml --arguments -i FASTQ -o CONTIGS
biobox validate --engine docker --container bioboxes/velvet --specification biobox.yaml

or in short with docker being the default

biobox run bioboxes/velvet -i FASTQ -o CONTIGS
biobox validate bioboxes/velvet

or using URI-style syntax instead

biobox run --container docker://bioboxes/velvet --specification biobox.yaml --arguments -i FASTQ -o CONTIGS
biobox validate --container docker://bioboxes/velvet --specification biobox.yaml

with the same short notation when docker is the default engine/backend. Supporting both versions would also be possible.
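
A rough sketch of how the git-like modes and the URI-style container reference could coexist, using argparse subparsers and `urllib.parse.urlsplit`. For brevity it takes the container as a positional (the short form above) rather than via `--container`; the function names are hypothetical, and arguments after the container are passed through untouched for the tool.

```python
import argparse
from urllib.parse import urlsplit


def split_container(ref, default_engine="docker"):
    """Split 'docker://bioboxes/velvet' into ('docker', 'bioboxes/velvet')."""
    if "://" in ref:
        parts = urlsplit(ref)
        return parts.scheme, parts.netloc + parts.path
    return default_engine, ref


def build_parser():
    parser = argparse.ArgumentParser(prog="biobox")
    modes = parser.add_subparsers(dest="mode", required=True)
    # Each mode gets its own parser, so modes can evolve independently.
    for name in ("run", "validate"):
        mode = modes.add_parser(name)
        mode.add_argument("container")
        mode.add_argument("--engine", default=None)
        mode.add_argument("--specification", default=None)
    return parser


def parse(argv):
    """Return (mode, engine, image, tool_args)."""
    args, tool_args = build_parser().parse_known_args(argv)
    engine, image = split_container(args.container)
    # An explicit --engine flag wins over the URI scheme or the default.
    return args.mode, args.engine or engine, image, tool_args
```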

michaelbarton commented 9 years ago

Backend

My preference at the moment is that supporting different engines is not a priority, as Docker is currently the only one in mainstream use. If runC becomes popular, then I agree we should support different backends. I would favour the --engine syntax, with Docker as the default value and the flag only required when using another engine. This is what I believe you suggest in your second example.

biobox run bioboxes/velvet -i FASTQ -o CONTIGS

I believe we need to scope the command by the type of biobox being used. Otherwise I think it will require more work to induce the type of biobox, and therefore the command line arguments, from the name of the image. For example:

biobox short_read_assembler bioboxes/velvet -i FASTQ -o CONTIGS

Validation/verification

biobox validate --engine docker --container bioboxes/velvet --specification biobox.yaml

I should clarify that there are two places where we use the term 'validate' in the bioboxes project, and I think this has led to confusion. These are:

Since this term is overloaded, I suggest we start using verify for the second point, and continue to use validate for the first. I welcome better suggestions though.

I believe that the biobox CLI should include the functionality to verify that an image follows the RFC, since we are already essentially doing this with multiple different tools: the short-read-assembler-validator, binning-validator, and so forth. My suggested syntax is:

biobox short_read_assembler bioboxes/velvet --verify
biobox binning bioboxes/metabat --verify

Gig77 commented 9 years ago

I agree with @fungs wrt the git-like syntax. In addition, I am favoring the proposed solution where biobox types, if not specified, are automatically inferred from biobox names.

This could for example be accomplished by having a global registry that assigns bioboxes to biobox types. I have not been following the discussion lately, so I don't know if such a registry is planned anyways (which would make sense in order to find bioboxes in the first place).

Another advantage of not specifying biobox types when running a biobox is that associations of bioboxes with biobox types could be changed without breaking user's code (as long as the parameter interface remains stable of course).

michaelbarton commented 9 years ago

I agree with @fungs and would favor a solution where biobox types, if not specified, are automatically inferred from biobox names.

This could for example be accomplished by having a global registry that assigns bioboxes to biobox types. I have not been following the discussion lately, so I don't know if such a registry is planned anyways (which would make sense in order to find bioboxes in the first place).

I am less keen on this solution because it requires making a web request to the registry each time the command line interface is called. To me it seems simpler to ask the user to specify the type on the command line.

Another advantage of not specifying biobox types when running a biobox is that associations of bioboxes with biobox types could be changed without breaking user's code (as long as the parameter interface remains stable of course).

Would an example of this be that the bioboxes/velvet image was changed from being an assembler to another function such as a short read aligner? If so, I think this would be quite confusing for a user if we supported this happening.

Presently I believe the priority is to get people using bioboxes. I think the best feedback will be from people using bioboxes and the CLI in production, and then we'll be able to prioritise new features.

Feedback and PRs are always welcome, however at least in my case I have a finite amount of time to contribute to bioboxes and it's not always possible to act on each suggestion.

Gig77 commented 9 years ago

To me it seems simpler to ask the user to specify the type on the command line.

I guess my point is that many users might not care about biobox types and just want to use an available biobox as simply and quickly as possible. So from that point of view it would make sense to hide this layer of complexity. But I am not sure how to best accomplish this (online and/or offline registry?) and of course I understand if this is currently not a priority. However, simplicity and ease of use can certainly be a strong motivation for many users to get started with bioboxes in the first place.
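
As a sketch of the offline variant only: the type could be looked up in a local mapping, with an explicitly given type always taking precedence so that no web request is needed. The mapping format and the `resolve_type` name are hypothetical.

```python
def resolve_type(image, explicit_type=None, registry=None):
    """Pick the biobox type for an image.

    An explicitly given type wins; otherwise fall back to a local
    (offline) mapping of image name -> biobox type.
    """
    if explicit_type:
        return explicit_type
    registry = registry or {}
    if image in registry:
        return registry[image]
    raise LookupError(
        f"No biobox type known for {image!r}; please specify one explicitly."
    )
```

Because the lookup is a plain dictionary, the same function would work whether the mapping was shipped with the CLI or cached from an online registry.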

Would an example of this be that the bioboxes/velvet image was changed from being an assembler to another function such as a short read aligner?

I was rather thinking about e.g. renaming it from "assembler" to "short-read assembler" or vice versa. Biobox types are likely very much in flux, especially in the early days when the community has to decide what kind of biobox types are even out there. So if the association biobox <--> biobox type could somehow happen dynamically in the background it could be easier building up and maintaining a large biobox repository over time.

If so, I think this would be quite confusing for a user if we supported this happening.

I think changing the biobox type is only confusing if it plays a prominent role in a user's usecase, which in many cases it might not (as pointed out in the first paragraph).

Presently I believe the priority is to get people using bioboxes. I think the best feedback will be from people using bioboxes and the CLI in production, and then we'll be able to prioritise new features.

Totally agree.

fungs commented 9 years ago
  1. Command line syntax is just a matter of style, but the git/docker syntax makes it easy to implement the different modes independently of each other, each having its own semantics on the CLI.
  2. I would always think of biobox names as an alias to the corresponding specification (YAML), so it is equivalent to pass a YAML file or a name. For now, the easiest solution I can think of is to have a config folder, e.g. $HOME/.bioboxes/spec/ or so, where you can drop the spec as short-read-assembler.yaml and it will automatically be read. This is also the most flexible solution I can think of.
  3. Command line arguments for the corresponding spec should either be automatically inferred (if possible), or there should be a second sister (yaml) file which specifies each command line parameter. Both can be done by priority, if auto-inferred options look too complicated. Another way would be to specify the option name right in the original spec but this would add non-essential information.
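
The name-as-alias idea in point 2 could look roughly like this. The folder is the one suggested above; the `resolve_spec` function and its `spec_dir` override are hypothetical.

```python
from pathlib import Path


def resolve_spec(name_or_path, spec_dir=None):
    """Treat a biobox name as an alias for a spec file in the config folder.

    A path to an existing YAML file is returned as-is; anything else is
    looked up as <spec_dir>/<name>.yaml.
    """
    candidate = Path(name_or_path)
    if candidate.suffix in (".yaml", ".yml") and candidate.is_file():
        return candidate

    # Default location suggested in the discussion: $HOME/.bioboxes/spec/
    folder = Path(spec_dir) if spec_dir else Path.home() / ".bioboxes" / "spec"
    aliased = folder / f"{name_or_path}.yaml"
    if aliased.is_file():
        return aliased
    raise FileNotFoundError(f"No spec found for {name_or_path!r} in {folder}")
```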
michaelbarton commented 9 years ago

I guess my point is that many users might not care about biobox types and just want to use an available biobox as simply and quickly as possible. So from that point of view it would make sense to hide this layer of complexity.

I agree, if the parameters could be inferred from the biobox then it could make the user interface simpler. I am just slightly hesitant about supporting this because, as you mention, it would require connecting to a registry. I would worry about opening a can of worms in terms of user support.

However, simplicity and ease of use can certainly be a strong motivation for many users to get started with bioboxes in the first place.

I totally agree. The issue for me is that we don't have a large team of people working on bioboxes to implement all the features we would like. The number of already open issues on GitHub is an example.

I was rather thinking about e.g. renaming it from "assembler" to "short-read assembler" or vice versa. Biobox types are likely very much in flux, especially in the early days when the community has to decide what kind of biobox types are even out there. So if the association biobox <--> biobox type could somehow happen dynamically in the background it could be easier building up and maintaining a large biobox repository over time.

Yes this is true. We do have version numbers in the biobox.yaml to enable us to backwards-support changing interfaces. As you say though, the biobox name is essentially fixed.

michaelbarton commented 9 years ago

Command line syntax is just a matter of style, but the git/docker syntax makes it easy to implement the different modes independently of each other, each having its own semantics on the CLI.

I agree. I will aim to implement this on the already existing 0.1.0 release branch of the command line interface.

Command line arguments for the corresponding spec should either be automatically inferred (if possible), or there should be a second sister (yaml) file which specifies each command line parameter. Both can be done by priority, if auto-inferred options look too complicated. Another way would be to specify the option name right in the original spec but this would add non-essential information.

My opinion is that the command line interface should not support the entire specification for a biobox; instead it should be a much simplified interface for 90% of use cases. For example, in genome assembly most assemblies are from a single FASTQ file. Other less common cases, such as multiple FASTQ files with different insert sizes, should be supported by the user writing their own biobox.yaml file, which the CLI can alternatively accept.

fungs commented 9 years ago

My opinion for the command line interface is that it should not support the entire specification for a biobox, instead this should be a much simplified interface for 90% of use cases. For example in genome assembly most assemblies are from a single FASTQ file. Other less common cases such as multiple FASTQ files with different insert sizes should be supported by the user writing their own biobox.yaml file which the CLI can alternatively accept.

Let's see how far we get without compromising simplicity. I'm just very much against hand-coding parameters for each biobox type as python code. These should be placed in another config or together in the original spec YAML file as optional tags/variables. Having them together in one file makes it easy for users to quickly create their own specs with their own parameters in a few minutes but splitting them in separate files would clearly separate the essential from the non-essential information. I'm totally for hand-crafting nice parameter names for simple applications but only if there is a priority fallback method to handle the other cases and only if it can be done outside of the CLI core code. I will continue to work on these features unless someone has strong objections.

None of these suggested features would disallow the user to pass their own YAML file. It would even be possible to nicely mix these concepts, e.g. fill out common values in the YAML file and set the remaining ones with the CLI.
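
Mixing the two sources could be as small as the following sketch, where values given on the command line override those in the YAML file and unset options (None) leave the file's values alone. The function names and the flat key/value shape are assumptions for illustration.

```python
def merge_arguments(spec_values, cli_values):
    """Combine values from a user's YAML file with command line overrides."""
    merged = dict(spec_values)
    # Options the user did not pass arrive as None and must not clobber
    # values that were filled out in the YAML file.
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged


def missing_arguments(required, merged):
    """List required names that neither source supplied."""
    return [name for name in required if merged.get(name) is None]
```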

michaelbarton commented 9 years ago

I will close this issue as we have implemented this in the bioboxes/command-line-interface repository, which contains further issues and discussions.