Closed pbelmann closed 9 years ago
I think this is a good idea. In many cases a user will have a single standard file type which they wish to pass to a tool. For the assembler this could be as simple as:
bbx-assemble --in FASTQ --out FASTA
I've updated the title
I have implemented a prototype build for the biobox command line interface. This is written in Python as we discussed. The bioboxes/command-line-interface GitHub repo contains the source code. The basic interface is:
biobox <biobox_type> <container> [options]
The biobox type determines what the remaining command line arguments should be. For example the short read assembler biobox uses the -i and -o arguments.
biobox short_read_assembler bioboxes/megahit -i reads.fq -o contigs.fa
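As a rough illustration of how the biobox type can determine the remaining arguments, here is a minimal sketch using Python's standard `argparse` subparsers. It is not taken from the bioboxes/command-line-interface source; the function name `build_parser` and the long option names `--input` and `--output` are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser in which each biobox type is its own subcommand."""
    parser = argparse.ArgumentParser(prog="biobox")
    subparsers = parser.add_subparsers(dest="biobox_type", required=True)

    # Only one type is registered here for illustration; the long option
    # names are hypothetical, the thread only mentions -i and -o.
    assembler = subparsers.add_parser("short_read_assembler")
    assembler.add_argument("container", help="image name, e.g. bioboxes/megahit")
    assembler.add_argument("-i", "--input", required=True, help="FASTQ reads")
    assembler.add_argument("-o", "--output", required=True, help="contigs file")
    return parser


# Parse the example invocation from the comment above.
args = build_parser().parse_args(
    ["short_read_assembler", "bioboxes/megahit", "-i", "reads.fq", "-o", "contigs.fa"]
)
print(args.biobox_type, args.container, args.input, args.output)
```

Each additional biobox type would register its own subparser, so an unknown type or a missing argument is rejected before any container is started.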
@pbelmann I believe we could consolidate the existing validation tools into this command line interface. We provide something along the lines of:
biobox short_read_assembler bioboxes/megahit --validate
This would run all the feature tests we have for a given biobox type. This would simplify validating new container types as we would not need to create a new tool for each type. I believe it would make creating biobox images easier for developers too. What do you think?
In addition we reviewers could use this tool when reviewing biobox images to make sure they are implemented as expected.
This is available on PyPI now.
@pbelmann could you give this a try:
pip install --user biobox_cli
I would again like to propose a syntax which is similar to docker, git etc.
biobox run --engine docker --container bioboxes/velvet --specification biobox.yaml --arguments -i FASTQ -o CONTIGS
biobox validate --engine docker --container bioboxes/velvet --specification biobox.yaml
or in short with docker being the default
biobox run bioboxes/velvet -i FASTQ -o CONTIGS
biobox validate bioboxes/velvet
or using URI-style syntax instead
biobox run --container docker://bioboxes/velvet --specification biobox.yaml --arguments -i FASTQ -o CONTIGS
biobox validate --container docker://bioboxes/velvet --specification biobox.yaml
with the same short notation when docker is the default engine/backend. Supporting both versions would also be possible.
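To make the URI-style option concrete, here is a small sketch of how a container reference could be split into an engine and an image name, falling back to Docker when no engine prefix is given. The function `parse_container` is hypothetical and only illustrates the proposal; it is not part of any existing tool.

```python
def parse_container(reference, default_engine="docker"):
    """Split 'engine://image' into (engine, image).

    A bare image name such as 'bioboxes/velvet' uses the default engine.
    """
    engine, separator, image = reference.partition("://")
    if not separator:
        # No '://' present, so the whole string is the image name.
        return default_engine, reference
    if not engine or not image:
        raise ValueError(f"malformed container reference: {reference!r}")
    return engine, image


print(parse_container("docker://bioboxes/velvet"))
print(parse_container("bioboxes/velvet"))
```

With this approach the long form and the short form resolve to the same pair, which is what would allow both notations to be supported side by side.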
My preference at the moment is that supporting different engines is not a priority, as Docker is currently the only one in mainstream use. If runC becomes popular, then I agree we should support different backends. I would favour the --engine syntax, with Docker as the default value, so that the flag is only required when using another engine. This is what I believe you suggest in your second example.
biobox run bioboxes/velvet -i FASTQ -o CONTIGS
I believe we need to scope the command by the type of biobox being used. Otherwise I think it will require more work to infer the type of biobox, and therefore the command line arguments, from the name of the image. For example:
biobox short_read_assembler bioboxes/velvet -i FASTQ -o CONTIGS
biobox validate --engine docker --container bioboxes/velvet --specification biobox.yaml
I should clarify there are two places where we use the term 'validate' in the bioboxes project and I think this has led to confusion. These are:
Since this term is overloaded, I suggest we start using verify for the second point, and continue to use validate for the first. I welcome better suggestions though.
I believe that the biobox CLI should include the functionality to verify that an image follows the RFC, since we are already essentially doing this with multiple different tools: the short-read-assembler-validator, binning-validator, and so forth. My suggested syntax is:
biobox short_read_assembler bioboxes/velvet --verify
biobox binning bioboxes/metabat --verify
I agree with @fungs wrt the git-like syntax. In addition, I am favoring the proposed solution where biobox types, if not specified, are automatically inferred from biobox names.
This could for example be accomplished by having a global registry that assigns bioboxes to biobox types. I have not been following the discussion lately, so I don't know if such a registry is planned anyways (which would make sense in order to find bioboxes in the first place).
Another advantage of not specifying biobox types when running a biobox is that associations of bioboxes with biobox types could be changed without breaking user's code (as long as the parameter interface remains stable of course).
I agree with @fungs and would favor a solution where biobox types, if not specified, are automatically inferred from biobox names.
This could for example be accomplished by having a global registry that assigns bioboxes to biobox types. I have not been following the discussion lately, so I don't know if such a registry is planned anyways (which would make sense in order to find bioboxes in the first place).
I am less keen on this solution because it requires making a web request to the registry each time the command line interface is called. To me it seems simpler to ask the user to specify the type on the command line.
Another advantage of not specifying biobox types when running a biobox is that associations of bioboxes with biobox types could be changed without breaking user's code (as long as the parameter interface remains stable of course).
Would an example of this be that the bioboxes/velvet image was changed from being an assembler to another function such as a short read aligner? If so I think this would be quite confusing for a user if we supported this happening.
Presently I believe the priority is to get people using bioboxes. I think the best feedback will be from people using bioboxes and the CLI in production, and then we'll be able to prioritise new features.
Feedback and PRs are always welcome, however at least in my case I have a finite amount of time to contribute to bioboxes and it's not always possible to act on each suggestion.
To me it seems simpler to ask the user to specify the type on the command line.
I guess my point is that many users might not care about biobox types and just want to use an available biobox as simply and quickly as possible. So from that point of view it would make sense to hide this layer of complexity. But I am not sure how to best accomplish this (online and/or offline registry?) and of course I understand if this is currently not a priority. However, simplicity and ease of use can certainly be a strong motivation for many users to get started with bioboxes in the first place.
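One way to picture the offline variant of this idea is a local lookup table that is consulted only when the user does not name the type. The sketch below is an assumption for illustration: `LOCAL_REGISTRY` and `resolve_type` are hypothetical names, and the entries simply reuse image and type pairs mentioned earlier in this thread.

```python
# Hypothetical offline registry mapping image names to biobox types.
LOCAL_REGISTRY = {
    "bioboxes/velvet": "short_read_assembler",
    "bioboxes/megahit": "short_read_assembler",
    "bioboxes/metabat": "binning",
}


def resolve_type(image, explicit_type=None, registry=LOCAL_REGISTRY):
    """Return the biobox type, preferring one given on the command line."""
    if explicit_type is not None:
        return explicit_type
    try:
        return registry[image]
    except KeyError:
        raise LookupError(
            f"cannot infer the biobox type of {image!r}; please specify it"
        ) from None


print(resolve_type("bioboxes/metabat"))
print(resolve_type("example/unknown", explicit_type="binning"))
```

Because an explicit type always wins, no network request is needed on each call, and images missing from the table still work when the type is given.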
Would an example of this be that the bioboxes/velvet image was changed from being an assembler to another function such as a short read aligner?
I was rather thinking about e.g. renaming it from "assembler" to "short-read assembler" or vice versa. Biobox types are likely very much in flux, especially in the early days when the community has to decide what kind of biobox types are even out there. So if the association biobox <--> biobox type could somehow happen dynamically in the background it could be easier building up and maintaining a large biobox repository over time.
If so I think this would be quite confusing for a user if we supported this happening.
I think changing the biobox type is only confusing if it plays a prominent role in a user's use case, which in many cases it might not (as pointed out in the first paragraph).
Presently I believe the priority is to get people using bioboxes. I think the best feedback will be from people using bioboxes and the CLI in production, and then we'll be able to prioritise new features.
Totally agree.
I guess my point is that many users might not care about biobox types and just want to use an available biobox as simply and quickly as possible. So from that point of view it would make sense to hide this layer of complexity.
I agree, if the parameters could be inferred from the biobox then it could make the user interface simpler. I am just slightly hesitant about supporting this because, as you mention, it would require connecting to a registry. I would worry about opening a can of worms in terms of user support.
However, simplicity and ease of use can certainly be a strong motivation for many users to get started with bioboxes in the first place.
I totally agree; the issue for me is that we don't have a large team of people working on bioboxes to implement all the features we would like. The number of already open issues on GitHub is an example.
I was rather thinking about e.g. renaming it from "assembler" to "short-read assembler" or vice versa. Biobox types are likely very much in flux, especially in the early days when the community has to decide what kind of biobox types are even out there. So if the association biobox <--> biobox type could somehow happen dynamically in the background it could be easier building up and maintaining a large biobox repository over time.
Yes this is true. We do have version numbers in the biobox.yaml to enable us to backwards-support changing interfaces. As you say though, the biobox name is essentially fixed.
Command line syntax is just a matter of style, but the git/docker syntax makes it easy to implement the different modes independently of each other, each having its own semantics on the CLI.
I agree. I will aim to implement this on the already existing 0.1.0 release branch of the command line interfaces.
Command line arguments for the corresponding spec should either be automatically inferred (if possible), or there should be a second sister (yaml) file which specifies each command line parameter. Both can be done by priority, if auto-inferred options look too complicated. Another way would be to specify the option name right in the original spec but this would add non-essential information.
My opinion for the command line interface is that it should not support the entire specification for a biobox, instead this should be a much simplified interface for 90% of use cases. For example in genome assembly most assemblies are from a single FASTQ file. Other less common cases such as multiple FASTQ files with different insert sizes should be supported by the user writing their own biobox.yaml file which the CLI can alternatively accept.
Let's see how far we get without compromising simplicity. I'm just very much against hand-coding parameters for each biobox type as python code. These should be placed in another config or together in the original spec YAML file as optional tags/variables. Having them together in one file makes it easy for users to quickly create their own specs with their own parameters in a few minutes but splitting them in separate files would clearly separate the essential from the non-essential information. I'm totally for hand-crafting nice parameter names for simple applications but only if there is a priority fallback method to handle the other cases and only if it can be done outside of the CLI core code. I will continue to work on these features unless someone has strong objections.
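To illustrate keeping parameter names out of the CLI core code, here is a minimal sketch in which the options for each biobox type are plain data and the parser is generated from them. The layout of `PARAMETER_SPECS` is a made-up assumption; in practice the same structure could be loaded from a YAML file rather than defined inline.

```python
import argparse

# Hypothetical declarative description of the options for each biobox type.
PARAMETER_SPECS = {
    "short_read_assembler": [
        {"flags": ["-i", "--input"], "help": "FASTQ file of reads"},
        {"flags": ["-o", "--output"], "help": "destination for contigs"},
    ],
}


def parser_for(biobox_type, specs=PARAMETER_SPECS):
    """Generate an argument parser from the declared options of one type."""
    parser = argparse.ArgumentParser(prog=f"biobox {biobox_type}")
    parser.add_argument("container")
    for option in specs[biobox_type]:
        parser.add_argument(*option["flags"], help=option["help"], required=True)
    return parser


namespace = parser_for("short_read_assembler").parse_args(
    ["bioboxes/velvet", "-i", "FASTQ", "-o", "CONTIGS"]
)
print(namespace.container, namespace.input, namespace.output)
```

Adding a new biobox type then means adding an entry to the data, not writing new Python code.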
None of these suggested features would disallow the user to pass their own YAML file. It would even be possible to nicely mix these concepts, e.g. fill out common values in the YAML file and set the remaining ones with the CLI.
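Mixing the two sources could be as simple as letting values given on the command line override those read from the YAML file. The sketch below assumes both have already been parsed into flat dictionaries; `merge_arguments` and the key names are hypothetical.

```python
def merge_arguments(from_yaml, from_cli):
    """Combine YAML values with CLI values; CLI values take priority.

    Options that were not given on the command line are expected to be
    None and are ignored, so the YAML value is kept for them.
    """
    merged = dict(from_yaml)
    merged.update({key: value for key, value in from_cli.items() if value is not None})
    return merged


yaml_values = {"input": "reads.fq", "output": "contigs.fa"}
cli_values = {"input": None, "output": "other_contigs.fa"}
print(merge_arguments(yaml_values, cli_values))
```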
I will close this issue as we have implemented this in the bioboxes/command-line-interface repository, which contains more issues and discussions.
We could make the usage of bioboxes easier by providing a wrapper script for every interface.
The basic workflow would be:
The benefit of such a script is that the user would not have to create a YAML file and would not have to care about mounting files to the correct locations.
What do you think?
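As a sketch of what such a wrapper might take care of, the function below only assembles a `docker run` command with the volume mounts filled in. The mount points `/bbx/input` and `/bbx/output` and the task name `default` are assumptions for the example and should be checked against the biobox RFC; `docker_command` is a hypothetical helper, and the command is built but not executed.

```python
import os


def docker_command(image, input_dir, output_dir, task="default"):
    """Build a docker invocation that mounts the input and output directories.

    The container-side paths and the task name are assumptions made for
    this illustration.
    """
    return [
        "docker", "run", "--rm",
        f"--volume={os.path.abspath(input_dir)}:/bbx/input:ro",
        f"--volume={os.path.abspath(output_dir)}:/bbx/output:rw",
        image,
        task,
    ]


print(" ".join(docker_command("bioboxes/velvet", "input_data", "output_data")))
```

A real wrapper would also write the biobox.yaml into the input directory before running the command, which is the step that spares the user from creating it by hand.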