bioboxes / rfc

Request for comments on interchangeable bioinformatics containers
http://bioboxes.org
MIT License

Port biobox interfaces to CWL abstract interfaces #132

Open avilella opened 9 years ago

avilella commented 9 years ago

As discussed in the CWL conference call today, 28 April 2015, now is probably a good time to work through an example of taking an existing bioboxes container and running it within CWL.

For example, https://github.com/bioboxes/velvet has produced a docker image here: https://registry.hub.docker.com/u/bioboxes/velvet/

I would ask @ntijanic if he could show, as an example, how CWL could use this container.

Mentioning @tetron here as well: most of what we discussed in the conference call about input folder, output folders, etc. is happening in the ad-hoc assemble script for this velvet container, which I guess would happen differently in CWL.

Thank you guys for your efforts

ghost commented 9 years ago

Hi, all.

First, to see if I understand what's happening with the image:

  • The tool is adapted via bash script to take arguments in the form specified with JSON-Schema in the YAML file.
  • Arguments are an array of two objects, each of which has an array of objects with id and value, nested under the appropriate key (fastq or fragment).
  • The tool produces its output file in a temp directory, which is then copied to /bbx/output/bbx.
  • Any other bioboxes short-read assembler box will work with the same interface.

What can be done to describe these in CWL:

  • Determine what the interface would look like in CWL. Likely just an array of file + (optional) fragment size pairs.
  • Create a template CWL file that maps its input to bbx format and serializes it to a file before calling the process. It should be possible to use the same template for all bbx short-read assemblers.
  • Create a script that takes a template CWL file, a bbx image and some tool metadata (which might be in/on the docker image?) and produces a runnable CWL file for that tool.

The above process should work for all bbx interfaces; we just need a CWL template per interface.

A few possible issues:

  • CWL does not prescribe an output folder path; tools are expected to produce files in the cwd. It seems bioboxes store all output in /bbx/output/bbx? Not sure how to work around this except by adding a feature to CWL to allow specifying where the output folder should be mounted.
  • The most common way to specify outputs in CWL is a glob pattern. This should be fine as long as bioboxes images produce consistent extensions on outputs. If they don't, the CWL equivalents would only be able to have a single output port.
  • There's no way to specify that the docker entry point should be used as the base command in CWL (since we don't want to limit CWL to docker). The easiest way around this is to read the entry point from the converter script.

If this makes sense, I'll write up the template and converter script.
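To make the argument structure concrete, it can be sketched in Python: an array of two objects, each nesting a list of id/value entries under its key. The field names follow the velvet biobox.yaml, but the ids and values here are made up for illustration, and the serialization is shown as JSON for simplicity (the real interface file is YAML).

```python
import json

# Illustrative bioboxes assembler input: two objects in "arguments",
# each holding a list of id/value entries nested under its key
# ("fastq" or "fragment"). Ids and values are invented for this sketch.
biobox_input = {
    "arguments": [
        {"fastq": [{"id": "pe_reads", "value": "/bbx/input/reads.fq.gz"}]},
        {"fragment": [{"id": "pe_reads", "value": 240}]},
    ]
}

# The bash adapter inside the image would consume a serialized form
# of this structure (YAML in practice; JSON shown here).
serialized = json.dumps(biobox_input, indent=2)
round_tripped = json.loads(serialized)
```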

avilella commented 9 years ago

Thanks @ntijanic for your comments.

If you could write the template and converter script you mention, that would be great. I think working by example will clarify things for everybody.

When you say that tools are expected to produce files in the cwd: what would make bioboxes easier for CWL to consume? I am sure we can find ways to tweak how bioboxes are defined so that calling them from CWL is more intuitive.

Cheers


ghost commented 9 years ago

I wrote up a draft of what it could look like at https://gist.github.com/ntijanic/b0a75fe0a82639541278

The idea is that one would convert velvet by running:

./bbx2cwl.py cwl-bbx-assembly-template.yaml \
  https://registry.hub.docker.com/u/bioboxes/velvet/ \
  assemble Velvet > velvet.yaml

It's not actually runnable since some features are missing from cwl (and I didn't test any of it). Will come back and make it work when I find the time (likely during weekend).
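As a minimal sketch of what such a converter could do, assuming nothing about the gist beyond the command line above: fill a per-interface CWL template with the image, entry point, and label for one tool. Plain string templating is used here; the real script presumably does considerably more.

```python
import string

# Hypothetical bbx-to-CWL converter in the spirit of bbx2cwl.py
# (the actual gist may work quite differently). It renders a
# per-interface CWL template for one biobox image.
TEMPLATE = string.Template("""\
class: CommandLineTool
label: $label
hints:
  DockerRequirement:
    dockerPull: $image
baseCommand: $entrypoint
""")

def bbx2cwl(template, image, entrypoint, label):
    """Render a runnable CWL description for one biobox image."""
    return template.substitute(image=image, entrypoint=entrypoint, label=label)

velvet_cwl = bbx2cwl(TEMPLATE, "bioboxes/velvet", "assemble", "Velvet")
```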

avilella commented 9 years ago

Awesome, thanks @ntijanic . I think examples like this are very helpful.

Let's see what bioboxers @pbelmann and @michaelbarton say about it, and whether we can find ways to improve the bbx <-> cwl interoperability from this example.

Cheers


michaelbarton commented 9 years ago

@ntijanic thank you for taking the time to write a converter between bioboxes and CWL. I've looked at the gist you provided, however since I'm unfamiliar with using CWL I can't offer you any useful feedback here.

At present we've been working on creating documentation so that developers can start submitting biobox assemblers. I think it would be ideal if bioboxes fit into CWL, so that anyone using CWL could take advantage of these existing bioboxes.

avilella commented 9 years ago

@ntijanic do you have any other recommendations of changes or additions to the way the bioboxes are described that would facilitate their use within CWL?

ghost commented 9 years ago

I still haven't gotten around to playing with bbx much, but it all seems straightforward. It seems like it will be a very minor effort to include any bbx in a cwl workflow, if we supply a cwl template for each bbx-defined interface. Still, a few questions:

1) The assembler example has a single output file. Are there examples with multiple output files, possibly with different types (e.g. a QC tool that outputs a trimmed file and some reports)? This might be out of scope for bbx, since there is no nice way to define a common interface for such cases.

2) How do you handle tools that require/produce index files?

One incompatibility I spotted was that CWL currently forces output to be in the cwd set for the process and AFAICT bbx expects it to be on a specific hard-coded path. We can add a requirement for cwd path in cwl to make this work.

Also, it seems bbx uses positional arguments in places where keyword args might be more appropriate, but that's just a matter of taste. We initially used JSON-Schema to define the interfaces too, but we switched to Apache Avro because it works nicer with JSON-LD/RDF.

pbelmann commented 9 years ago

... It seems like it will be a very minor effort to include any bbx in a cwl workflow, if we supply a cwl template for each bbx-defined interface. ...

That is good to hear.

1) The assembler example has a single output file. Are there examples with multiple output files, possibly with different types (e.g. a QC tool that outputs a trimmed file and some reports)? This might be out of scope for bbx, since there is no nice way to define a common interface for such cases.

For multiple outputs you can take a look at the ray assembler.

https://github.com/bioboxes/ray

It produces one scaffold and one contig file, which are described in the output biobox.yaml. But those are not different types.

2) How do you handle tools that require/produce index files?

At the moment we don't have a definition/standard for handling index files.

One incompatibility I spotted was that CWL currently forces output to be in the cwd set for the process and AFAICT bbx expects it to be on a specific hard-coded path. We can add a requirement for cwd path in cwl to make this work.

If you want to start a biobox you have to mount your output directory to /bbx/output . Is that what you mean?

Also, it seems bbx uses positional arguments in places where keyword args might be more appropriate, but that's just a matter of taste. We initially used JSON-Schema to define the interfaces too, but we switched to Apache Avro because it works nicer with JSON-LD/RDF.

Could you provide an example?

I'm still interested in using Apache Avro for bioboxes and I still want to create a comparison between these libraries. That is one reason why issue #86 is still open.

ghost commented 9 years ago

If you want to start a biobox you have to mount your output directory to /bbx/output . Is that what you mean?

Yes. It shouldn't be a problem, I think. Looking at the ray example, the report-through-output-yaml-file approach should be compatible with CWL.

One possible issue: the paths of produced files in run.sh are hardcoded to "/ray/Contigs.fasta" and "/ray/Scaffolds.fasta". Shouldn't produced files be placed in the mounted output directory as well (/bbx/output)?

Could you provide an example?

    arguments: 
      type: "array"
      minItems: 1
      maxItems: 2
      items: 
        oneOf: 
          - 
            $ref: "#/definitions/fastq"
          - 
            $ref: "#/definitions/fragment"

It might make more sense for this to be a type: object with two fields, since JSON-Schema can't tell that the first item must be fastq and the second fragment (the reverse is also valid according to this schema).
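A hand-rolled stand-in for validating against that array-of-oneOf schema (a deliberate simplification of full JSON-Schema validation) makes the ordering problem visible: both orders pass.

```python
# Simplified check mirroring the array-of-oneOf schema quoted above:
# 1-2 items, each matching either the fastq or the fragment shape.
# Nothing pins the fastq entry to the first position, which is the
# weakness being discussed.
def matches_schema(arguments):
    def is_fastq(item):
        return set(item) == {"fastq"}
    def is_fragment(item):
        return set(item) == {"fragment"}
    return (1 <= len(arguments) <= 2
            and all(is_fastq(i) or is_fragment(i) for i in arguments))

forward = [{"fastq": []}, {"fragment": []}]
reversed_order = [{"fragment": []}, {"fastq": []}]
```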

I'm still interested in using Apache Avro for bioboxes and I still want to create a comparison between these libraries

They seem to be structured differently: JSON-Schema is a set of assertions about the data, while Avro is more for defining types. One issue with Avro is the lack of additional constraints like cardinality or min/max/pattern.

pbelmann commented 9 years ago

Yes. It shouldn't be a problem, I think. Looking at the ray example, the report-through-output-yaml-file approach should be compatible with CWL.

Just out of curiosity: How does CWL report the output of a node in a workflow?

One possible issue: the paths of produced files in run.sh are hardcoded to "/ray/Contigs.fasta" and "/ray/Scaffolds.fasta". Shouldn't produced files be placed in the mounted output directory as well (/bbx/output)?

Yes, thanks @ntijanic, I updated ray.

It might make more sense for this to be a type: object with two fields, since JSON-Schema can't tell that the first item must be fastq and the second fragment (the reverse is also valid according to this schema).

I know what you mean. At the moment you can define the fragment size first and then the fastq files, which is not the nicest solution. Regarding the schema, we had a very long discussion in #61. But I will mention this at the next bioboxes meeting.

ghost commented 9 years ago

Just out of curiosity: How does CWL report the output of a node in a workflow?

The most common case is to use a glob pattern per output port. Depending on declared port type, it produces a file or array of files.

Another way, very similar to bbx, is for the docker process to produce a file called result.cwl.json which holds the values for each output port.

There is also a rarely used option (introduced mostly for compatibility with other description systems or to make it easier for tools to produce e.g. integers rather than files) where one can include an additional transformation after applying glob-based rules to produce something else.

This is what we'd be using for bbx conversion: specify glob: /bbx/output/biobox.yaml, then embed a transformation of the file contents to the actual files it points to :)
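Sketched in Python, that glob-then-transform flow could look like the following. The one-line biobox.yaml used here is deliberately naive and not the real bioboxes schema; a real implementation would use a YAML parser and the actual output description format.

```python
import glob
import os
import tempfile

# Sketch of CWL's glob-then-transform output collection applied to a
# biobox: glob for biobox.yaml in the output directory, then resolve
# the file paths it lists into concrete outputs.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "contigs.fa"), "w") as f:
    f.write(">contig1\nACGT\n")
with open(os.path.join(workdir, "biobox.yaml"), "w") as f:
    f.write("contigs: contigs.fa\n")  # naive stand-in for the real schema

# Step 1: glob for the output description file.
matches = glob.glob(os.path.join(workdir, "biobox.yaml"))

# Step 2: transform its contents into the files it points to.
outputs = {}
with open(matches[0]) as f:
    for line in f:
        port, _, filename = line.strip().partition(": ")
        outputs[port] = os.path.join(workdir, filename)
```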

avilella commented 9 years ago

The ability to specify the output directory was mentioned in today's CWL conf call. Minutes here:

https://docs.google.com/document/d/1S8bMjX_KcjRs70K-60OVnxxRt8O9aCBPc74J_gtF-64/

michaelbarton commented 9 years ago

We've discussed the output directory recently. The output directory will now be /bbx/output inside the container, which the user can mount anywhere. Similarly, a new folder /bbx/metadata can optionally be mounted; if so, log files and additional run data will be placed there.
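Under this convention, starting a biobox comes down to a docker invocation along these lines, built here as a Python argv list rather than executed. The host paths and the task name are illustrative.

```python
# Build (but do not run) the docker command used to start a biobox,
# mounting host directories at the conventional container paths:
# /bbx/input for reads, /bbx/output for results, and the optional
# /bbx/metadata for logs and run data. Host paths are illustrative.
host_input = "/home/user/reads"
host_output = "/home/user/results"
host_metadata = "/home/user/metadata"

docker_argv = [
    "docker", "run",
    "--volume", f"{host_input}:/bbx/input:ro",
    "--volume", f"{host_output}:/bbx/output:rw",
    "--volume", f"{host_metadata}:/bbx/metadata:rw",
    "bioboxes/velvet",
    "default",  # the biobox "task" to run; the name here is illustrative
]
```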

mr-c commented 7 years ago

I'm at a CAMI workshop today and we've succeeded in calling the MEGAHIT BioBox as a CWL Process.

https://github.com/mr-c/biobox-cwl-example/blob/master/biobox-megahit.cwl

Likewise we described MEGAHIT using CWL only:

https://github.com/mr-c/biobox-cwl-example/blob/master/megahit.cwl

We had to make a small change to the base bioboxed Docker container; a PR for that is at https://github.com/bioboxes/biobox-minimal-base/pull/3

michaelbarton commented 7 years ago

Thanks @mr-c, that looks great. We should probably add a tutorial related to this on the bioboxes website.

I'm curious as to what error you were getting for the binary permissions?

mr-c commented 7 years ago

@michaelbarton, you are very welcome!

We didn't get any user-visible error from the permissions problem; it took a lot of debugging to figure out what the issue was (the CWL reference implementation executes within Docker containers as the host user, not root, and the bioboxes utilities were installed so that only root could run them).

michaelbarton commented 7 years ago

We didn't get any user-visible error from the permissions problem; it took a lot of debugging to figure out what the issue was (the CWL reference implementation executes within Docker containers as the host user, not root, and the bioboxes utilities were installed so that only root could run them).

I see, that's useful to know. This could conceivably cause problems for other bioboxes, because only a subset of them use the minimal base image. In addition to a tutorial for bioboxes+CWL, a troubleshooting page listing common problems such as this would probably be useful too.

mr-c commented 7 years ago

@michaelbarton Would you like to explore the possibility of rebasing BioBoxes using CWL's upcoming interface & abstract operations feature? https://github.com/common-workflow-language/common-workflow-language/issues/337

This would be a major shift, I know. No pressure, but wanted to explore the overlap between the projects :-)

Positives:

  • allows focus to return to building shared semantics
  • would allow the use of "generic" Docker containers
  • upgrade runtimes without having to rebuild containers

Cons:

Happy to chat about this in person at BOSC/ISMB with anyone

michaelbarton commented 7 years ago

Thank you for this suggestion @mr-c. The CWL abstract operation interface certainly appears to be almost identical to the bioboxes signatures that we currently use to define biobox types.

  • allows focus to return to building shared semantics

This is the most compelling argument for me, because the shared semantics are what have allowed the JGI and CAMI to collaborate on using the same biobox Docker images.

  • would allow the use of "generic" Docker containers
  • upgrade runtimes without having to rebuild containers

I'm less certain on these two points because I'm not familiar with them.

@pbelmann and I would have to discuss the pros and cons further, because a lot of my code is based on the biobox interface, and I believe the CAMI code base is likely to be the same.

A couple of other points I can think of: we would be giving up a certain amount of control over the interfaces by subsuming bioboxes into CWL. For me personally (I cannot speak for the others involved), bioboxes has been successful because the interfaces have been inflexible. By strictly limiting what can be passed as an argument to a biobox image, and then standardising across all images of the same type, it becomes trivial to run large numbers of different containers with the same interface. To put this another way: if we lose "editorial" control over how interfaces are defined, we could end up in a situation where we try to suit everyone's needs and the project becomes less useful for me.

I am open to the idea, these are my initial concerns. @pbelmann and @asczyrba may have other points too. I think overall standardisation is better for the bioinformatics field than having multiple different interfaces.

mr-c commented 7 years ago

Thank you @michaelbarton for the thoughtful reply.

I think a new framing of the (valid) control issue might be helpful. To me, BioBoxes is a community that has agreed to co-create and co-maintain a set of shared interfaces for common functionality. This is not unlike other standardization or API communities. Currently this is accomplished via tooling specific to BBX and rules on how to use/structure a wildly popular "off the shelf" technology (Docker containers). You have control over the API, some of the tooling, but not the third party components.

The offer here is to continue the collective interface co-creation and maintenance using tooling that has a bigger set of users, and developers. Bioboxes would still be in control of what was an official BBX interface, using your own repository (on GitHub or wherever you choose). You'd gain the advantages listed above without giving up control of the interfaces. And you'd be very welcome to have your own tooling around the cwl-runner interface or to improve the CWL standards or any of the CWL implementations as well.

Another way of saying this: a future version of the CWL standards (1.2 or 2.0) will have the necessary interface & abstract operation features and likewise the reference implementation (and likely other implementations as well) would support these same features. The exact semantics of a biobox interface would still be 100% accomplished by this group using the method you currently grow and make decisions collectively.

Other advantages:

Of course this will require more reflection, testing, and trust that we'll continue to care and maintain the parts of CWL that you would use. Since there is so much in common already I personally find that unlikely. Fortunately at that point you would be a part of the CWL community and would have an equal say in its future direction.

TL;DR: I am very interested in helping bioboxes continue to be a useful thing, especially if CWL can assist with that (or grow to assist)

michaelbarton commented 7 years ago

@mr-c thank you for following up and taking the time to outline this. These are all very important points and have given myself and @pbelmann a great deal to think about. Both Peter and I have not had as much time as we would like to work on bioboxes, and moving the tooling and standardisation to CWL could help us to focus more on the reproducible benchmarking, which is our current priority. I am currently in Bielefeld, so we will spend the next few days discussing this.

michaelbarton commented 7 years ago

@mr-c, a quick follow-up question. Could you imagine how we might build a reproducible benchmarking framework that combines nucleotid.es and CAMI using CWL? The original purpose of bioboxes was for our benchmarked containers to be the same ones in both CAMI and nucleotid.es. For example, are there ways to federate and collect metrics from workflows in different locations, or would this be a case of having a database step in the workflow?

Peter has mentioned something called ebench, which is built on top of CWL.

mr-c commented 7 years ago

@michaelbarton Yes, that is eminently doable. For resource usage metrics (cpu, memory, disk, time) you can augment one of the CWL executors to gather that information from the underlying batch scheduler and output that as part of the provenance in a research object.

For more research-oriented metrics (assembly quality, # unaligned reads, OTUs, etc.) I would suggest outputting them alongside the other primary outputs; afterwards you can combine, store, and transform that information.
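That post-processing step can be sketched as a small pass over per-run metric reports. Simple key/value TSVs are assumed here; the metric names and numbers are made up for illustration.

```python
import csv
import io

# Combine per-assembler metric reports (key<TAB>value TSVs) into one
# table keyed by run. File contents are inlined here for illustration;
# in practice they would be read from each workflow's output directory.
reports = {
    "velvet": "n50\t12000\nnum_contigs\t310\n",
    "megahit": "n50\t15500\nnum_contigs\t244\n",
}

combined = {}
for run, text in reports.items():
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    combined[run] = {key: int(value) for key, value in reader}
```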

pbelmann commented 7 years ago

Our initial version of a generic assembler interface:

class: Operation
id: org.bioboxes.short_read_assembler
semantic: edam:op_0524
inputs:
  fastqs:
    label: interleaved paired & gzipped reads
    type: File[]
    format: edam:format_1931  # FASTQ
outputs:
  contigs:
    type: File
    format: edam:format_1929  # FASTA

This follows the proposal in https://github.com/common-workflow-language/common-workflow-language/issues/337

michaelbarton commented 7 years ago

@mr-c, here are some initial notes from this.

We use EDAM 1931 (FASTQ-Illumina) for the input in the abstract interface. I believe we have to distinguish between FASTQ and FASTQ-Illumina because long reads can also be in FASTQ format, but most assemblers can only handle short reads. We would probably need a different specification for hybrid assemblers. Regardless, I think there should be EDAM terms for long-read and short-read FASTQ.

There is no EDAM specification for interleaved. I think an EDAM specification for this would also be useful, because assemblers need to know if the input data is paired, what kind of pairing it is (inward or outward), and whether it is interleaved. The current biobox interface demands paired and interleaved data.

I don't see a specification for compression format. The biobox inputs are specified to be gzip files. I don't think anyone uses non-gzipped FASTQ, but I think this should be explicit in the interface. Is this an EDAM term or a CWL term, e.g. GzipFile[]?
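Until compression is explicit in the interface, a consumer can at least detect it: gzip streams begin with the magic bytes 0x1f 0x8b. A small sketch:

```python
import gzip
import os
import tempfile

# Write a gzipped FASTQ record, then detect the compression by its
# magic bytes rather than by file extension or interface metadata.
path = os.path.join(tempfile.mkdtemp(), "reads.fq.gz")
with gzip.open(path, "wt") as f:
    f.write("@read1\nACGT\n+\nIIII\n")

def is_gzipped(filename):
    """True if the file starts with the gzip magic bytes."""
    with open(filename, "rb") as f:
        return f.read(2) == b"\x1f\x8b"
```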

There appears to be no specification for the output format. We are using FASTA, but I think an EDAM term for assembled contig or scaffold FASTA would be useful too. These have semantic differences: you might expect scaffolds to have long stretches of Ns from joining contigs together.

How can we specify the different biobox "task" command bundles? Should each one be a new CWL file, or is there a way to specify the task within the CWL file? These tasks are a way of bundling a set of command line arguments in a reproducible way.

Summary:

michaelbarton commented 7 years ago

I've created an issue on the EDAM ontology related to the FASTQ formats, see the linked issue above.

michaelbarton commented 7 years ago

Here's a proposed abstract interface for reference-based assembly evaluation.

class: Operation
id: org.bioboxes.reference_based_assembly_evaluation
semantic: edam:op_3209 # Genome comparison
inputs:
  assembly_fasta:
    label: Assembled genome sequence FASTA files to be evaluated
    type: File[]
    format: edam:format_2200  # FASTA
  reference_fasta:
    label: Reference ground-truth genome sequence FASTA files
    type: File[]
    format: edam:format_2200  # FASTA
outputs:
  metrics:
    type: File
    format: edam:format_3475 # TSV key-value pairs of metrics

mr-c commented 7 years ago

These look great!

The interaction between compression, archives, and other data containers with an underlying data stream format is under-explored currently.

@stian Any thoughts?

As for the 'tasks', I recommend a separate CWL description for each task. You can follow a naming convention and use the label and doc fields to explain in as much detail as you desire.

For example:

megahit-nomercy.cwl

cwlVersion: v1.0
implementsOperation: https://specs.bioboxes.org/short_read_assembler
label: --no-mercy version of the Megahit short read assembler
class: Workflow

inputs:
  fastqs:
    label: interleaved paired & gzipped reads
    type: File[]
    format: edam:format_1931  # FASTQ

steps:
  megahit:
    run: https://github.com/mr-c/biobox-cwl-example/raw/master/megahit.cwl  # or relative path
    in:
      sequences: fastqs
      no_mercy: { default: True }
    out: [ megahit_contigs ]

outputs:
  contigs:
    type: File
    format: edam:format_1929  # FASTA
    outputSource: megahit/megahit_contigs

$namespaces:
  edam: http://edamontology.org/
$schemas:
 - http://edamontology.org/EDAM_1.18.owl

mr-c commented 4 years ago

FYI, class: Operation is going to be released as part of CWL v1.2 https://www.commonwl.org/v1.2.0-dev4/Workflow.html#Operation