bioboxes / rfc

Request for comments on interchangeable bioinformatics containers
http://bioboxes.org
MIT License

Proposal: Standardised bioinformatics objects and morphisms #9

Closed michaelbarton closed 9 years ago

michaelbarton commented 9 years ago

I have been thinking about this standard since we discussed it in Bielefeld two weeks ago. I am worried that we are not being ambitious enough. I believe that containers will have a large impact on how code is shared and used in bioinformatics. I think containers will become common, whether through Docker or another yet-to-emerge implementation. Given the impact that I believe containers are likely to have on sharing scientific software, we should consider including additional standardisation to solve other problems in bioinformatics.

For instance, a genome assembler converts FASTQ into FASTA, and we could describe this as a morphism from one type of bioinformatics object to another:

f : [Q] → [A]

Here f is a container, Q is a FASTQ entry, and A is a FASTA entry. This represents the transformation of a list of FASTQ entries (reads) into a list of FASTA entries (contigs). A paired-end FASTQ assembler can then be described similarly:

f : [(Q,Q)] → [A]

Here, a list of FASTQ tuples is converted to a list of FASTA entries. Using this syntax, each container can list the morphisms that it provides using the same language. A different example would be a reference-free read binning container:

f : [Q]     → [(I,I)]
f : [(Q,Q)] → [(I,I)]

These two morphisms describe passing either a list of FASTQ entries or a list of paired FASTQ tuples, and returning a list of 2-tuples of identifiers: the read ID and the OTU ID. A reference-based binning container is the morphism:

f : [A], [(Q,Q)] → [(I,I)]

This is the same as a reference-free container except that a list of reference FASTA objects is also given to the container.
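The morphisms above can be sketched as Python type aliases. This is a hedged illustration only: all names here (`FastqEntry`, `Assembler`, `toy_assemble`, and so on) are mine, not RFC syntax.

```python
# A hedged sketch modelling the morphisms above as Python type aliases.
# All names here are illustrative, not part of the RFC.
from typing import Callable, List, Tuple

FastqEntry = str  # Q: one read
FastaEntry = str  # A: one contig
ReadId = str      # I: read identifier
OtuId = str       # I: OTU identifier

# f : [Q] -> [A]
Assembler = Callable[[List[FastqEntry]], List[FastaEntry]]
# f : [(Q,Q)] -> [A]
PairedAssembler = Callable[[List[Tuple[FastqEntry, FastqEntry]]], List[FastaEntry]]
# f : [Q] -> [(I,I)]
Binner = Callable[[List[FastqEntry]], List[Tuple[ReadId, OtuId]]]

# A toy value inhabiting the Assembler type: concatenate all reads
# into a single "contig". Real assemblers are obviously far more complex.
def toy_assemble(reads: List[FastqEntry]) -> List[FastaEntry]:
    return ["".join(reads)]
```

The point is only that each container's behaviour can be named by a type, not that the types would be implemented this way.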

Composing morphisms

I believe the advantage of using this syntax is that we can then describe all containers using a common language, making it easier to understand the inputs for a container. We could then describe a container that performs the following:

c : (f : X → Y), (g : Y → Z) → g ∘ f

This is a container that, given two containers, returns a new container composing the two morphisms together. Consider a concrete example:

# A container that processes paired FASTQ
f : [(Q,Q)] → [(Q,Q)]

# A container that assembles reads into contigs
g : [(Q,Q)] → [A]

# Create a container that preprocesses paired reads and then assembles
C(f, g) = g ∘ f : [(Q,Q)] → [A]

Thus we create a container implementing an assembly pipeline. Furthermore, we ensure that every step will flow into the next one as long as the morphism types match. Finally, this creates a single container that we can share as easily as the containers we are already using.
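The composed pipeline above can be sketched with ordinary function composition. `preprocess` and `assemble` are toy stand-ins of my own naming, not real bioboxes:

```python
# Toy stand-ins for f and g; composing them gives the pipeline C(f, g).
from typing import List, Tuple

Pair = Tuple[str, str]  # (Q,Q): a forward/reverse read pair

# f : [(Q,Q)] -> [(Q,Q)]  -- toy preprocessing: drop empty read pairs
def preprocess(pairs: List[Pair]) -> List[Pair]:
    return [(fwd, rev) for fwd, rev in pairs if fwd and rev]

# g : [(Q,Q)] -> [A]  -- toy "assembly": join each pair into a contig
def assemble(pairs: List[Pair]) -> List[str]:
    return [fwd + rev for fwd, rev in pairs]

# C(f, g) = g . f : [(Q,Q)] -> [A]
def pipeline(pairs: List[Pair]) -> List[str]:
    return assemble(preprocess(pairs))
```

Because the output type of `preprocess` matches the input type of `assemble`, the composition is well-formed; swapping in a step with a mismatched type would be detectable before anything runs.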

Implementation

Implementing this would not require changing much of what we have already proposed. Instead of requiring each container to behave as a certain 'type' and implement all of the above morphisms, we can specify a list of morphisms, and each container can declare which of those it implements. The use of environment variables as arguments would otherwise stay the same.
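One hedged way to picture this: a container carries a small declaration of the morphisms it implements, with each morphism tied to the environment variables that deliver its inputs. The field names here are mine; the env-var name is the one discussed later in the thread.

```python
# Hedged sketch: a container declaring which named morphisms it implements,
# and which environment variables carry each input. Field names are
# illustrative, not RFC syntax.
from dataclasses import dataclass
from typing import List

@dataclass
class Morphism:
    morph_id: str        # e.g. "paired_assembly"
    input_type: str      # e.g. "[(Q,Q)]"
    output_type: str     # e.g. "[A]"
    env_vars: List[str]  # env vars used to pass the inputs

paired_assembly = Morphism(
    morph_id="paired_assembly",
    input_type="[(Q,Q)]",
    output_type="[A]",
    env_vars=["CONT_PAIRED_FASTQ_FILE_LISTING"],
)
```

A container would then list the subset of such declarations it supports, and the runner could pick one morphism (perhaps via an extra environment variable) before invoking it.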

abremges commented 9 years ago

FYI -- I'll come back to this in early 2015, after the Christmas break and vacation. /A

pbelmann commented 9 years ago

It is a good idea to use this formalism/syntax because it makes it much easier to understand what a container should expect as input and provide as output. If I understand you correctly, we would describe the types and morphism rules and finally compose them into container specifications.

So our short-read genome assembler container would be defined as:

g : [(Q,Q)] → [A]
g : [Q] → [A]

If we make it that abstract, then it might be difficult to specify a more concrete parameter type like CONT_FASTQ_FILE_LISTING or CONT_PAIRED_FASTQ_FILE_LISTING.

Should we describe it this way: g : [[Q]] → [A]? Or should we define a type that describes the listing?

michaelbarton commented 9 years ago

Yes exactly Peter. Each container could specify exactly which morphisms it provides. I believe we could continue to use the same environment variables but tie their names each to a different morphism.

The use of objects and morphisms would be a language for describing containers, while the environment variables would be the concrete implementation details the internal code uses. There may need to be an extra environment variable to specify the morphism, though.


pbelmann commented 9 years ago


OK. Since we can use the same environment variables, I will add the binning, and later the profiling, specifications as we have done so far.


I would suggest we release our current RFC as version 1.0; in version 1.x we can transform everything to morphisms and objects, as long as the environment variables stay the same. Finally, I would suggest solving the other issues like #8 #6 #5 #4 independently of this proposal, so that we have a version 1.0.

fungs commented 9 years ago

I agree that a more formalized specification for the input and output of containers would better allow for passing data and chaining containers. But I also believe we should start with a stable version, as more issues will pop up once the containers are actually used. Maybe go for a version below 1.0 to indicate that this specification draft is still being developed.

For the signature-based approach we need data types for everything that makes up input and output. These could also involve an ontology which defines the terms and relates them hierarchically.

Johannes

abremges commented 9 years ago

The downside is that this makes everything much harder for laymen to understand. If I were, e.g., a biologist with only hands-on (bio)informatics training, there is no way I would understand such a formalism. I'd expect most tool developers to understand it, but even here I have my doubts.

Proposal: Go for morphisms, but add explanations in each case, pretty much like @michaelbarton did in the post above; a few lines will suffice. Otherwise we will lose some people and their potentially valuable contributions.

michaelbarton commented 9 years ago

After discussion with @fungs and @pbelmann we have agreed that this is not a priority for the v0.8 release, as it would require additional time to rewrite the RFC to match.

I have added this to the tentative v2.0 milestone as implementing this would break backwards compatibility with v0.8 and therefore require incrementing the major release version following semantic versioning.

fungs commented 9 years ago

Before reinventing the wheel we should realize that passing data and arguments is a form of interprocess communication, and that there are many existing solutions that will help us select the best possible one. From my point of view, we have to find out how to best fit a structured approach using JSON or similar into the container framework, and how to define it in a way that leaves it open to other or new containerization techniques and other platforms.

Maybe we should have a look at how web services and pipelining tools are currently doing this, although this is in a slightly different context. I'd suggest starting by looking at Taverna and its modules. Maybe we could even aim for network transparency, making every biobox a possible web application.

See http://www.myexperiment.org and you will understand how powerful this approach could be.

avilella commented 9 years ago

This probably has some overlap with the efforts of the Global Alliance project called common workflow language. I found the github URL the other day:

https://github.com/common-workflow-language/common-workflow-language

Here are some more notes on the project:

    - Common Workflow Language working group

    - The Common Workflow Language (CWL) is an informal,
      multi-vendor working group consisting of various
      organizations and individuals that have an interest in
      portability of data analysis workflows.  Our goal is to
      create specifications that enable data scientists to
      describe analysis tools and workflows that are powerful,
      easy to use, portable, and support reproducibility.  CWL can
      be used to describe workflows for a variety of problem areas
      including data-intensive science like bioinformatics,
      physics, and astronomy; and business analytics such as log
      analysis, data mining, and ETL.

    - https://groups.google.com/forum/#!forum/common-workflow-language

    - https://github.com/common-workflow-language/common-workflow-language

    - The Global Alliance for Genomics Health (GA4GH) is an
      "international coalition, dedicated to improving human
      health by maximizing the potential of genomic medicine
      through effective and responsible data sharing." The GA4GH
      is in the process of starting up a working group on
      workflows and containers and this working group will likely
      pick up and adopt the efforts of the Common Workflow
      Language group and its efforts to create standard, modern
      formats for describing tools and workflows for
      bioinformatics platforms. Implementing support for these
      formats in Galaxy would allow tool and workflow authors to
      produce artifacts that could be shared across Galaxy, Seven
      Bridges, and Arvados and potentially other platforms in the
      future such as Mobyle.


pbelmann commented 9 years ago

I think this should be a topic in the next meeting.

avilella commented 9 years ago

CWL mentions Taverna et al.: https://groups.google.com/forum/#!topic/common-workflow-language/eGrNfpjuq2E

avilella commented 9 years ago

A mention of an already-working cwltool for Docker:

https://groups.google.com/forum/#!topic/common-workflow-language/A0JffijHipY

michaelbarton commented 9 years ago

This topic was reviewed in meeting #67. Type signatures are still relevant and are being discussed in issues #86 and #61.

michaelbarton commented 9 years ago

We have implemented signatures for the biobox RFCs. For example, we have specified that every short-read assembler should provide the signature:

f : [fastq A], [Maybe insert_size A] → contigs B, scaffold C

This indicates a biobox that takes a list of FASTQ files and a list of optional insert sizes. Both lists share the type variable A, indicating that their entries correspond: for each FASTQ file there may be a corresponding insert-size entry in the second list.

I think these should be the central method for describing what a biobox does, because they contain all the required information in a succinct format. I believe we could also overlay a type system once we chain containers together: for instance, if we build a workflow that uses the output of container A as the input for container B, it would be possible to check that the signatures match before executing the workflow. This would prevent the kind of error where a workflow breaks halfway through because of incompatible data types.
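The positional correspondence implied by the shared index A can be sketched as follows. This is a hedged illustration with names of my own choosing, not the RFC's implementation:

```python
# Hedged sketch of the short-read assembler signature above. The shared
# index A says the FASTQ list and the optional insert-size list correspond
# entry by entry; zip makes that pairing explicit. All names are mine.
from typing import List, Optional, Tuple

Fastq = str       # path to a FASTQ file
InsertSize = int  # library insert size in base pairs

# f : [fastq A], [Maybe insert_size A] -> ...
def pair_inputs(
    fastqs: List[Fastq],
    insert_sizes: List[Optional[InsertSize]],
) -> List[Tuple[Fastq, Optional[InsertSize]]]:
    # the shared index A requires the lists to be the same length
    if len(fastqs) != len(insert_sizes):
        raise ValueError("fastq and insert-size lists must correspond")
    return list(zip(fastqs, insert_sizes))
```

A workflow runner could perform this kind of check on every pair of chained signatures before launching any container.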

My thought on the biobox.yml file is that it is a way of providing input, where the format of the data should match the signature of the biobox. If the formats do not match, the biobox should not be run.

michaelbarton commented 9 years ago

I will close this as I believe the signature formats in the RFCs implement this.