bioboxes / rfc

Request for comments on interchangeable bioinformatics containers
http://bioboxes.org
MIT License

Decouple biobox implementations and input validation #163

Open fungs opened 9 years ago

fungs commented 9 years ago

Hi @michaelbarton and @pbelmann,

I have started building a base Debian image to speed up biobox implementations and reduce the code they need. While doing so, I have come to strongly favor separating input validation from the actual parsing and processing. There are several reasons why this would be beneficial:

  1. Validation code needs to be replicated in every biobox, and updates to the code need to be propagated to each implementation.
  2. Validation code clutters the Dockerfiles and makes the images more complicated, with more dependencies (including build-time dependencies such as internet access).
  3. If you pass the same input to several bioboxes that share the same YAML input specification, the same check is run more than once.

All of these problems could be avoided by providing a single container image that validates the input. One could provide either one image per schema with deep inspection capabilities (such as file format checks) or one general image.

The magic then happens in our bioboxes run wrapper, which would call the validation container before running the actual biobox, if that is the desired behavior. With this design, any biobox can assume that it receives valid input and restrict itself to a simple YAML parser.
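
A minimal sketch of that wrapper flow, under stated assumptions: the validator image name `example/validate-input`, the mount point `/bbx/input/biobox.yaml`, and the `task` argument are all hypothetical and not defined in this thread. The `runner` parameter exists only so the control flow can be tried without Docker installed.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical image name and mount point, chosen only for illustration.
VALIDATOR_IMAGE = "example/validate-input"
YAML_MOUNT = "/bbx/input/biobox.yaml"


def docker_run_cmd(image: str, yaml_path: str, args: Sequence[str] = ()) -> List[str]:
    """Build a `docker run` command that mounts the input YAML read-only.

    `yaml_path` should be an absolute host path, as bind mounts require.
    """
    return [
        "docker", "run", "--rm",
        "--volume", f"{yaml_path}:{YAML_MOUNT}:ro",
        image, *args,
    ]


def _call(cmd: List[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd).returncode


def run_biobox(
    image: str,
    yaml_path: str,
    task: str = "default",
    validate: bool = True,
    runner: Callable[[List[str]], int] = _call,
) -> int:
    """Optionally run the validator container, then the biobox itself."""
    if validate:
        status = runner(docker_run_cmd(VALIDATOR_IMAGE, yaml_path))
        if status != 0:
            # Invalid input: the biobox is never started.
            return status
    return runner(docker_run_cmd(image, yaml_path, [task]))
```

With something like this, the biobox image itself would carry no validation code, and `validate=False` would cover the case where checking is not the desired behavior.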

fungs commented 9 years ago

This is directly related to #131.

It also means that providing an independent, reliable distribution channel for the validator binaries or deb package becomes less important, since we can use Docker Hub directly.

michaelbarton commented 9 years ago

> 1. Validation code needs to be replicated in every biobox, and updates to the code need to be propagated to each implementation.

I believe using apt can solve this, since the latest version is installed whenever the image is rebuilt.

> 2. Validation code clutters the Dockerfiles and makes the images more complicated, with more dependencies (including build-time dependencies such as internet access).

I agree. I think the Dockerfiles contain boilerplate code that obscures what each step is doing. Either apt or base images could solve this.

> 3. If you pass the same input to several bioboxes that share the same YAML input specification, the same check is run more than once.

Could you expand on this point?

> The magic then happens in our bioboxes run wrapper, which would call the validation container before running the actual biobox, if that is the desired behavior. With this design, any biobox can assume that it receives valid input and restrict itself to a simple YAML parser.

I think you are suggesting a wrapper script that runs the validation scripts before running a developer-defined script. This could be helpful. This could be the ENTRYPOINT in the Dockerfile.
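
As a rough sketch of what such an in-container wrapper might look like: the command names `validate-input` and `run-task` and the input path are hypothetical placeholders, not tools defined in this thread, and `runner` is injectable only so the logic can be exercised without them.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical commands: neither name is defined anywhere in this thread.
VALIDATE_CMD = ["validate-input", "/bbx/input/biobox.yaml"]
TASK_CMD = ["run-task"]


def _call(cmd: List[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd).returncode


def entrypoint(
    argv: Sequence[str],
    validate_cmd: Sequence[str] = VALIDATE_CMD,
    task_cmd: Sequence[str] = TASK_CMD,
    runner: Callable[[List[str]], int] = _call,
) -> int:
    """Validate the input first; run the developer's script only on success."""
    status = runner(list(validate_cmd))
    if status != 0:
        return status
    return runner([*task_cmd, *argv])
```

A script along these lines would be named in the Dockerfile's ENTRYPOINT and finish with `sys.exit(entrypoint(sys.argv[1:]))`. Note that in this variant the validation code still ships inside every image.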

fungs commented 9 years ago

> I believe using apt can solve this, since the latest version is installed whenever the image is rebuilt.

That's one approach, but it means we still need to maintain an apt repository (overhead plus a network requirement). Since each biobox would carry its own copy of the validation program, the versions will drift apart depending on when each container was built. We therefore could not push updates directly to biobox users without altering or rebuilding individual containers. My suggestion would deliver the latest validation code to each biobox user through our main distribution channel and technology: the Docker registry. It should therefore be more reliable and have fewer dependencies.

> 3. If you pass the same input to several bioboxes that share the same YAML input specification, the same check is run more than once.

> Could you expand on this point?

If you have one input that needs to be validated, say a read library for assembly, it is guaranteed to be valid once the validator confirms it. It can then be passed to any assembler biobox that accepts this kind of input. If the validation program is integrated into the biobox, each assembler biobox re-checks the input, which is clearly unnecessary.
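
To illustrate the validate-once idea, here is a small sketch. The function name `run_all` and the `validate` and `run` callables are hypothetical stand-ins for whatever actually starts the validator and the bioboxes.

```python
from typing import Callable, Dict, Iterable


def run_all(
    yaml_path: str,
    bioboxes: Iterable[str],
    validate: Callable[[str], bool],
    run: Callable[[str, str], int],
) -> Dict[str, int]:
    """Validate the shared input once, then run each biobox on it."""
    if not validate(yaml_path):
        raise ValueError(f"input failed validation: {yaml_path}")
    # Every biobox below receives input that is already known to be valid,
    # so none of them needs to repeat the check.
    return {image: run(image, yaml_path) for image in bioboxes}
```

However many assemblers are listed, the validator runs a single time per input file.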

> I think you are suggesting a wrapper script that runs the validation scripts before running a developer-defined script. This could be helpful. This could be the ENTRYPOINT in the Dockerfile.

No, in fact I mean running an independent validation container before running the actual biobox. This would simplify the biobox implementation by separating the validation logic from the execution logic.