GELOG / adamcloud

Portable cloud infrastructure for a genomic transformation pipeline using Adam

Docker image repeatability #18

Open davidonlaptop opened 9 years ago

davidonlaptop commented 9 years ago

Current setup

This diagram shows the inheritance between Docker images, and which images may use another image. ![image hierarchy](http://yuml.me/diagram/plain;dir:LR/class/[SNAP:1.0beta15]-^[Ubuntu:14.04.1], [ADAM:0.14]-^[Spark:1.1-bin-hadoop2.3], [ADAM:0.15]-^[Spark:1.2-bin-hadoop2.3], [Avocado:0.0-master-branch-s1.1-h2.3]-^[Spark:1.1-bin-hadoop2.3], [Avocado:0.0-master-branch-s1.2-h2.3]-^[Spark:1.2-bin-hadoop2.3], [Spark:1.1-bin-hadoop2.3]-^[Java:openjdk7], [Spark:1.1-bin-hadoop2.4]-^[Java:openjdk7], [Spark:1.2-bin-hadoop2.3]-^[Java:openjdk7], [Java:openjdk7]-^[Ubuntu:14.04.1], [Hadoop:2.3]-^[Java:openjdk7], [Hadoop:2.6]-^[Java:openjdk7], [Spark:1.1-bin-hadoop2.3]->[Hadoop:2.3], [Spark:1.2-bin-hadoop2.3]->[Hadoop:2.3])

Repeatability with Docker image

The issue

Some images are built from source (e.g. avocado builds from GitHub sources). This means that when we update the Dockerfile of an image already published on the Docker Hub and rebuild it, the upstream code may have changed beyond our control. This can cause unexpected behavior for users of the image, and may even break a user's pipeline.

Therefore, we established a set of rules below to provide better repeatability.

(As with configuration management systems, a 100% repeatability guarantee would be really difficult to reach, but a good-enough guarantee seems acceptable.)

Rules to improve repeatability

  1. All images MUST be versioned using the encapsulated technology's version (e.g. the image for Hadoop 2.5.1 would be named hadoop:2.5.1).
  2. Semantic Versioning is assumed unless otherwise specified (i.e. the version number follows the MAJOR.MINOR.PATCH convention, where breaking changes are only allowed in a MAJOR increment).
  3. Images with complete version number (e.g. 2.5.1) SHOULD NEVER change once published to the Docker Hub. End-users are encouraged to use these images for production use.
  4. Only image aliases (e.g. 2.5, 2) are allowed to change. At first hadoop:2, hadoop:2.5, and hadoop:2.5.1 may all point to the same image. When the image hadoop:2.5.2 is released, the aliases hadoop:2 and hadoop:2.5 are updated to point to it.
  5. No guarantee is made to provide a Docker image for all possible versions. Updates are made based on our needs or upon request.
  6. The folders in the git repo follow the MAJOR.MINOR version scheme. Therefore, the folder hadoop-2.5/ would contain the latest release (e.g. Hadoop 2.5.1) available at image creation time.
  7. Complete version numbers can be found via a GitHub tag, e.g. Hadoop 2.5.1's Dockerfile can be found at /tree/v2.5.1/hadoop-2.5/Dockerfile.
  8. An image with a complete version number MUST only inherit from another image with a complete version number, e.g. spark:1.3.0 can inherit from hadoop:2.3.0 but not from hadoop:2.3 or hadoop:2.
  9. The Dockerfile should specify the versions of all software that it installs. If the Hadoop image requires Maven to build, the Dockerfile should specify the Maven version in a variable, as well as the Hadoop version.
  10. When possible, all images should inherit from the same base image. For now, it is Ubuntu 14.04.1.
  11. When possible, all Java-based software should use OpenJDK, as OracleJDK cannot be distributed legally in a Docker image without the user accepting the license (see the explanation from the Docker team here).
  12. Special care should be taken to keep the image size as small as possible. Refer to the advice for building official Docker images.
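
Rules 3, 4 and 8 can be checked mechanically. The following is only a minimal Python sketch of how that might look, not tooling that exists in this repo. The function names (`is_complete`, `resolve_aliases`, `base_image_allowed`) are hypothetical, and it assumes tags are plain MAJOR.MINOR.PATCH strings such as `2.5.1` (tags with suffixes would need a different pattern).

```python
import re

# Assumption: a "complete" version is exactly MAJOR.MINOR.PATCH, digits only.
FULL_VERSION = re.compile(r"^\d+\.\d+\.\d+$")


def is_complete(tag):
    """Return True for a complete version such as '2.5.1', False for an alias such as '2.5'."""
    return FULL_VERSION.match(tag) is not None


def parse(tag):
    """Turn '2.5.1' into (2, 5, 1) so that versions compare numerically."""
    return tuple(int(part) for part in tag.split("."))


def resolve_aliases(published):
    """Rule 4: map each alias (MAJOR and MAJOR.MINOR) to the newest complete version."""
    aliases = {}
    complete = [tag for tag in published if is_complete(tag)]
    # Walk from oldest to newest so that newer releases overwrite older ones.
    for tag in sorted(complete, key=parse):
        major, minor, _patch = tag.split(".")
        aliases[major] = tag
        aliases[f"{major}.{minor}"] = tag
    return aliases


def base_image_allowed(image_tag, base_tag):
    """Rule 8: an image with a complete version may only inherit from a complete version."""
    return not is_complete(image_tag) or is_complete(base_tag)


if __name__ == "__main__":
    print(resolve_aliases(["2.5.1"]))
    print(resolve_aliases(["2.5.1", "2.5.2"]))
    print(base_image_allowed("1.3.0", "2.3.0"))
    print(base_image_allowed("1.3.0", "2.3"))
```

With only `2.5.1` published, both aliases `2` and `2.5` resolve to `2.5.1`. Once `2.5.2` is added they both move to `2.5.2`, while the image tagged `2.5.1` itself is never touched.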

Remaining issues (Questions for ADAM Team)

Q1) Adam inherits from Spark, but Spark has multiple versions

This can multiply the number of images to support. If there are 10 versions of Spark and 10 versions of Adam, that makes up to 100 possible images to support. This will quickly become unmaintainable.

Is there a way to build a Docker image for Adam that will make it compatible with multiple Docker images of Spark?

e.g. If the Adam 0.15.0 image inherits from the Spark 1.2.1 image, can it be used with a Spark cluster whose version is less than or equal to 1.2.1?
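
To make the question concrete: if that kind of backward compatibility did hold (which is exactly what is unknown here), the check before pairing an image with a cluster would be a numeric version comparison. A minimal Python sketch follows. `parse_version` and `image_may_run_on` are hypothetical names, and the "less than or equal" rule is the hypothesis from the question, not a confirmed property of Adam or Spark.

```python
def parse_version(text):
    """Turn '1.2.1' into (1, 2, 1) so that versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def image_may_run_on(image_spark, cluster_spark):
    """Hypothesis only: an image built on Spark X works with clusters running Spark <= X."""
    return parse_version(cluster_spark) <= parse_version(image_spark)


if __name__ == "__main__":
    # The Adam image in the example inherits from Spark 1.2.1.
    for cluster in ["1.1.0", "1.2.1", "1.3.0"]:
        print(cluster, image_may_run_on("1.2.1", cluster))
```

Comparing integer tuples rather than raw strings matters once a component reaches two digits: as strings, `"1.10.0"` sorts before `"1.9.0"`.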

Q2) Spark has multiple platforms

This is similar to the issue above. A single Spark release can be built for multiple platforms: Hadoop 2.3, Hadoop 2.4, Hadoop 2.6, MapR 3.x, MapR 4.x, Hadoop 1.x, CDH4, etc.

Supporting MapR, CDH4, and Hadoop 1.x is a separate issue, out of the scope of this analysis. This leaves us with 3 releases of Hadoop.

Is there a way to build a Docker image for Spark that will make it compatible with multiple Docker images of Hadoop?

Otherwise, we need to build at least 3 images for each new version of Spark.
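
The growth of this build matrix can be illustrated with a short Python sketch. The tag format `spark:<version>-bin-hadoop<version>` is an assumption modeled on the names in the diagram above, and `image_matrix` is a hypothetical helper, not something that exists in this repo.

```python
from itertools import product


def image_matrix(spark_versions, hadoop_versions):
    """List one image tag per (Spark, Hadoop) pair; the tag format is an assumption."""
    return [
        f"spark:{spark}-bin-hadoop{hadoop}"
        for spark, hadoop in product(spark_versions, hadoop_versions)
    ]


# Two Spark versions against the three Hadoop releases kept in scope.
tags = image_matrix(["1.1", "1.2"], ["2.3", "2.4", "2.6"])

if __name__ == "__main__":
    for tag in tags:
        print(tag)
    print(len(tags), "images to build")
```

Each new Spark version adds one image per supported Hadoop release, so the count is always the product of the two list lengths (6 in this example).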