OpenRiskNet / home

OpenRiskNet wiki and core resources

Create Notebook image for RDKit #16

Closed: tdudgeon closed this issue 5 years ago

tdudgeon commented 6 years ago

We need an image that can run RDKit. Possibly this could extend the current SciPy image, but I think that contains lots of modules, so it might be better to start from a basic image and just add what's needed?

danyx23 commented 6 years ago

I have now looked into this in some detail. There are two ways of using Jupyter images. One is the source2image way outlined here, and the other is to use the base JupyterHub images with modifications and some config changes (regarding the Unix group ID used and increasing the timeout for pulling images; see https://github.com/jupyter-on-openshift/jupyterhub-quickstart#using-the-jupyter-project-notebook-images for more details).
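As a rough illustration of the config changes mentioned above (the extra Unix group ID and the longer timeouts), a `jupyterhub_config.py` fragment might look like the sketch below. `start_timeout`, `http_timeout` and `supplemental_gids` are spawner settings; the values shown are illustrative assumptions, not the settings used on the ORN cluster, and the `SimpleNamespace` fallback exists only so the fragment can be run outside JupyterHub.

```python
from types import SimpleNamespace

# JupyterHub provides get_config() when it loads jupyterhub_config.py.
# The fallback object is an assumption added so this sketch also runs standalone.
try:
    c = get_config()  # noqa: F821 (injected by JupyterHub at load time)
except NameError:
    c = SimpleNamespace(KubeSpawner=SimpleNamespace())

# Allow more time for a large notebook image to be pulled and started.
# 600 seconds is an illustrative value, not a recommendation.
c.KubeSpawner.start_timeout = 600

# Allow more time for the notebook server to answer HTTP once the pod is up.
c.KubeSpawner.http_timeout = 120

# The Jupyter project images expect the notebook user to belong to
# the "users" group (GID 100), so add it as a supplemental group.
c.KubeSpawner.supplemental_gids = [100]
```

When loaded by JupyterHub, only the three `c.KubeSpawner...` assignments matter; the `try`/`except` block can be dropped.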

To get RDKit working, I would strongly argue against going the source2image way. Source2image tries to keep image sizes small by not installing conda, but installing RDKit in a container without conda is going to be a lot more work than running conda install rdkit in a Dockerfile when conda is available. I also think that the image size is a bit annoying but not a knockout criterion for us, unlike in the motivating example for source2image, where images above 2 GB or so can't be used in the free OpenShift demo installation (see https://github.com/jupyter-on-openshift/jupyter-notebooks#why-not-use-jupyter-project-images). I would therefore suggest making the necessary config changes and preparing an image on top of one of the official Jupyter Docker images.

tdudgeon commented 5 years ago

I do have concerns about image size. It's the cause of some of the problems we have with Jupyter notebook images not starting reliably. But I also agree that installing RDKit with Conda would make things easier.

One possible approach is to create an S2I image that contains Miniconda rather than a full Anaconda install. That way we can still install RDKit and its dependencies, but not have half the internet that a full conda install puts there.

Another possible approach is to start from our RDKit base images, which are designed to be small and follow best practices. See here for the GitHub project and here for the particular image that could be used, which would only need the Jupyter components added.

If we do need to go the route of huge kitchen sink images we probably need to pre-pull those images to all the nodes to limit the problems related to image size.
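One common way to pre-pull an image onto every node is a DaemonSet whose only job is to run a do-nothing container from that image. The sketch below builds such a manifest as a plain dictionary and prints it as JSON, which `oc apply -f -` or `kubectl apply -f -` can consume. The function name, the resource name and the image reference are hypothetical placeholders, and `apps/v1` is assumed to be available on the cluster.

```python
import json


def prepull_daemonset(image, name="prepull-notebook-image"):
    """Build a DaemonSet manifest that forces every node to pull `image`."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",  # assumption: cluster supports apps/v1
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "prepull",
                            "image": image,
                            # Keep the container idle so the image stays cached
                            # on the node without doing any real work.
                            "command": ["sleep", "infinity"],
                        }
                    ]
                },
            },
        },
    }


# Hypothetical image reference, used only for illustration.
manifest = prepull_daemonset("example/large-notebook-image:latest")
print(json.dumps(manifest, indent=2))
```

This only warms the node-local image cache; it does not make the image itself any smaller.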

danyx23 commented 5 years ago

There is now an example Dockerfile with RDKit: openshift/deployments/jupyter/notebook-containers/base-notebook-with-rdkit.Dockerfile

A Docker image built from this Dockerfile has been pushed to Docker Hub (douglasconnect/orn-jupyter-rdkit). The image is built in a naive way and is about 2 GB in size, but the method is very easy to replicate for other images.

The image has been added as an option in the notebook image selector of our JupyterHub. Unfortunately, the image creation timeout, which should be customizable, does not respond to configuration changes, and the image is currently not pulled successfully within 3 minutes (the default timeout).

There are two possible next steps: build the image within the ORN cluster with Jenkins and push it to the internal Docker registry, since that registry should be able to stream the image in under 3 minutes; or investigate why increasing the timeout in the configuration has no effect.

tdudgeon commented 5 years ago

An RDKit notebook image is now deployed to the prod site. This is how it is built: https://github.com/OpenRiskNet/jupyter-notebook-images/tree/master/simple-rdkit
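A quick way to check that a notebook started from such an image can actually use RDKit is a small smoke test in a notebook cell. The sketch below is an assumption about how one might do that, not part of the linked build; `rdkit_smoke_test` is a hypothetical helper name, and it returns `None` instead of failing when RDKit is not installed.

```python
def rdkit_smoke_test(smiles="CCO"):
    """Return the heavy-atom count for `smiles`, or None if RDKit is missing."""
    try:
        from rdkit import Chem
    except ImportError:
        return None
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    # GetNumAtoms() counts the atoms in the molecular graph; hydrogens are
    # implicit after parsing, so for ethanol this counts only C, C and O.
    return mol.GetNumAtoms()


result = rdkit_smoke_test()
if result is None:
    print("RDKit is not installed in this environment")
else:
    print(f"RDKit is working; ethanol has {result} heavy atoms")
```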