DataBiosphere / leonardo

Notebook service
BSD 3-Clause "New" or "Revised" License

User-customizable Docker images #414

kyuksel closed this issue 4 years ago

kyuksel commented 6 years ago

Allow users to specify custom Docker images to be used to launch notebooks.

Soo Hee Lee from the DSP Comms team: "We need to be able to run the Docker image broadinstitute/gatk:4.0.5.0 and run a Spark tool from the GATK toolkit."

rtitle commented 6 years ago

This is going to turn into a Q3 OKR.

rtitle commented 6 years ago

See this PR for a way to split up our Dockerfile into a base image + extension. The PR was not merged because our build didn't support it at the time, but it may be useful as a reference.

https://github.com/DataBiosphere/leonardo/pull/330

sooheelee commented 5 years ago

Just seeing that I'm being quoted. I'd like to be more specific about what we need. A number of GATK tools have additional R or Python dependencies that are not straightforward to install manually, so we tell folks to use the GATK Docker image. How can we easily enable researchers to use these tools in Leonardo?

Perhaps the solution is custom kernels.

The R dependencies are for certain GATK plotting tools, which I think are exactly the type of functionality we want to enable in Leonardo. Currently, we provide instructions for setting up this environment manually.

The Python environment is needed for GATK's newer machine learning algorithms, e.g. GermlineCNVCaller. To set it up locally, we tell folks to use conda to create the environment from the gatkcondaenv.yml file that comes packaged with GATK, and then activate it with source activate. The contents of the yaml for GATK 4.1.0.0 are included below. How do we enable this in Leonardo? Or is Leonardo not really intended to support such functionality?

Thanks.

```yaml
# Conda environment for GATK Python Tools
#
name: gatk
channels:
- defaults
dependencies:
- certifi=2016.2.28=py36_0
- intel-openmp=2018.0.0
- mkl=2018.0.1
- mkl-service=1.1.2
- openssl=1.0.2l=0
- pip=9.0.1=py36_1
- python=3.6.2=0
- readline=6.2=2
- setuptools=36.4.0=py36_1
- sqlite=3.13.0=0
- tk=8.5.18=0
- wheel=0.29.0=py36_0
- xz=5.2.3=0
- zlib=1.2.11=0
- pip:
  - biopython==1.70
  - bleach==1.5.0
  - cycler==0.10.0
  - enum34==1.1.6
  - h5py==2.7.1
  - html5lib==0.9999999
  - joblib==0.11
  - keras==2.2.0
  - markdown==2.6.9
  - matplotlib==2.1.0
  - numpy==1.13.3
  - pandas==0.21.0
  - patsy==0.4.1
  - protobuf==3.5.0.post1
  - pymc3==3.1
  - pyparsing==2.2.0
  - pysam==0.13
  - python-dateutil==2.6.1
  - pytz==2017.3
  - pyvcf==0.6.8
  - pyyaml==3.12
  - scikit-learn==0.19.1
  - scipy==1.0.0
  - six==1.11.0
  - tensorflow==1.9.0
  - theano==0.9.0
  - tqdm==4.19.4
  - werkzeug==0.12.2
  - gatkPythonPackageArchive.zip
```
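
For reference, the local setup described above can be sketched as a small shell helper. This is only an illustration: `create_gatk_env` is a hypothetical name, not something GATK or Leonardo ships. It assumes conda is on the PATH and that gatkPythonPackageArchive.zip sits next to the environment file, since the yaml's pip section refers to the archive by a relative path.

```shell
#!/usr/bin/env bash
# Sketch only: create_gatk_env is a hypothetical helper, not part of GATK or Leonardo.
# Assumes conda is on PATH and that gatkPythonPackageArchive.zip lives in the same
# directory as the environment file (the yaml refers to it by a relative path).

create_gatk_env() {
  local env_file="${1:-gatkcondaenv.yml}"
  local env_dir

  if [ ! -f "$env_file" ]; then
    echo "environment file not found: $env_file" >&2
    return 1
  fi

  env_dir="$(dirname "$env_file")"
  if [ ! -f "$env_dir/gatkPythonPackageArchive.zip" ]; then
    echo "gatkPythonPackageArchive.zip not found in $env_dir" >&2
    return 1
  fi

  # Run from the file's own directory so the relative zip path can resolve.
  # The environment name ("gatk") comes from the yaml's "name:" field.
  (cd "$env_dir" && conda env create -f "$(basename "$env_file")")
}
```

After `create_gatk_env path/to/gatkcondaenv.yml` succeeds, the environment would be activated with `source activate gatk` as described above (recent conda releases also accept `conda activate gatk`). The open question for Leonardo is where an equivalent step would run, for example in a custom image or at cluster start-up.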