StaPH-B / docker-builds

:package: :whale: Dockerfiles and documentation on tools for public health bioinformatics
GNU General Public License v3.0

Add Dorado 0.8.0 #1051

Closed fraser-combe closed 2 weeks ago

fraser-combe commented 4 weeks ago

This pull request adds a Dockerfile and accompanying README for Dorado version 0.8.0.

Dorado is a high-performance, open-source tool developed by Oxford Nanopore Technologies for basecalling Oxford Nanopore sequencing data. It supports both CPU and GPU acceleration, providing rapid and accurate basecalling of Fast5/Pod5 files.

Key features of this Docker image:

- Dorado Version 0.8.0: Includes the latest stable release of Dorado.
- Pre-downloaded Basecalling Models: All necessary basecalling models are downloaded during the build process and stored in /dorado_models.
- Sample Pod5 Test File: A sample Pod5 file is included for testing purposes.
- Internal Test Stage: The Dockerfile includes an internal test stage that runs during the build process to ensure Dorado is installed correctly and functioning as expected.
- NVIDIA CUDA Support: Based on the NVIDIA CUDA 12.2.0 base image to enable GPU acceleration (requires an NVIDIA GPU and the NVIDIA Container Toolkit).

The README.md for Dorado 0.8.0 is longer than 30 lines because it includes detailed instructions and explanations necessary for users to effectively build, run, and test the Docker image. Given that Dorado utilizes GPU acceleration, it's important to provide comprehensive guidance.

Demonstration of GPU Functionality

I have tested the Dorado 0.8.0 Docker image on a GPU-enabled Linux machine to confirm that it utilizes GPU acceleration as intended.

System Details:

- GPU Model: NVIDIA Tesla T4
- Driver Version: 460.32.03
- CUDA Version: 12.2
- Operating System: Ubuntu 20.04

nvidia-smi Output:

Fri Sep 20 20:26:53 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:01:00.0 Off |                  Off |
| N/A   40C    P8     8W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Dorado Basecalling Output:

==========
== CUDA ==
==========

CUDA Version 12.2.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

[2024-09-20 20:28:50.255] [info] Running: "basecaller" "/dorado_models/dna_r10.4.1_e8.2_260bps_sup@v3.5.2" "/usr/src/app/dna_r10.4.1_e8.2_260bps-FLO_PRO114-SQK_NBD114_96_260-4000.pod5" "--emit-moves"
[2024-09-20 20:28:50.375] [info] > Creating basecall pipeline
[2024-09-20 20:28:52.273] [warning] Unable to find chunk benchmarks for GPU "Tesla T4", model /dorado_models/dna_r10.4.1_e8.2_260bps_sup@v3.5.2 and chunk size 1440. Full benchmarking will run for this device, which may take some time.
[2024-09-20 20:29:08.590] [info] cuda:0 using chunk size 10000, batch size 256
[2024-09-20 20:29:10.402] [info] cuda:0 using chunk size 5000, batch size 640
[2024-09-20 20:29:15.202] [info] > Simplex reads basecalled: 1
[2024-09-20 20:29:15.202] [info] > Basecalled @ Samples/s: 8.841649e+02
[2024-09-20 20:29:15.202] [info] > Finished


Sample SAM File Output:

`samtools view basecalled.sam`

@HD VN:1.6  SO:unknown
@SQ SN:reference    LN:0
@PG ID:dorado   PN:dorado   VN:0.8.0    CL:dorado basecaller /dorado_models/dna_r10.4.1_e8.2_260bps_sup@v3.5.2 /usr/src/app/dna_r10.4.1_e8.2_260bps-FLO_PRO114-SQK_NBD114_96_260-4000.pod5 --emit-moves
c4b111a0-90eb-436e-b6b1-cf14dee1fb93    4   *   0   0   *   *   0   0   TATGTCTCTGGTTCGGTTGGTCTTGCTAGACACAGGAAGGGGGCCAGGGTGTCAGAGAGCAGAAGATGGGGTGAGGAGTGGTGGGAGCCAGCGTGGAAGGTGTTGACTCTATGGTGACCTGGGTCCCCTCCTGCACCAAGTGGGGTGGCAGTGAGCAGGGTGACTGTCGTCTATGCT   `$$'(($##"##'+*((((*++,,--.4322236465329**3/BC>====CAAAAAEECDD5))));9;;BFKKIDHF>>>>>GA@BCEDA?999::IGEDDA@54//))))&&&''(((,23(CCDDGEDCEFDBCCABA@@CDDJKOKKHMGEEEFJHGCCABABBBBCDBCCDB` qs:f:14.8748    du:f:0.512  ns:i:2048   ts:i:10 mx:i:1  ch:i:1  st:Z:2023-11-25T19:15:40.684+00:00  rn:i:1  fn:Z:dna_r10.4.1_e8.2_260bps-FLO_PRO114-SQK_NBD114_96_260-4000.pod5 sm:f:920.55 sd:f:179.67 sv:Z:quantile   dx:i:0  RG:Z:test_dna_r10.4.1_e8.2_260bps_sup@v3.5.2    mv:B:c,5,1,0,1,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,1,1,1,0,0,1,1,1,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,0,1,0,1,0,1,0,1,1,1,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,0,1,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,1,0,1,0,0,1,1,0,0,1,0,1,0,1,0,1,1,0,1,1,1,1,0,1,1,0,1,0,0,1,0,0,0,1,1,0,1,1,1,1,0,1,1,0,1,1,0,1,1,0,0,0,1,0,1,1,0,0,1,0,1,0,0,0,0,1,0,1,1,0,1,0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,1,1,1,1,0,1,0,1,1,0,0,0,1,0,0,1,0,1,0,1,0,1,1,0,1,0,1,0,1,0,1,0,1,1,0,1,0,1,0,1,0,1,1,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,1,1,0,1,1,0,1,0,1,0,1,0,1,0

Pull Request (PR) checklist:

kapsakcj commented 4 weeks ago

Nice work on this Dockerfile 👍 OK, the GitHub Actions runners don't have enough storage space to run the automated tests. The large base image contributes to that. In lieu of the automated tests, I built the docker image locally to test it out. It builds successfully and the tests pass ✅

Requested changes:

/README.md:

/Program_licenses.md:

dorado/0.8.0/Dockerfile:

dorado/0.8.0/README.md:

One idea I had - users may not want to use a docker image that includes all models due to its large size. They may be OK with just downloading the model at runtime and using that to save on docker image download time & storage space.

They take up ~4.2GB:

$ du -sch /dorado_models
4.2G    /dorado_models
4.2G    total

What if we had a nearly identical dockerfile/docker image, but just remove the step to download all the model files? I think that might be useful to someone. Just a thought. We can do that in the future if someone requests it, not necessary to include with this PR.
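A rough sketch of what the runtime-download workflow could look like with such a slimmer image. Nothing here comes from this PR's Dockerfile: the image passed as the first argument is hypothetical (no model-free image exists yet), and the `DOCKER` variable is only a convenience added for this sketch so the command can be printed instead of run. It relies on `dorado download --model` saving the named model into the current working directory, which is the mounted `/data` directory here.

```bash
#!/usr/bin/env bash
# Sketch only: basecall with a model fetched at runtime instead of baked into the image.
# Assumption (not from this PR): the image has dorado on PATH but no /dorado_models.

DOCKER="${DOCKER:-docker}"   # set DOCKER=echo to print the command instead of running it

basecall_with_runtime_model() {
  local image="$1"   # hypothetical model-free image, e.g. a locally built tag
  local model="$2"   # model name, e.g. dna_r10.4.1_e8.2_260bps_sup@v3.5.2
  local pod5="$3"    # POD5 file name inside the current directory

  # Download the model into /data, then basecall with that downloaded directory.
  "$DOCKER" run --gpus all -v "$PWD":/data -w /data "$image" bash -c \
    "dorado download --model ${model} && dorado basecaller ./${model} ${pod5} --emit-moves > basecalled.sam"
}
```

The trade-off is that every fresh working directory pays the model download once, in exchange for not pulling ~4.2GB of models with the image.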

fraser-combe commented 3 weeks ago

I have updated the relevant files with the required information and removed the lines not needed, thanks for the review!

kapsakcj commented 2 weeks ago

Thanks for making those requested changes, Fraser. I made a few small updates to clarify info in the dorado-specific readme.md file and updated some links to match formatting from other links.

I was able to build the docker image locally, which runs the CPU-only test. It built successfully and the tests passed. Here's the command I ran:

docker build -t fraser/dorado:0.8.0-updatedSept30 dorado/0.8.0/

I will test on a local computer with an NVIDIA GPU just to be sure this docker image will work on another machine with a GPU prior to approving/merging/deploying.

kapsakcj commented 2 weeks ago

OK I tested running this on a Windows 10 computer with WSL2 using ubuntu 20, with an NVIDIA 3080Ti GPU.

Here's output of nvidia-smi:

$ nvidia-smi
Mon Sep 30 11:34:55 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.02              Driver Version: 560.94         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     On  |   00000000:09:00.0  On |                  N/A |
|  0%   60C    P0            120W /  319W |    1567MiB /  12288MiB |      8%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        24      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        35      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        36      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

Built the image locally with this command, which succeeded:

docker buildx build -f dorado/0.8.0/Dockerfile .

Downloaded the test POD5 file with the wget command from the readme, then basecalled using the GPU with this command:

$ docker run --gpus all -v $PWD:/data 4757f5abb7b5 bash -c "dorado basecaller /dorado_models/dna_r10.4.1_e8.2_260bps_sup@v3.5.2 /data/dna_r10.4.1_e8.2_260bps-FLO_PRO114-SQK_NBD114_96_260-4000.pod5 --emit-moves > /data/basecalled.sam"

==========
== CUDA ==
==========

CUDA Version 12.2.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

[2024-09-30 15:37:38.493] [info] Running: "basecaller" "/dorado_models/dna_r10.4.1_e8.2_260bps_sup@v3.5.2" "/data/dna_r10.4.1_e8.2_260bps-FLO_PRO114-SQK_NBD114_96_260-4000.pod5" "--emit-moves"
[2024-09-30 15:37:38.620] [info] > Creating basecall pipeline
[2024-09-30 15:37:39.275] [warning] Unable to find chunk benchmarks for GPU "NVIDIA GeForce RTX 3080 Ti", model /dorado_models/dna_r10.4.1_e8.2_260bps_sup@v3.5.2 and chunk size 1440. Full benchmarking will run for this device, which may take some time.
[2024-09-30 15:37:44.437] [info] cuda:0 using chunk size 10000, batch size 320
[2024-09-30 15:37:45.869] [info] cuda:0 using chunk size 5000, batch size 640
[2024-09-30 15:37:48.673] [info] > Simplex reads basecalled: 1
[2024-09-30 15:37:48.673] [info] > Basecalled @ Samples/s: 2.033932e+03
[2024-09-30 15:37:48.673] [info] > Finished

I was monitoring GPU activity with nvtop and saw a spike in GPU activity when running this ✅
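For comparing runs like the two in this thread, the throughput line can be pulled out of a saved log with a small helper. This is only a sketch: the function name and the `dorado.log` file name are made up for illustration, and it assumes the log text was captured to a file.

```bash
# Print the "Basecalled @ Samples/s" value from a dorado log read on stdin,
# converted from scientific notation to one decimal place.
samples_per_sec() {
  awk -F'Samples/s: ' '/Basecalled @ Samples\/s/ { printf "%.1f\n", $2 + 0 }'
}
```

Usage would be `samples_per_sec < dorado.log`. On the logs posted here it gives 884.2 for the Tesla T4 run and 2033.9 for the RTX 3080 Ti run. Both runs basecalled a single read, so these numbers are a smoke test rather than a benchmark.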

Double-checked the SAM file that was produced and it looks good:

# samtools view basecalled.sam
c4b111a0-90eb-436e-b6b1-cf14dee1fb93    4       *       0       0       *       *       0       0       TATGTCTCTGGTTCGGTTGGTCTTGCTAGACACAGGAAGGGGGCCAGGGTGTCAGAGAGCAGAAGATGGGGTGAGGAGTGGTGGGAGCCAGCGTGGAAGGTGTTGACTCTATGGTGACCTGGGTCCCCTCCTGCACCAAGTGGGGTGGCAGTGAGCAGGGTGACTGTCGTCTATGCT $$'(($##"##'+*((((*++,,--.4322236465329**3/BC>====CAAAAAEECDD5))));9;;BFKKIDHF>>>>>GA@BCEDA?999::IGEDDA@54//))))&&&''(((,23(CCDDGEDCEFDBCCABA@@CDDJKOKKHMGEEEFJHGCCABABBBBCDBCCDB    qs:f:14.8748    du:f:0.512      ns:i:2048       ts:i:10 mx:i:1  ch:i:1       st:Z:2023-11-25T19:15:40.684+00:00      rn:i:1  fn:Z:dna_r10.4.1_e8.2_260bps-FLO_PRO114-SQK_NBD114_96_260-4000.pod5     sm:f:920.55  sd:f:179.67     sv:Z:quantile   dx:i:0  RG:Z:test_dna_r10.4.1_e8.2_260bps_sup@v3.5.2    mv:B:c,5,1,0,1,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,1,1,1,0,0,1,1,1,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,0,1,0,1,0,1,0,1,1,1,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,1,0,0,0,1,1,0,0,0,1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,1,1,1,1,0,1,0,0,1,0,0,1,1,0,1,0,1,1,0,0,0,0,1,1,0,0,1,0,0,0,1,0,0,0,0,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,0,1,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,1,0,1,0,0,1,1,0,0,1,0,1,0,1,0,1,1,0,1,1,1,1,0,1,1,0,1,0,0,1,0,0,0,1,1,0,1,1,1,1,0,1,1,0,1,1,0,1,1,0,0,0,1,0,1,1,0,0,1,0,1,0,0,0,0,1,0,1,1,0,1,0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,1,1,1,1,0,1,0,1,1,0,0,0,1,0,0,1,0,1,0,1,0,1,1,0,1,0,1,0,1,0,1,0,1,1,0,1,0,1,0,1,0,1,1,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,1,1,0,1,1,0,1,0,1,0,1,0,1,0
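The "looks good" check can also be scripted. The sketch below is not part of the PR's tests; the function name is made up, and it assumes tab-separated SAM text on stdin in which every record carries a quality string (as the record above does). It checks only that SEQ (column 10) and QUAL (column 11) have the same length.

```bash
# Read SAM text on stdin; report record counts and fail if any record
# has SEQ and QUAL fields of different lengths.
check_sam_records() {
  awk -F'\t' '
    /^@/ { next }                                   # skip header lines
    {
      n++
      if (length($10) != length($11)) { bad++ }     # SEQ vs QUAL length
    }
    END {
      printf "records=%d mismatched=%d\n", n, bad
      exit (bad > 0)                                # non-zero exit on any mismatch
    }'
}
```

Usage would be `samtools view basecalled.sam | check_sam_records`.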

Thanks for your patience with all the requested changes and testing. GPU-enabled software is not super common in our collection of containers, so I want to ensure it runs smoothly from the start 👍

kapsakcj commented 2 weeks ago

deploy workflow might fail, but we will see: https://github.com/StaPH-B/docker-builds/actions/runs/11109669669

If it does I can deploy manually to dockerhub and quay

kapsakcj commented 2 weeks ago

I deployed the docker image manually. The hard drive is filling up on the github actions runner.