StaPH-B / docker-builds

:package: :whale: Dockerfiles and documentation on tools for public health bioinformatics
GNU General Public License v3.0

adds checkv #1078

Closed Kincekara closed 2 weeks ago

Kincekara commented 2 weeks ago

CheckV is a tool for assessing the quality of metagenome-assembled viral genomes. It is actively maintained. It may be useful for metagenomics projects.

paper: https://www.nature.com/articles/s41587-020-00774-7
code: https://bitbucket.org/berkeleylab/checkv

Pull Request (PR) checklist:

erinyoung commented 2 weeks ago

It looks like the tests worked:

#11 [test 1/2] RUN checkv download_database /db
#11 0.285 
#11 0.285 CheckV v1.0.3: download_database
#11 0.285 [1/4] Checking latest version of CheckV's database...
#11 1.864 [2/4] Downloading 'checkv-db-v1.5'...
#11 37.35 [3/4] Extracting 'checkv-db-v1.5'...
#11 68.77 [4/4] Building DIAMOND database...
#11 111.8 Run time: 111.56 seconds
#11 111.8 Peak mem: 1.27 GB
#11 111.8 Download completed successfully.
#11 DONE 113.9s

#12 [test 2/2] RUN wget -q https://bitbucket.org/berkeleylab/checkv/raw/51a5293f75da04c5d9a938c9af9e2b879fa47bd8/test/test_sequences.fna &&    checkv end_to_end -d /db/checkv-db-v1.5 test_sequences.fna test_out -t 4
#12 0.778 
#12 0.778 CheckV v1.0.3: contamination
#12 0.778 [1/8] Reading database info...
#12 0.832 [2/8] Reading genome info...
#12 0.836 [3/8] Calling genes with prodigal-gv...
#12 3.859 [4/8] Reading gene info...
#12 3.880 [5/8] Running hmmsearch...
#12 38.94 [6/8] Annotating genes...
#12 38.95 [7/8] Identifying host regions...
#12 38.97 [8/8] Writing results...
#12 38.97 Run time: 38.2 seconds
#12 38.97 Peak mem: 0.16 GB
#12 38.98 
#12 38.98 CheckV v1.0.3: completeness
#12 38.98 [1/8] Skipping gene calling...
#12 38.98 [2/8] Initializing queries and database...
#12 39.33 [3/8] Running DIAMOND blastp search...
#12 48.45 [4/8] Computing AAI...
#12 48.71 [5/8] Running AAI based completeness estimation...
#12 48.79 [6/8] Running HMM based completeness estimation...
#12 48.85 [7/8] Determining genome copy number...
#12 48.97 [8/8] Writing results...
#12 48.98 Run time: 10.0 seconds
#12 48.98 Peak mem: 1.61 GB
#12 49.01 
#12 49.01 CheckV v1.0.3: complete_genomes
#12 49.01 [1/7] Reading input sequences...
#12 49.02 [2/7] Finding complete proviruses...
#12 49.02 [3/7] Finding direct/inverted terminal repeats...
#12 49.03 [4/7] Filtering terminal repeats...
#12 49.03 [5/7] Checking genome for completeness...
#12 49.03 [6/7] Checking genome for large duplications...
#12 49.03 [7/7] Writing results...
#12 49.03 Run time: 0.02 seconds
#12 49.03 Peak mem: 1.61 GB
#12 49.03 
#12 49.03 CheckV v1.0.3: quality_summary
#12 49.03 [1/6] Reading input sequences...
#12 49.03 [2/6] Reading results from contamination module...
#12 49.03 [3/6] Reading results from completeness module...
#12 49.03 [4/6] Reading results from complete genomes module...
#12 49.04 [5/6] Classifying contigs into quality tiers...
#12 49.04 [6/6] Writing results...
#12 49.04 Run time: 0.01 seconds
#12 49.04 Peak mem: 1.61 GB
#12 DONE 49.1s

How big is the database that it uses? Would it be worthwhile to include in the image?

Kincekara commented 2 weeks ago

The compressed database is 1.6 GB. The tool accepts a database path through the "-d" flag, so the database can be downloaded separately and kept outside the image.
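For anyone wanting to try the external-database route, here is a minimal sketch that wraps the two commands from the test log in `docker run` with the database bind-mounted from the host. The image tag `staphb/checkv:1.0.3`, the host directory, the `/data` working directory, the `DRY_RUN` switch, and the function names are all assumptions for illustration, not something confirmed in this PR.

```shell
# Sketch only: keep the CheckV database on the host and mount it at /db.
# ASSUMPTIONS (not from this PR): image tag, host paths, /data workdir,
# the DRY_RUN switch, and the helper function names.

IMAGE="${IMAGE:-staphb/checkv:1.0.3}"          # assumed tag; check the registry
DB_HOST_DIR="${DB_HOST_DIR:-$PWD/checkv_db}"   # hypothetical host directory for the database
DB_NAME="${DB_NAME:-checkv-db-v1.5}"           # database version seen in the test log
DRY_RUN="${DRY_RUN:-1}"                        # 1 = only print the docker command

# Run (or, in dry-run mode, print) a command inside the container with the
# database directory mounted at /db and the current directory at /data.
in_container() {
  set -- docker run --rm -v "$DB_HOST_DIR:/db" -v "$PWD:/data" -w /data "$IMAGE" "$@"
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# One-time download of the database into the mounted host directory.
download_db() {
  if [ "$DRY_RUN" != "1" ]; then
    mkdir -p "$DB_HOST_DIR"
  fi
  in_container checkv download_database /db
}

# Run the full pipeline against the external database via -d.
# Arguments: input FASTA, output directory, optional thread count (default 4).
run_checkv() {
  in_container checkv end_to_end -d "/db/$DB_NAME" "$1" "$2" -t "${3:-4}"
}
```

With `DRY_RUN=0`, calling `download_db` once and then `run_checkv test_sequences.fna test_out` should reuse the same host copy of the database across runs, provided the assumed image tag and paths match your setup.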

erinyoung commented 2 weeks ago

Sounds good. I'll merge and deploy this.

If we want to add a database to an image, we can do what bakta does and have two images (one with a database and one without).

erinyoung commented 2 weeks ago

The GitHub Actions run for the deployment can be followed here: https://github.com/StaPH-B/docker-builds/actions/runs/11184337714

The image should be up on Docker Hub and Quay soon.

Kincekara commented 2 weeks ago

> If we want to add a database to an image, we can do what bakta does and have two images (one with a database and one without).

That may be a good idea. I can open another PR when I have time.