czbiohub-sf / nf-predictorthologs

*de novo* orthologous gene predictions from bam + bed or fasta/fastq data
MIT License
4 stars 2 forks source link

For diff hash, do sourmash search then diamond on unassigned hashes #79

Closed olgabot closed 4 years ago

olgabot commented 4 years ago

Currently, for differential hash expression if one wants to search with DIAMOND then first all the differential hashes must be converted to kmers/sequences with k-mers. This is EXTREMELY time consuming as there can be many thousands of hashes (example below).

executor >  local (23653)
[0b/735c28] process > get_software_versions                     [100%] 1 of 1 ✔
[a7/621f43] process > diff_hash (mouse)                         [100%] 3 of 3 ✔
[f0/65a821] process > sigs_with_hash (hash-675266686049797853)  [ 35%] 23616 of 68452
[-        ] process > hash2kmer                                 -
[-        ] process > diamond_blastp                            -
[e1/6c3f53] process > multiqc                                   [100%] 1 of 1 ✔
[c5/6350d8] process > output_documentation                      [100%] 1 of 1, cached: 1 ✔

This PR first does sourmash gather (search with iterative removal) on those hashes, then for the unassigned hashes, convert only those to sequences and then search with DIAMOND.

Addresses: https://github.com/czbiohub/nf-predictorthologs/issues/63 and https://github.com/czbiohub/nf-predictorthologs/issues/47

PR checklist

Related: https://github.com/dib-lab/sourmash/pull/1151

olgabot commented 4 years ago

For some reason, when I run this with docker, it seems that it can't even write the sbt:

Nextflow command ``` (base) ✘  Mon 10 Aug - 16:53  ~/code/tabula-microcebus--olgabot/kmermaid-at2-minitest-analyze-descriptions/workflows/predictorthologs/kmermaid-30cell-minitest   origin ☊ olgabot/kmermaid-at2-minitest-analyze-descriptions 2☀ 1●  olga@tesla  make all nextflow run -r olgabot/sourmash-then-diamond -latest -w ~/data_sm/nextflow-work/ czbiohub/nf-predictorthologs --diff_hash_expression --protein_searcher sourmash --sourmash_ksize 30 --sourmash_molecule dayhoff --diff_hash_penalty l2 --diff_hash_inverse_regularization_strength 1 --input_is_protein --protein_searcher sourmash --proteome_search_fasta /home/olga/data_lg/czbiohub-reference/ncbi/refseq/releases/refseq-release98-2020-02-06/vertebrate_mammalian/vertebrate_mammalian_concatenated__np_only.faa.gz --diamond_database /home/olga/data_lg/czbiohub-reference/ncbi/refseq/releases/refseq-release98-2020-02-06/vertebrate_mammalian_db.dmnd -profile docker --sourmash_scaled 10 -resume \ --csv narrow_group_gather.csv \ --outdir /home/olga/data_sm/tabula-microcebus/analyses/predictorthologs/kmermaid-minitest-30cells/narrow_group_gather N E X T F L O W ~ version 20.04.1-edge Pulling czbiohub/nf-predictorthologs ... Launching `czbiohub/nf-predictorthologs` [amazing_sanger] - revision: ccab9393d6 [olgabot/sourmash-then-diamond] Using protein fastas as input -- ignoring reads and bams WARN: The `into` operator should be used to connect two or more target channels -- consider to replace it with `.set { ch_all_signatures_flat_list_for_diff_hash }` WARN: The `into` operator should be used to connect two or more target channels -- consider to replace it with `.set { ch_group_to_signatures }` WARN: The `into` operator should be used to connect two or more target channels -- consider to replace it with `.set { ch_groups_with_all_signatures_for_diff_hash }` ---------------------------------------------------- ,--./,-. ___ __ __ __ ___ /,-._.--~' |\ | |__ __ / ` / \ |__) |__ } { | \| | \__, \__/ | \ |___ \`-._,-`-, `._,._,' nf-core/predictorthologs v1.0dev ---------------------------------------------------- Pipeline Release : olgabot/sourmash-then-diamond Run Name : amazing_sanger CSV of samples : narrow_group_gather.csv CSV of input reads : narrow_group_gather.csv Diff Hash : true sourmash ksize : 30 sourmash molecule : dayhoff Diff Hash abundance? : false Diff Hash C : 1 Diff Hash solver : saga Diff Hash penalty : l2 Proteome search ref : /home/olga/data_lg/czbiohub-reference/ncbi/refseq/releases/refseq-release98-2020-02-06/vertebrate_mammalian/vertebrate_mammalian_concatenated__np_only.faa.gz Protein searcher : sourmash sourmash scaled : 10 DIAMOND pre-build database: /home/olga/data_lg/czbiohub-reference/ncbi/refseq/releases/refseq-release98-2020-02-06/vertebrate_mammalian_db.dmnd Data Type : Paired-End Max Resources : 128 GB memory, 16 cpus, 10d time per job Container : docker - czbiohub/predictorthologs:dev Output dir : /home/olga/data_sm/tabula-microcebus/analyses/predictorthologs/kmermaid-minitest-30cells/narrow_group_gather Launch dir : /home/olga/code/tabula-microcebus--olgabot/kmermaid-at2-minitest-analyze-descriptions/workflows/predictorthologs/kmermaid-30cell-minitest Working dir : /home/olga/data_sm/nextflow-work Script dir : /home/olga/.nextflow/assets/czbiohub/nf-predictorthologs User : olga Config Profile : docker ---------------------------------------------------- ```
Sourmash error message ``` Error executing process > 'sourmash_db_index (vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-10__track_abundance-true)' Caused by: Process `sourmash_db_index (vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-10__track_abundance-true)` terminated with an error exit status (1) Command executed: sourmash index \ --ksize 30 \ --dayhoff \ vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-10__track_abundance-true.sbt.zip \ vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-10__track_abundance-true.sig Command exit status: 1 Command output: (empty) Command error: 152001 of 306185 nodes saved 152101 of 306185 nodes saved 152201 of 306185 nodes saved 152301 of 306185 nodes saved 152401 of 306185 nodes saved 152501 of 306185 nodes saved 152601 of 306185 nodes saved 152701 of 306185 nodes saved 152801 of 306185 nodes saved 152901 of 306185 nodes saved 153001 of 306185 nodes saved 153101 of 306185 nodes saved 153201 of 306185 nodes saved Traceback (most recent call last): File "/opt/conda/envs/nf-core-predictorthologs-1.0dev/lib/python3.7/site-packages/sourmash/sbt_storage.py", line 161, in save self._save_to_zf(self.zipfile, path, content) File "/opt/conda/envs/nf-core-predictorthologs-1.0dev/lib/python3.7/site-packages/sourmash/sbt_storage.py", line 152, in _save_to_zf raise ValueError("This will insert duplicated entries") ValueError: This will insert duplicated entries During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/envs/nf-core-predictorthologs-1.0dev/bin/sourmash", line 11, in sys.exit(main()) File "/opt/conda/envs/nf-core-predictorthologs-1.0dev/lib/python3.7/site-packages/sourmash/__main__.py", line 14, in main return mainmethod(args) File "/opt/conda/envs/nf-core-predictorthologs-1.0dev/lib/python3.7/site-packages/sourmash/cli/index.py", line 52, in main return sourmash.commands.index(args) File "/opt/conda/envs/nf-core-predictorthologs-1.0dev/lib/python3.7/site-packages/sourmash/commands.py", line 409, in index tree.save(args.sbt_name, sparseness=args.sparseness) File "/opt/conda/envs/nf-core-predictorthologs-1.0dev/lib/python3.7/site-packages/sourmash/sbt.py", line 565, in save node.save(os.path.join(subdir, data['filename'])) File "/opt/conda/envs/nf-core-predictorthologs-1.0dev/lib/python3.7/site-packages/sourmash/sbtmh.py", line 51, in save return self.storage.save(path, buf) File "/opt/conda/envs/nf-core-predictorthologs-1.0dev/lib/python3.7/site-packages/sourmash/sbt_storage.py", line 168, in save raise ValueError("can't write data") ValueError: can't write data Work dir: /home/olga/data_sm/nextflow-work/7c/0dc98180c96db1acc7dbf63d7b0702 Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run` ```

But when I change to the folder and use my local sourmash, it seems to build fine??

(sourmash)
 Mon 10 Aug - 17:04  ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a 
 olga@tesla  bash .command.sh

== This is sourmash version 3.2.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading 1 files into SBT

loaded 153093 sigs; saving SBT under "vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip"
136585 of 306185 nodes saved
olgabot commented 4 years ago

Hmm it seems to be at least partly due to that my local sourmash was version 3.2.3 but the docker container had a newer version,

(sourmash)
 Mon 10 Aug - 17:03  ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a 
 olga@tesla  cat .command.sh
#!/bin/bash -euo pipefail
sourmash index \
    --ksize 30 \
    --dayhoff \
    vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip \
    vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig
(sourmash)
 Mon 10 Aug - 17:04  ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a 
 olga@tesla  bash .command.sh

== This is sourmash version 3.2.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading 1 files into SBT

loaded 153093 sigs; saving SBT under "vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip"

Finished saving nodes, now saving SBT json file.
(sourmash)
 Mon 10 Aug - 17:41  ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a 
 olga@tesla  ll
Permissions Size User Group Date Modified Name
.rw-r--r--@    0 olga czb   10 Aug 16:41  .command.begin
.rw-r--r--@  51k olga czb   10 Aug 16:50  .command.err
.rw-r--r--@  51k olga czb   10 Aug 16:50  .command.log
.rw-r--r--@    0 olga czb   10 Aug 16:41  .command.out
.rw-r--r--@  10k olga czb   10 Aug 16:41  .command.run
.rw-r--r--@  304 olga czb   10 Aug 16:41  .command.sh
.rw-r--r--@    0 olga czb   10 Aug 16:41  .command.trace
.rw-r--r--@    1 olga czb   10 Aug 16:50  .exitcode
drwxr-xr-x@    - olga czb   10 Aug 16:49  .sbt.vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true
drwxr-xr-x@    - olga czb   10 Aug 17:40  .sbt.vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip
.rw-r--r--@ 628M olga czb   10 Aug 16:50  vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip
.rw-r--r--@  38M olga czb   10 Aug 17:40  vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip.sbt.json
lrwxrwxrwx@  171 olga czb   10 Aug 16:41  vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig -> /mnt/ibm_sm/olga/nextflow-work/9c/75606842cfb9894ba7575f7a7a19b3/vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig

Docker sourmash version

base)
 ✘  Mon 10 Aug - 17:45  ~/data_sm/tabula-microcebus/analyses/predictorthologs/kmermaid-minitest-30cells/narrow_group_gather/reference/sourmash 
 olga@tesla  docker run czbiohub/predictorthologs:dev sourmash info

== This is sourmash version 3.3.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

sourmash version 3.3.0
- loaded from path: /opt/conda/envs/nf-core-predictorthologs-1.0dev/lib/python3.7/site-packages/sourmash/cli
olgabot commented 4 years ago

Upgrading sourmash locally to 3.4.1 seems to be working okay so far ...

(sourmash)
 Mon 10 Aug - 17:44  ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a 
 olga@tesla  bash .command.sh                      
== This is sourmash version 3.4.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading 1 files into SBT
<<<le-dayhoff__ksize-30__scaled-1__track_abundance-true.sig' / 81460 sigs total
olgabot commented 4 years ago

There was an error but the index seemed saved okay?

(sourmash)
 Mon 10 Aug - 17:44  ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a 
 olga@tesla  bash .command.sh
== This is sourmash version 3.4.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading 1 files into SBT
<<<ed__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig'

loaded 153093 sigs; saving SBT under "vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip"
Finished saving nodes, now saving SBT index file.
Exception ignored in: <function _TemporaryFileCloser.__del__ at 0x7ff132002b90>
Traceback (most recent call last):
  File "/home/olga/anaconda/envs/sourmash/lib/python3.7/tempfile.py", line 448, in __del__
    self.close()
  File "/home/olga/anaconda/envs/sourmash/lib/python3.7/tempfile.py", line 444, in close
    unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp9oeyypa7'
Finished saving SBT index, available at /mnt/ibm_sm/olga/nextflow-work/a3/e07da17bac07a93158560abec59c0a/vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip

(sourmash)
 Tue 11 Aug - 07:28  ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a 
 olga@tesla  ll
Permissions Size User Group Date Modified Name
.rw-r--r--@    0 olga czb   10 Aug 16:41  .command.begin
.rw-r--r--@  51k olga czb   10 Aug 16:50  .command.err
.rw-r--r--@  51k olga czb   10 Aug 16:50  .command.log
.rw-r--r--@    0 olga czb   10 Aug 16:41  .command.out
.rw-r--r--@  10k olga czb   10 Aug 16:41  .command.run
.rw-r--r--@  304 olga czb   10 Aug 16:41  .command.sh
.rw-r--r--@    0 olga czb   10 Aug 16:41  .command.trace
.rw-r--r--@    1 olga czb   10 Aug 16:50  .exitcode
drwxr-xr-x@    - olga czb   10 Aug 16:49  .sbt.vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true
drwxr-xr-x@    - olga czb   10 Aug 17:40  .sbt.vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip
.rw-------@ 852M olga czb   10 Aug 17:53  vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip
.rw-r--r--@  38M olga czb   10 Aug 17:40  vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip.sbt.json
lrwxrwxrwx@  171 olga czb   10 Aug 16:41  vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig -> /mnt/ibm_sm/olga/nextflow-work/9c/75606842cfb9894ba7575f7a7a19b3/vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig
olgabot commented 4 years ago

Omg! Tests finally pass!!!

olgabot commented 4 years ago

TODOs:

e.g. these

///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
/* --                                                                     -- */
/* --              DOWNLOAD REFSEQ REFERENCE PROTEOME                     -- */
/* --                                                                     -- */
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
/*
 * STEP 6 - rsync to download refeseq
 */
olgabot commented 4 years ago

Spent a bunch of time on this but I don't think it ended up being useful. When doing "sourmash search" using hashes from short k-mers (I was using dayhoff, dnaksize=30 meaning a aaksize=10), then the hashes are effectively random, and large genes get picked up. So unfortunately I think that the hash2kmer + DIAMOND is still the way to go. This may be a helpful option to have to turn on with a flag, but for my specific application, it's not useful.