Closed olgabot closed 4 years ago
For some reason, when I run this with docker, it seems that it can't even write the sbt:
But when I change to the folder and use my local sourmash, it seems to build fine??
(sourmash)
Mon 10 Aug - 17:04 ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a
olga@tesla bash .command.sh
== This is sourmash version 3.2.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading 1 files into SBT
loaded 153093 sigs; saving SBT under "vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip"
136585 of 306185 nodes saved
Hmm, it seems to be at least partly because my local sourmash was version 3.2.3
but the docker container had a newer version (3.3.0):
(sourmash)
Mon 10 Aug - 17:03 ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a
olga@tesla cat .command.sh
#!/bin/bash -euo pipefail
sourmash index \
--ksize 30 \
--dayhoff \
vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip \
vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig
(sourmash)
Mon 10 Aug - 17:04 ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a
olga@tesla bash .command.sh
== This is sourmash version 3.2.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading 1 files into SBT
loaded 153093 sigs; saving SBT under "vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip"
Finished saving nodes, now saving SBT json file.
(sourmash)
Mon 10 Aug - 17:41 ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a
olga@tesla ll
Permissions Size User Group Date Modified Name
.rw-r--r--@ 0 olga czb 10 Aug 16:41 .command.begin
.rw-r--r--@ 51k olga czb 10 Aug 16:50 .command.err
.rw-r--r--@ 51k olga czb 10 Aug 16:50 .command.log
.rw-r--r--@ 0 olga czb 10 Aug 16:41 .command.out
.rw-r--r--@ 10k olga czb 10 Aug 16:41 .command.run
.rw-r--r--@ 304 olga czb 10 Aug 16:41 .command.sh
.rw-r--r--@ 0 olga czb 10 Aug 16:41 .command.trace
.rw-r--r--@ 1 olga czb 10 Aug 16:50 .exitcode
drwxr-xr-x@ - olga czb 10 Aug 16:49 .sbt.vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true
drwxr-xr-x@ - olga czb 10 Aug 17:40 .sbt.vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip
.rw-r--r--@ 628M olga czb 10 Aug 16:50 vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip
.rw-r--r--@ 38M olga czb 10 Aug 17:40 vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip.sbt.json
lrwxrwxrwx@ 171 olga czb 10 Aug 16:41 vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig -> /mnt/ibm_sm/olga/nextflow-work/9c/75606842cfb9894ba7575f7a7a19b3/vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig
(base)
✘ Mon 10 Aug - 17:45 ~/data_sm/tabula-microcebus/analyses/predictorthologs/kmermaid-minitest-30cells/narrow_group_gather/reference/sourmash
olga@tesla docker run czbiohub/predictorthologs:dev sourmash info
== This is sourmash version 3.3.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
sourmash version 3.3.0
- loaded from path: /opt/conda/envs/nf-core-predictorthologs-1.0dev/lib/python3.7/site-packages/sourmash/cli
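To make the local-vs-container comparison above less manual, here is a minimal sketch that pulls the version out of the banner sourmash prints. The function name `parse_sourmash_version` is hypothetical, and the regular expression only assumes the banner format shown in the logs above (plain numeric versions, no release-candidate suffixes).

```python
import re

# Matches e.g. "sourmash version 3.2.3" inside the banner line.
BANNER_RE = re.compile(r"sourmash version (\d+(?:\.\d+)*)")


def parse_sourmash_version(text):
    """Return the version found in a sourmash banner as a tuple of ints, or None."""
    match = BANNER_RE.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Banner lines copied from the logs in this thread.
local = parse_sourmash_version("== This is sourmash version 3.2.3. ==")
container = parse_sourmash_version("== This is sourmash version 3.3.0. ==")

if local != container:
    print(f"version mismatch: local {local} vs container {container}")
```

Comparing tuples of ints rather than strings keeps the ordering correct for multi-digit components such as 3.10.0.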
Upgrading sourmash locally to 3.4.1 seems to be working okay so far ...
(sourmash)
Mon 10 Aug - 17:44 ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a
olga@tesla bash .command.sh
== This is sourmash version 3.4.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading 1 files into SBT
<<<le-dayhoff__ksize-30__scaled-1__track_abundance-true.sig' / 81460 sigs total
There was an error but the index seemed saved okay?
(sourmash)
Mon 10 Aug - 17:44 ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a
olga@tesla bash .command.sh
== This is sourmash version 3.4.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading 1 files into SBT
<<<ed__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig'
loaded 153093 sigs; saving SBT under "vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip"
Finished saving nodes, now saving SBT index file.
Exception ignored in: <function _TemporaryFileCloser.__del__ at 0x7ff132002b90>
Traceback (most recent call last):
File "/home/olga/anaconda/envs/sourmash/lib/python3.7/tempfile.py", line 448, in __del__
self.close()
File "/home/olga/anaconda/envs/sourmash/lib/python3.7/tempfile.py", line 444, in close
unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp9oeyypa7'
Finished saving SBT index, available at /mnt/ibm_sm/olga/nextflow-work/a3/e07da17bac07a93158560abec59c0a/vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip
(sourmash)
Tue 11 Aug - 07:28 ~/data_sm/nextflow-work/a3/e07da17bac07a93158560abec59c0a
olga@tesla ll
Permissions Size User Group Date Modified Name
.rw-r--r--@ 0 olga czb 10 Aug 16:41 .command.begin
.rw-r--r--@ 51k olga czb 10 Aug 16:50 .command.err
.rw-r--r--@ 51k olga czb 10 Aug 16:50 .command.log
.rw-r--r--@ 0 olga czb 10 Aug 16:41 .command.out
.rw-r--r--@ 10k olga czb 10 Aug 16:41 .command.run
.rw-r--r--@ 304 olga czb 10 Aug 16:41 .command.sh
.rw-r--r--@ 0 olga czb 10 Aug 16:41 .command.trace
.rw-r--r--@ 1 olga czb 10 Aug 16:50 .exitcode
drwxr-xr-x@ - olga czb 10 Aug 16:49 .sbt.vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true
drwxr-xr-x@ - olga czb 10 Aug 17:40 .sbt.vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip
.rw-------@ 852M olga czb 10 Aug 17:53 vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip
.rw-r--r--@ 38M olga czb 10 Aug 17:40 vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sbt.zip.sbt.json
lrwxrwxrwx@ 171 olga czb 10 Aug 16:41 vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig -> /mnt/ibm_sm/olga/nextflow-work/9c/75606842cfb9894ba7575f7a7a19b3/vertebrate_mammalian_concatenated__np_only__molecule-dayhoff__ksize-30__scaled-1__track_abundance-true.sig
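Since the run above printed an exception but still reported the index as saved, one cheap sanity check is to confirm that the `.sbt.zip` is at least a readable zip archive. This is only a sketch using the standard library; the function name `sbt_zip_looks_intact` is hypothetical, and passing this check does not prove that sourmash can load the index, only that the container file is not truncated or corrupt.

```python
import zipfile


def sbt_zip_looks_intact(path):
    """Return True if `path` is a zip archive whose members all pass their CRC check."""
    # A truncated file has no end-of-central-directory record and fails here.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None if all are fine.
        return archive.testzip() is None


# Example call (the path is a placeholder):
# sbt_zip_looks_intact("my_index.sbt.zip")
```

Note that `testzip()` reads every member, so on an archive of several hundred megabytes this will take a while.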
Omg! Tests finally pass!!!
TODOs:
e.g. these
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
/* -- -- */
/* -- DOWNLOAD REFSEQ REFERENCE PROTEOME -- */
/* -- -- */
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
/*
* STEP 6 - rsync to download refseq
*/
Spent a bunch of time on this but I don't think it ended up being useful. When doing "sourmash search" using hashes from short k-mers (I was using dayhoff with a DNA ksize of 30, which means an amino acid ksize of 10), the hashes are effectively random, and large genes get picked up. So unfortunately I think that hash2kmer + DIAMOND is still the way to go. This may be a helpful option to turn on with a flag, but for my specific application, it's not useful.
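As a rough illustration of the k-mer sizes mentioned above, the sketch below works through the arithmetic and a toy Dayhoff encoding. The six amino acid groups are the standard Dayhoff classes; the letters `a` to `f` assigned to them and the helper name `dayhoff_encode` are assumptions made for this example. Reading "effectively random" as "too few distinct k-mers to be specific" is one possible interpretation, not something measured here.

```python
# Standard Dayhoff groups; the single-letter codes are illustrative.
DAYHOFF_GROUPS = {
    "C": "a",
    "AGPST": "b",
    "DENQ": "c",
    "HKR": "d",
    "ILMV": "e",
    "FWY": "f",
}
DAYHOFF = {aa: code for group, code in DAYHOFF_GROUPS.items() for aa in group}


def dayhoff_encode(protein):
    """Map each amino acid to the letter of its Dayhoff group."""
    return "".join(DAYHOFF[aa] for aa in protein.upper())


dna_ksize = 30
aa_ksize = dna_ksize // 3  # three nucleotides per amino acid, so 10

# Number of distinct k-mers possible in each alphabet at that size.
print(6 ** aa_ksize)   # Dayhoff alphabet: 60466176
print(20 ** aa_ksize)  # full amino acid alphabet: 10240000000000

# Two different peptides that collapse to the same Dayhoff string.
print(dayhoff_encode("MKVLA"), dayhoff_encode("LRILG"))  # edeeb edeeb
```

With only about 60 million possible Dayhoff 10-mers, unrelated proteins can share k-mers by chance, and longer proteins contribute more k-mers with which to do so.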
Currently, for differential hash expression, if one wants to search with DIAMOND then all the differential hashes must first be converted back to the k-mers and sequences that contain them. This is EXTREMELY time consuming as there can be many thousands of hashes (example below).
This PR first runs sourmash gather (search with iterative removal) on those hashes, then converts only the unassigned hashes to sequences and searches those with DIAMOND.

Addresses: https://github.com/czbiohub/nf-predictorthologs/issues/63 and https://github.com/czbiohub/nf-predictorthologs/issues/47
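The two-stage search described above amounts to a set partition of the differential hashes. The sketch below shows only that bookkeeping; it does not call sourmash or DIAMOND, and the function name `split_by_gather` and its inputs are hypothetical stand-ins for whatever the pipeline actually passes between steps.

```python
def split_by_gather(differential_hashes, gather_matched_hashes):
    """Partition differential hashes into those gather assigned and those it did not.

    Both arguments are iterables of integer hash values. Only the second
    returned set would need the slower hash-to-sequence conversion and
    DIAMOND search.
    """
    differential = set(differential_hashes)
    assigned = differential & set(gather_matched_hashes)
    unassigned = differential - assigned
    return assigned, unassigned


# Toy hash values, purely for illustration.
assigned, unassigned = split_by_gather([11, 22, 33, 44], [22, 44, 99])
print(sorted(assigned))    # [22, 44]
print(sorted(unassigned))  # [11, 33]
```

The saving comes from the size of `unassigned`: if gather accounts for most hashes, far fewer need to be converted back to sequences.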
PR checklist
Ensure the test suite passes (nextflow run . -profile test,docker).
Make sure your code lints (nf-core lint .).
Documentation in docs is updated
CHANGELOG.md is updated
README.md is updated

Related: https://github.com/dib-lab/sourmash/pull/1151