apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

nn classification stalls for large datasets #3

Closed · jzrapp closed this issue 1 year ago

jzrapp commented 1 year ago

Hi again @apcamargo!

I've been able to successfully run geNomad on several datasets (metagenomes). During the nn-classification step, I always see an error in the log that says:

2022-10-16 22:38:39.339717: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

But the software keeps running and, in some cases, was able to finish successfully. However, for some very large datasets, the log shows this error and the software keeps running for >20 h without any additional information on whether something is happening in the background. I cannot find a way to see which process is being executed. I left those jobs running for now, still hoping that they will finish like the others, but it would be great if the log gave more information on what is going on.

Thanks!

apcamargo commented 1 year ago

Hi @jzrapp!

So, this error message just says that your hardware is not CUDA-capable (a CUDA GPU would make the neural network much faster). I'm not sure why you are getting the message, though. I've never seen it before. Can you provide your environment and hardware info?

As for the classification itself: geNomad classifies the data in batches to avoid overloading memory. You can try increasing the batch size using the --batch-size parameter (note that this parameter is only exposed if you run the nn-classification module separately). Let me know if you test this and find differences in classification speed.
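Something along these lines (just a sketch, not from your run: the file/directory names are placeholders, the batch-size value is arbitrary, and you should double-check the exact positional arguments with genomad nn-classification --help):

genomad nn-classification --batch-size 256 metagenome.fna genomad_output

A larger batch size pushes fewer, bigger chunks through the network, at the cost of more memory per batch.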

That said, I used geNomad to classify tens of thousands of metagenomes/metatranscriptomes (which included some pretty large assemblies) and the NN module never caused any issues (although it was the bottleneck for the really big assemblies).

If you don't want to test changing the batch size, you can always split the input FASTA and run multiple geNomad instances. Worst-case scenario, just skip the NN classification with --disable-nn-classification.
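If you go the splitting route, here's a rough sketch of the idea (file and database names are placeholders and the ~500 MB cutoff is arbitrary; the awk program only switches output files at header lines, so records stay intact):

awk 'BEGIN {n = 1}
     /^>/ && size > 500 * 1024 * 1024 {n++; size = 0}  # start a new chunk at the next header once ~500 MB has accumulated
     {size += length($0) + 1; print > ("chunk_" n ".fna")}' metagenome.fna

for chunk in chunk_*.fna; do
    genomad end-to-end "$chunk" "${chunk%.fna}_output" genomad_db
done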

I'm aware that the NN module can be slow if the input is very large and I do have some plans to improve the situation, but right now I'm focusing on finishing the paper. It's definitely something that will improve in future releases :)

jzrapp commented 1 year ago

Hi,

Sorry, I'm not sure if this is the info you wanted, but here is some info ;) My system is a shared compute cluster:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              1
Core(s) per socket:              16
Socket(s):                       8
NUMA node(s):                    8
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           63
Model name:                      Intel(R) Xeon(R) CPU E7-8867 v3 @ 2.50GHz
Stepping:                        4

For running geNomad I asked for 8 CPUs and 150 GB of memory (I wasn't really sure what I needed).

Here is the info on my conda env:

printenv
CONDA_BACKUP_RANLIB=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-ranlib
CONDA_SHLVL=2
LS_COLORS=no=00:fi=00:di=01;34:ln=00;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=41;33;01:ex=00;32:*.cmd=00;32:*.exe=01;32:*.com=01;32:*.bat=01;32:*.btm=01;32:*.dll=01;32:*.tar=00;31:*.tbz=00;31:*.tgz=00;31:*.rpm=00;31:*.deb=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.lzma=00;31:*.zip=00;31:*.zoo=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.tb2=00;31:*.tz2=00;31:*.tbz2=00;31:*.xz=00;31:*.avi=01;35:*.bmp=01;35:*.dl=01;35:*.fli=01;35:*.gif=01;35:*.gl=01;35:*.jpg=01;35:*.jpeg=01;35:*.mkv=01;35:*.mng=01;35:*.mov=01;35:*.mp4=01;35:*.mpg=01;35:*.pcx=01;35:*.pbm=01;35:*.pgm=01;35:*.png=01;35:*.ppm=01;35:*.svg=01;35:*.tga=01;35:*.tif=01;35:*.webm=01;35:*.webp=01;35:*.wmv=01;35:*.xbm=01;35:*.xcf=01;35:*.xpm=01;35:*.aiff=00;32:*.ape=00;32:*.au=00;32:*.flac=00;32:*.m4a=00;32:*.mid=00;32:*.mp3=00;32:*.mpc=00;32:*.ogg=00;32:*.voc=00;32:*.wav=00;32:*.wma=00;32:*.wv=00;32:
CONDA_EXE=/prg/miniconda/3/bin/conda
HOSTTYPE=x86_64
CONDA_BACKUP_OBJCOPY=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-objcopy
SSH_CONNECTION=10.247.251.37 57222 10.250.245.40 22
XAUTHLOCALHOSTNAME=manitou
LESSCLOSE=lessclose.sh %s %s
XKEYSYMDB=/usr/X11R6/lib/X11/XKeysymDB
CONDA_BACKUP_AR=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-ar
CONDA_BACKUP_AS=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-as
LANG=en_US.UTF-8
WINDOWMANAGER=/usr/bin/icewm-session
LESS=-M -I -R
CONDA_BACKUP_DEBUG_FORTRANFLAGS=-fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /prg/miniconda/3/include
CONDA_BACKUP_FC=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-gfortran
DISPLAY=localhost:16.0
JAVA_ROOT=/usr/lib64/jvm/java
CONDA_BACKUP_DEBUG_CXXFLAGS=-fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fvar-tracking-assignments -ffunction-sections -pipe -isystem /prg/miniconda/3/include
HOSTNAME=manitou
CONDA_BACKUP_FFLAGS=-fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /prg/miniconda/3/include
OLDPWD=/home/jorap2
CONFIG_SITE=/usr/share/site/x86_64-unknown-linux-gnu
CSHEDIT=emacs
GPG_TTY=/dev/pts/74
LESS_ADVANCED_PREPROCESSOR=no
CONDA_BACKUP_DEBUG_CFLAGS=-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fvar-tracking-assignments -ffunction-sections -pipe -isystem /prg/miniconda/3/include
CONDA_BACKUP_CC=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-cc
CONDA_BACKUP_CFLAGS=-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /prg/miniconda/3/include
COLORTERM=1
CONDA_PREFIX=/home/jorap2/.conda/envs/genomad
JAVA_HOME=/usr/lib64/jvm/java
MACHTYPE=x86_64-suse-linux
MINICOM=-c on
_CE_M=
QT_SYSTEM_DIR=/usr/share/desktop-data
OSTYPE=linux
XDG_SESSION_ID=118030
CONDA_BACKUP_STRIP=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-strip
USER=jorap2
PAGER=less
CONDA_PREFIX_1=/prg/miniconda/3
MODULE_VERSION=3.2.10
CONDA_BACKUP_CPPFLAGS=-DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /prg/miniconda/3/include
MORE=-sl
PWD=/home/jorap2/genomad_v1.1.0
HOME=/home/jorap2
CONDA_PYTHON_EXE=/prg/miniconda/3/bin/python
CONDA_BACKUP_GCC_RANLIB=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-gcc-ranlib
HOST=x86_64-conda_cos6-linux-gnu
SSH_CLIENT=10.247.251.37 57222 22
XNLSPATH=/usr/X11R6/lib/X11/nls
XDG_SESSION_TYPE=tty
SDK_HOME=/usr/lib64/jvm/java
XDG_DATA_DIRS=/usr/share
CONDA_BACKUP_STRINGS=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-strings
CONDA_BACKUP_CXXFILT=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-c++filt
CONDA_BACKUP_SIZE=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-size
CONDA_BACKUP_HOST=x86_64-conda_cos6-linux-gnu
_CE_CONDA=
LIBGL_DEBUG=quiet
JDK_HOME=/usr/lib64/jvm/java
PROFILEREAD=true
LOADEDMODULES=
CONDA_BACKUP_READELF=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-readelf
CONDA_BACKUP_CPP=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-cpp
CONDA_PROMPT_MODIFIER=(genomad) 
CONDA_BACKUP_LD=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-ld
SSH_TTY=/dev/pts/74
FROM_HEADER=
CONDA_BACKUP_CXX=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-c++
MAIL=/var/spool/mail/jorap2
LESSKEY=/etc/lesskey.bin
TERM=xterm-256color
SHELL=/bin/bash
XDG_SESSION_CLASS=user
CONDA_BACKUP_F77=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-gfortran
CONDA_BACKUP_GPROF=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-gprof
LS_OPTIONS=-N --color=tty -T 0
CONDA_BACKUP_ADDR2LINE=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-addr2line
PERL5LIB=
CONDA_BACKUP_F95=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-f95
CONDA_BACKUP_DEBUG_FFLAGS=-fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /prg/miniconda/3/include
PYTHONSTARTUP=/etc/pythonstart
CONDA_BACKUP_ELFEDIT=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-elfedit
SHLVL=1
G_FILENAME_ENCODING=@locale,UTF-8,ISO-8859-15,CP1252
CONDA_BACKUP_CMAKE_PREFIX_PATH=/prg/miniconda/3:/prg/miniconda/3/x86_64-conda_cos6-linux-gnu/sysroot/usr
MANPATH=/usr/local/man:/usr/local/share/man:/usr/share/man
CONDA_BACKUP_LDFLAGS=-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/prg/miniconda/3/lib -Wl,-rpath-link,/prg/miniconda/3/lib -L/prg/miniconda/3/lib
F90=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-gfortran
MODULEPATH=/prg/Modules/bioinfo:/prg/Modules/language:/prg/Modules/compilateurs:/prg/Modules/utilitaire:/usr/share/modules
LOGNAME=jorap2
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1568/bus
XDG_RUNTIME_DIR=/run/user/1568
MODULE_VERSION_STACK=3.2.10
LDFLAGS=-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/prg/miniconda/3/lib -Wl,-rpath-link,/prg/miniconda/3/lib -L/prg/miniconda/3/lib
CONDA_BACKUP_FORTRANFLAGS=-fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /prg/miniconda/3/include
CONDA_BACKUP_GXX=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-g++
JRE_HOME=/usr/lib64/jvm/java
CONDA_BACKUP_GCC_NM=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-gcc-nm
XDG_CONFIG_DIRS=/etc/xdg
PATH=/home/jorap2/.conda/envs/genomad/bin:/prg/miniconda/3/condabin:/home/jorap2/bin:/usr/local/bin:/usr/bin:/bin:/usr/lib/mit/sbin:/usr/local/cuda/bin
JAVA_BINDIR=/usr/lib64/jvm/java/bin
CONDA_BACKUP__CONDA_PYTHON_SYSCONFIGDATA_NAME=_sysconfigdata_x86_64_conda_cos6_linux_gnu
CONDA_BACKUP_GCC=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-gcc
MODULESHOME=/usr/share/Modules/3.2.10
CONDA_DEFAULT_ENV=genomad
G_BROKEN_FILENAMES=1
HISTSIZE=1000
CONDA_BACKUP_CONDA_BUILD_SYSROOT=/prg/miniconda/3/x86_64-conda_cos6-linux-gnu/sysroot
CONDA_BACKUP_DEBUG_CPPFLAGS=-D_DEBUG -D_FORTIFY_SOURCE=2 -Og -isystem /prg/miniconda/3/include
CONDA_BACKUP_CXXFLAGS=-fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /prg/miniconda/3/include
CONDA_BACKUP_GFORTRAN=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-gfortran
CPU=x86_64
SSH_SENDS_LOCALE=yes
CONDA_BACKUP_OBJDUMP=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-objdump
CONDA_BACKUP_GCC_AR=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-gcc-ar
CONDA_BACKUP_NM=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-nm
CVS_RSH=ssh
CONDA_BACKUP_LD_GOLD=/prg/miniconda/3/bin/x86_64-conda_cos6-linux-gnu-ld.gold
BASH_FUNC_module%%=() {  eval `/usr/share/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
}
_=/usr/bin/printenv

And sorry if you were asking for something entirely different :/ I will play with your suggestions some more next week! Thank you so much for being so responsive to these issues!

apcamargo commented 1 year ago

Thanks!

I'm still not sure what is causing the CUDA warning. I tried geNomad on a different server and got the same message. I'll investigate this.

In any case, the results were unaffected, so you don't need to worry about it :)

apcamargo commented 1 year ago

Hey @jzrapp, do you have any updates on this? I just wanted to check whether I can close this issue and update the tutorial with additional recommendations for big datasets.

jzrapp commented 1 year ago

Hi @apcamargo, I'm sorry for my delayed response! I didn't get to it at all last week and just restarted those analyses. The easiest and quickest thing for me was to split the inputs into 500 MB chunks. Hopefully that will do the trick for now. The other FASTA files of around that size finished without issue. Sorry that I cannot provide a nicer workaround :/

apcamargo commented 1 year ago

No problem! I'll write a section in the tutorial with recommendations for very big datasets (disabling the NN classification, reducing the search sensitivity, etc.).
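As a preview, the kind of invocation that section might recommend (just a sketch: file names are placeholders, the value shown is arbitrary, --disable-nn-classification is the flag mentioned above, and the name of the sensitivity option is an assumption here, so check genomad end-to-end --help):

genomad end-to-end --disable-nn-classification --sensitivity 4.0 metagenome.fna genomad_output genomad_db

Lowering the search sensitivity speeds up the marker search at some cost in recall.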

When I get the time, I'll update the NN module with a smaller model to make things faster. It is not a priority right now because it runs pretty quickly on the vast majority of datasets. There might be an issue with how TensorFlow is handling the data stream; I might investigate at which point this happens, since there's an abrupt slowdown once datasets get big enough.

Let me know if you have updates on this! (or anything else)