apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

TensorFlow error for nn-classification #16

Closed michoug closed 1 year ago

michoug commented 1 year ago

Hi. When running the end-to-end module with geNomad 1.5.0, I got the error below during the nn-classification step.

genomad end-to-end --cleanup --threads 25 GFS_2469.fa GFS_2469_genomad ~/Desktop/Databases/Genomad/genomad_db/

I'm running this on an Ubuntu machine with 250 GB of RAM, and it stops while using almost none of that memory.

[10:22:24] Executing genomad nn-classification.
[10:22:24] Creating the GFS_2469_genomad/GFS_2469_nn_classification directory.
[10:22:24] Creating the GFS_2469_genomad/GFS_2469_nn_classification/GFS_2469_encoded_sequences directory.
[10:22:26] Encoded sequence data written to GFS_2469_encoded_sequences.
Traceback (most recent call last):
  File "/home/river/miniconda3/envs/genomad/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 1239, in end_to_end
    ctx.invoke(
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 694, in nn_classification
    genomad.nn_classification.main(
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/nn_classification.py", line 285, in main
    contig_predictions = nn_model.predict(
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:

Detected at node 'model_1/model/conv1d/Pad' defined at (most recent call last):
    File "/home/river/miniconda3/envs/genomad/bin/genomad", line 10, in <module>
      sys.exit(cli())
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
      return self.main(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
      rv = super().main(*args, standalone_mode=False, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
      rv = self.invoke(ctx)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
      return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
      return __callback(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
      return f(get_current_context(), *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 1239, in end_to_end
      ctx.invoke(
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
      return __callback(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 694, in nn_classification
      genomad.nn_classification.main(
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/nn_classification.py", line 285, in main
      contig_predictions = nn_model.predict(
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2350, in predict
      tmp_batch_outputs = self.predict_function(iterator)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2137, in predict_function
      return step_function(self, iterator)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2123, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2111, in run_step
      outputs = model.predict_step(data)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2079, in predict_step
      return self(x, training=False)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 561, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 511, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 668, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 561, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 511, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 668, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/layers/convolutional/base_conv.py", line 276, in call
      inputs = tf.pad(inputs, self._compute_causal_padding(inputs))
Node: 'model_1/model/conv1d/Pad'
OOM when allocating tensor with shape[128,6002,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model_1/model/conv1d/Pad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_predict_function_942]
apcamargo commented 1 year ago

Hi @michoug

Can you check which version of TensorFlow you have installed? Just run python -c "import tensorflow as tf; print(tf.__version__)".

Also, does your machine have a GPU? It seems that it is the GPU that is running out of memory. If that's the case, I can release a fix that forces the use of the CPU.
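
Until such a fix is released, one common workaround is to hide the GPU from TensorFlow by setting the CUDA_VISIBLE_DEVICES environment variable to -1 before the process starts. This is a general CUDA/TensorFlow technique, not a geNomad option, and the helper names cpu_only_env and run_on_cpu below are hypothetical. A minimal sketch:

```python
import os
import subprocess


def cpu_only_env(base_env=None):
    """Return a copy of the environment in which CUDA devices are hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # "-1" is not a valid device index, so CUDA-based libraries see no GPU
    # and TensorFlow falls back to the CPU.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


def run_on_cpu(command):
    """Run `command` (a list of strings) with the GPU hidden; return its exit code."""
    return subprocess.run(command, env=cpu_only_env()).returncode


# Example call, reusing the command from the report (not executed here):
# run_on_cpu([
#     "genomad", "end-to-end", "--cleanup", "--threads", "25",
#     "GFS_2469.fa", "GFS_2469_genomad",
#     os.path.expanduser("~/Desktop/Databases/Genomad/genomad_db/"),
# ])
```

The variable has to be set before TensorFlow initializes CUDA, which is why the sketch passes it to a fresh process instead of changing it after the import.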

In the meantime, you can try to run the genomad nn-classification command with a smaller batch size (for example, --batch-size 32 or --batch-size 16). If this works, you can run the end-to-end command to execute the remaining modules.
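
To see why a smaller batch can help, consider the allocation that failed in the log: a tensor of shape [128,6002,257] and type float. Assuming the first dimension is the batch size and that "float" means 4-byte float32, the size of that single tensor scales linearly with --batch-size. A rough sketch of the arithmetic (the function name tensor_bytes is illustrative):

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the log is float32


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Memory needed to hold one dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


# Shape taken from the OOM message, with the first dimension varied.
for batch_size in (128, 32, 16):
    size = tensor_bytes((batch_size, 6002, 257))
    print(f"batch size {batch_size:>3}: {size / 2**20:.1f} MiB")
```

At batch size 128 this one tensor needs roughly 753 MiB, and at 16 roughly 94 MiB. This counts a single intermediate tensor only; the model weights and the other activations also live on the GPU, so the estimate is a lower bound and not a prediction of which batch size will fit.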

michoug commented 1 year ago

Hi @apcamargo, the TensorFlow version is 2.11.0 and there is a GPU with 4 GB of memory (NVIDIA T400). With --batch-size 32 I got the same error, but with --batch-size 16 I got a different one, which is probably a driver issue on my part:

[16:02:20] Executing genomad nn-classification.
[16:02:20] Creating the Genomad/336R_concoct_107_genomad/336R_concoct_107_nn_classification/336R_concoct_107_encoded_sequences directory.
[16:02:22] Encoded sequence data written to 336R_concoct_107_encoded_sequences.
⠹ Classifying sequences.
2023-03-15 16:02:26.054685: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:433] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2023-03-15 16:02:26.054820: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Possibly insufficient driver version: 510.108.3
Traceback (most recent call last):
  File "/home/river/miniconda3/envs/genomad/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 694, in nn_classification
    genomad.nn_classification.main(
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/nn_classification.py", line 285, in main
    contig_predictions = nn_model.predict(
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Graph execution error:

Detected at node 'model_1/model/conv1d/Conv1D' defined at (most recent call last):
    File "/home/river/miniconda3/envs/genomad/bin/genomad", line 10, in <module>
      sys.exit(cli())
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
      return self.main(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
      rv = super().main(*args, standalone_mode=False, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
      rv = self.invoke(ctx)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
      return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
      return __callback(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 694, in nn_classification
      genomad.nn_classification.main(
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/nn_classification.py", line 285, in main
      contig_predictions = nn_model.predict(
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2350, in predict
      tmp_batch_outputs = self.predict_function(iterator)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2137, in predict_function
      return step_function(self, iterator)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2123, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2111, in run_step
      outputs = model.predict_step(data)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2079, in predict_step
      return self(x, training=False)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 561, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 511, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 668, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 561, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 511, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 668, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/layers/convolutional/base_conv.py", line 283, in call
      outputs = self.convolution_op(inputs, self.kernel)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/layers/convolutional/base_conv.py", line 255, in convolution_op
      return tf.nn.convolution(
Node: 'model_1/model/conv1d/Conv1D'
DNN library is not found.
         [[{{node model_1/model/conv1d/Conv1D}}]] [Op:__inference_predict_function_942]
michoug commented 1 year ago

Hi, I resolved all the issues by installing specific versions of the CUDA libraries: cudatoolkit=11.2.2 and cudnn=8.1.0. Best, Greg
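
These versions line up with TensorFlow's tested build configurations, which list CUDA 11.2 and cuDNN 8.1 for TensorFlow 2.11. One way to check what an installed TensorFlow was built against, and whether it can see a GPU, is sketched below. The function name describe_gpu_setup is hypothetical, and the values returned depend on the local installation:

```python
def describe_gpu_setup():
    """Summarize TensorFlow's CUDA build info, or return None if it is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        # These keys are absent on CPU-only builds, hence .get().
        "built_for_cuda": build.get("cuda_version"),
        "built_for_cudnn": build.get("cudnn_version"),
        # An empty list means TensorFlow will run on the CPU.
        "visible_gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


print(describe_gpu_setup())
```

If the CUDA and cuDNN versions reported here differ from the libraries installed in the environment, errors such as "DNN library is not found" are a plausible outcome.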

apcamargo commented 1 year ago

Thanks for the feedback, @michoug. I'll add instructions for running geNomad on a GPU in a future update. This report was really useful!