difficulty running/loading model on GPU

murakdar commented 5 years ago

I have been trying to predict the structure of a new sequence using the available pre-trained model (CASP11), but so far I have been unable to run the model. I was equally unsuccessful in training a new model, which failed with errors similar to those below, but I will frame this in the context of the prediction task.

First, I successfully followed the input preparation steps provided in the README (i.e. using HMMER and the convert scripts). Then, I slightly modified the configuration file to point to the .tfrecord files to be tested. From inside the rgn directory, I ran python model/protling.py ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/configuration-test -d ../models/RGN12 -p -e weighted_testing.

The resulting error is:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' with these attrs.  Registered devices: [CPU,XLA_CPU,XLA_GPU], Registered kernels:
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_HALF]

A complete log file is found at the end of this message. Training a new model based on the ProteinNet data sets also doesn't work for me, with a similar error. I suspect the underlying culprit is the following line:

2019-07-02 21:39:54.085506: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

However, I know that the machine does have a working GPU on which other applications can run. For example, the command python -c 'import tensorflow as tf; sess = tf.Session(); devices = sess.list_devices(); print(devices)' works as expected; the resulting output is:

2019-07-02 21:51:19.765631: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-02 21:51:19.923013: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-02 21:51:19.923728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
totalMemory: 14.73GiB freeMemory: 14.52GiB
2019-07-02 21:51:19.923765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-02 21:51:20.365814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-02 21:51:20.365876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-07-02 21:51:20.365895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-07-02 21:51:20.366064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14047 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 3962879756071663290), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 3582480176640480454), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 5594773058756615672), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 14730090906, 11541728406927441233)]

I am using TensorFlow 1.12.0 with CUDA 9.0 on Python 2.7.12. Trying with or without export CUDA_VISIBLE_DEVICES=0 had no effect. I'd be happy to provide any additional information that could be useful.
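Because the cuInit failure reports no CUDA-capable device while a plain TensorFlow session on the same machine sees the Tesla T4, it may be worth checking what CUDA_VISIBLE_DEVICES looks like inside the failing process, rather than in the shell that launched it. The following is a minimal, stdlib-only sketch. The function name describe_cuda_visibility is hypothetical, and it only interprets the variable; it does not query the driver.

```python
from __future__ import print_function
import os


def describe_cuda_visibility(environ=None):
    """Interpret CUDA_VISIBLE_DEVICES for the current process.

    `environ` defaults to os.environ; pass a plain dict to try other settings.
    Values such as "-1" or invalid IDs also hide devices, but this sketch
    does not attempt to detect them.
    """
    env = os.environ if environ is None else environ
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # Variable not set at all: CUDA enumerates every GPU on the machine.
        return "unset: all GPUs are visible"
    if value.strip() == "":
        # Set but empty: CUDA enumerates no devices.
        return "empty: no GPU is visible to this process"
    return "restricted to device(s): " + value


if __name__ == "__main__":
    print(describe_cuda_visibility())
```

Calling describe_cuda_visibility() from inside the script that fails (after argument parsing, before the session is created) would show whether something between the shell and TensorFlow has changed the variable.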

Finally, I'm not sure if it's relevant to this particular issue, but I was also unable to successfully run python tests.py (from within rgn/models). (This is after extracting tests_data.zip and adjusting base_dir on line 20 accordingly.) After some deprecation warnings, here is the output from the first two unit tests:

======================================================================
ERROR: testBidirectionalCudnnLSTM (__main__.CanonicalTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 1591, in testBidirectionalCudnnLSTM
    rtol=1e-4, atol=1e-4, use_gpu=True, restart_every_iteration=True)
  File "tests.py", line 223, in _testCore
    m_train.finish(sess, save=True, close_session=False, reset_graph=False)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 492, in _finish
    self._coordinator.join(self._threads)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 257, in _run
    enqueue_callable()
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1267, in _single_tensor_run
    results = self._call_tf_sessionrun(None, {}, fetch_list, [], None)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
NotFoundError: baseDirectory/data/CASP11Thinning30TwoResidueShiftEvoUniParcBakerJackHMMERNeg10JackHMMERNeg10/training/full/1; No such file or directory
     [[{{node RGN/model_0/read_protein/ReaderReadV2}} = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](RGN/model_0/read_protein/TFRecordReaderV2, RGN/model_0/file_queue)]]
     [[{{node RGN/model_0/batching_queue/cond/padding_fifo_queue_enqueue/_36}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_107_RGN/model_0/batching_queue/cond/padding_fifo_queue_enqueue", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

======================================================================
ERROR: testBidirectionality (__main__.CanonicalTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 506, in testBidirectionality
    [[9918.58933468,  8952.59069162,  9176.94079926,  8796.51937957, 12218.92350567,  10755.31931812,   9559.9827963 ,   5893.13110397, 5506.3973903 ,   7582.1031883 ,  10850.59082285,  11665.04905976, 10217.72346162,   8608.70925565,   4039.71197761,   9195.48430789, 12097.81036358,   9139.1117249 ,   7955.98830914,   7179.4971963 , 5227.11424296,   7736.59951981,  10184.379717  ,   7659.47643575, 8075.85901917,   2743.33191322]]]})
  File "tests.py", line 230, in _testCore
    m_train, m_evals = self._createModel(c_train, c_evals)
  File "tests.py", line 147, in _createModel
    m_train = RGNModel('training', c_train)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 117, in __init__
    raise RuntimeError('Model already started; cannot create new objects.')
RuntimeError: Model already started; cannot create new objects.

The remaining tests all raise the same RuntimeError: Model already started; cannot create new objects. Moreover, running an individual test doesn't seem to produce any useful output:

$ python tests.py CanonicalTest.testBidirectionality
ERROR:tensorflow:Starting: testBidirectionality
<...snipped warnings...>
ERROR:tensorflow:Finished: testBidirectionality
.
----------------------------------------------------------------------
Ran 1 test in 7.717s

OK

Here is the complete output log file located in ../models/RGN12/logs/CASP12.log:

WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/model.py:543: string_input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
WARNING:tensorflow:From /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py:276: input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
WARNING:tensorflow:From /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py:188: limit_epochs (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensors(tensor).repeat(num_epochs)`.
WARNING:tensorflow:From /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py:197: __init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py:197: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/net_ops.py:115: __init__ (from tensorflow.python.ops.io_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.TFRecordDataset`.
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/model.py:575: maybe_batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.filter(...).batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/net_ops.py:204: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/geom_ops.py:98: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
*** training configuration ***
{'architecture': {'all_to_all_peepholes': False,
                  'all_to_recurrent_skip_connections': False,
                  'alphabet_size': 60,
                  'alphabet_trainable': True,
                  'bidirectional': True,
                  'first_residual_connection_from_nth_layer': 1,
                  'higher_order_layers': True,
                  'include_dihedrals_between_layers': False,
                  'include_evolutionary': True,
                  'include_primary': True,
                  'include_recurrent_outputs_between_layers': True,
                  'input_to_recurrent_skip_connections': False,
                  'recurrent_layer_size': [800, 800],
                  'recurrent_nonlinear_out_proj_function': 'tanh',
                  'recurrent_nonlinear_out_proj_size': None,
                  'recurrent_peepholes': True,
                  'recurrent_to_output_skip_connections': False,
                  'recurrent_unit': 'CudnnLSTM',
                  'residual_connections_every_n_layers': None,
                  'tertiary_output': 'linear_alphabet'},
 'computing': {'allow_gpu_growth': False,
               'default_device': '',
               'fill_gpu': False,
               'functions_on_devices': {'/cpu:0': ['point_to_coordinate']},
               'gpu_fraction': 1.0,
               'num_cpus': 4,
               'num_reconstruction_fragments': 6,
               'num_reconstruction_parallel_iters': 4,
               'num_recurrent_parallel_iters': 1,
               'num_recurrent_shards': 1},
 'curriculum': {'base': 100.0,
                'behavior': None,
                'change_num_iterations': 5,
                'loss_history_subgroup': 'all',
                'mode': None,
                'rate': 0.002,
                'sharpness': 20.0,
                'slope': 1.0,
                'threshold': 5.0,
                'update_loss_history': False},
 'initialization': {'alphabet_init': {'dist': 'uniform', 'range': 3.14159},
                    'alphabet_seed': None,
                    'angle_shift': [0.0, 0.0, 0.0],
                    'dropout_seed': None,
                    'evolutionary_multiplier': 1.0,
                    'graph_seed': 426,
                    'queue_seed': None,
                    'recurrent_forget_bias': 1.0,
                    'recurrent_init': {'base': {'dist': 'uniform',
                                                'range': 0.01},
                                       'bias': {'dist': 'uniform',
                                                'range': 0}},
                    'recurrent_nonlinear_out_proj_init': {'base': {},
                                                          'bias': {}},
                    'recurrent_nonlinear_out_proj_seed': None,
                    'recurrent_out_proj_init': {'base': {'dist': 'uniform',
                                                         'range': 0.01},
                                                'bias': {'dist': 'uniform',
                                                         'range': 0}},
                    'recurrent_out_proj_seed': None,
                    'recurrent_seed': None,
                    'zoneout_seed': None},
 'io': {'alphabet_file': None,
        'checkpoint_every_n_hours': 24,
        'checkpoints_directory': '../models/RGN12/runs/CASP12/ProteinNet12Thinning90/checkpoints/',
        'data_files': None,
        'data_files_glob': '../models/RGN12/data/ProteinNet12Thinning90/training/[!a-z]*',
        'detailed_logs': True,
        'evaluation_sub_groups': ['10', '20', '30', '40', '50', '70', '90'],
        'log_alphabet': True,
        'log_model_summaries': True,
        'logs_directory': '../models/RGN12/runs/CASP12/ProteinNet12Thinning90/logs/',
        'max_checkpoints': None,
        'name': 'training',
        'num_edge_residues': 0,
        'num_evo_entries': 42},
 'loss': {'atoms': 'c_alpha',
          'batch_dependent_normalization': True,
          'include': True,
          'tertiary_normalization': 'first',
          'tertiary_weight': 1.0},
 'optimization': {'alphabet_temperature': 1.0,
                  'batch_size': 32,
                  'beta1': 0.95,
                  'beta2': 0.99,
                  'decay': 0.9,
                  'epsilon': 1e-07,
                  'gradient_threshold': 5.0,
                  'initial_accumulator_value': 0.1,
                  'learning_rate': 0.0001,
                  'momentum': 0.0,
                  'num_epochs': 100000,
                  'num_steps': 700,
                  'optimizer': 'adam',
                  'recurrent_threshold': None,
                  'rescale_behavior': 'norm_rescaling'},
 'queueing': {'batch_queue_capacity': 10000,
              'bucket_boundaries': None,
              'file_queue_capacity': 1000,
              'min_after_dequeue': 500,
              'num_evaluation_invocations': 1,
              'shuffle': True},
 'regularization': {'alphabet_keep_probability': 1.0,
                    'alphabet_normalization': None,
                    'recurrent_input_keep_probability': [0.5, 0.5],
                    'recurrent_keep_probability': 1.0,
                    'recurrent_layer_normalization': False,
                    'recurrent_memory_zonein_probability': 1.0,
                    'recurrent_nonlinear_out_proj_normalization': None,
                    'recurrent_output_keep_probability': 1.0,
                    'recurrent_state_zonein_probability': 1.0,
                    'recurrent_variational_dropout': False}}

*** weighted testing evaluation configuration ***
{'architecture': {'all_to_all_peepholes': False,
                  'all_to_recurrent_skip_connections': False,
                  'alphabet_size': 60,
                  'alphabet_trainable': True,
                  'bidirectional': True,
                  'first_residual_connection_from_nth_layer': 1,
                  'higher_order_layers': True,
                  'include_dihedrals_between_layers': False,
                  'include_evolutionary': True,
                  'include_primary': True,
                  'include_recurrent_outputs_between_layers': True,
                  'input_to_recurrent_skip_connections': False,
                  'recurrent_layer_size': [800, 800],
                  'recurrent_nonlinear_out_proj_function': 'tanh',
                  'recurrent_nonlinear_out_proj_size': None,
                  'recurrent_peepholes': True,
                  'recurrent_to_output_skip_connections': False,
                  'recurrent_unit': 'CudnnLSTM',
                  'residual_connections_every_n_layers': None,
                  'tertiary_output': 'linear_alphabet'},
 'computing': {'allow_gpu_growth': False,
               'default_device': '',
               'fill_gpu': False,
               'functions_on_devices': {'/cpu:0': ['point_to_coordinate']},
               'gpu_fraction': 1.0,
               'num_cpus': 4,
               'num_reconstruction_fragments': 6,
               'num_reconstruction_parallel_iters': 4,
               'num_recurrent_parallel_iters': 1,
               'num_recurrent_shards': 1},
 'curriculum': {'base': 100.0,
                'behavior': None,
                'change_num_iterations': 5,
                'loss_history_subgroup': 'all',
                'mode': None,
                'rate': 0.002,
                'sharpness': 20.0,
                'slope': 1.0,
                'threshold': 5.0,
                'update_loss_history': False},
 'initialization': {'alphabet_init': {'dist': 'uniform', 'range': 3.14159},
                    'alphabet_seed': None,
                    'angle_shift': [0.0, 0.0, 0.0],
                    'dropout_seed': None,
                    'evolutionary_multiplier': 1.0,
                    'graph_seed': 426,
                    'queue_seed': None,
                    'recurrent_forget_bias': 1.0,
                    'recurrent_init': {'base': {'dist': 'uniform',
                                                'range': 0.01},
                                       'bias': {'dist': 'uniform',
                                                'range': 0}},
                    'recurrent_nonlinear_out_proj_init': {'base': {},
                                                          'bias': {}},
                    'recurrent_nonlinear_out_proj_seed': None,
                    'recurrent_out_proj_init': {'base': {'dist': 'uniform',
                                                         'range': 0.01},
                                                'bias': {'dist': 'uniform',
                                                         'range': 0}},
                    'recurrent_out_proj_seed': None,
                    'recurrent_seed': None,
                    'zoneout_seed': None},
 'io': {'alphabet_file': None,
        'checkpoint_every_n_hours': 24,
        'checkpoints_directory': None,
        'data_files': None,
        'data_files_glob': '../models/RGN12/data/ProteinNet12Thinning90/testing/*.tfrecord',
        'detailed_logs': True,
        'evaluation_sub_groups': ['10', '20', '30', '40', '50', '70', '90'],
        'log_alphabet': True,
        'log_model_summaries': True,
        'logs_directory': None,
        'max_checkpoints': None,
        'name': 'evaluation_wt_testing',
        'num_edge_residues': 0,
        'num_evo_entries': 42},
 'loss': {'atoms': 'c_alpha',
          'batch_dependent_normalization': True,
          'include': False,
          'tertiary_normalization': 'first',
          'tertiary_weight': 1.0},
 'optimization': {'alphabet_temperature': 1.0,
                  'batch_size': 1,
                  'beta1': 0.95,
                  'beta2': 0.99,
                  'decay': 0.9,
                  'epsilon': 1e-07,
                  'gradient_threshold': 5.0,
                  'initial_accumulator_value': 0.1,
                  'learning_rate': 0.0001,
                  'momentum': 0.0,
                  'num_epochs': 1,
                  'num_steps': 700,
                  'optimizer': 'adam',
                  'recurrent_threshold': None,
                  'rescale_behavior': 'norm_rescaling'},
 'queueing': {'batch_queue_capacity': 300,
              'bucket_boundaries': None,
              'file_queue_capacity': 10,
              'min_after_dequeue': 10,
              'num_evaluation_invocations': 1,
              'shuffle': False},
 'regularization': {'alphabet_keep_probability': 1.0,
                    'alphabet_normalization': None,
                    'recurrent_input_keep_probability': [0.5, 0.5],
                    'recurrent_keep_probability': 1.0,
                    'recurrent_layer_normalization': False,
                    'recurrent_memory_zonein_probability': 1.0,
                    'recurrent_nonlinear_out_proj_normalization': None,
                    'recurrent_output_keep_probability': 1.0,
                    'recurrent_state_zonein_probability': 1.0,
                    'recurrent_variational_dropout': False}}
2019-07-02 21:39:54.072394: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-02 21:39:54.085506: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-07-02 21:39:54.085564: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: sequence-analysis
2019-07-02 21:39:54.085573: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: sequence-analysis
2019-07-02 21:39:54.085609: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 418.67.0
2019-07-02 21:39:54.085641: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 418.67.0
2019-07-02 21:39:54.085648: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 418.67.0
Traceback (most recent call last):
  File "model/protling.py", line 527, in <module>
    while loop(args): pass
  File "model/protling.py", line 379, in loop
    session = models['training'].start(models.values())
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 450, in _start
    self._saver.restore(session, latest_checkpoint)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1582, in restore
    err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' with these attrs.  Registered devices: [CPU,XLA_CPU,XLA_GPU], Registered kernels:
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_HALF]

     [[node RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py:1251)  = CudnnRNNCanonicalToParams[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", num_params=8, rnn_mode="lstm", seed=426, seed2=4497](RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_layers, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_units, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/input_size, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_1, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_2, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_3, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_4, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_5, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_6, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_7, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_8, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_9, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_10, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_11, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_12, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_13, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_14, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_15)]]

Caused by op u'RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams', defined at:
  File "model/protling.py", line 527, in <module>
    while loop(args): pass
  File "model/protling.py", line 301, in loop
    models.update({'training': RGNModel('training', configs['training'])})
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 114, in __init__
    self._create_graph(mode, self.config)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 200, in _create_graph
    recurrent_outputs, recurrent_states = _higher_recurrence(mode, recurrence_config, inputs, num_stepss, alphabet=alphabet)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 695, in _higher_recurrence
    layer_recurrent_outputs, layer_recurrent_states = _recurrence(mode, layer_config, layer_inputs, num_stepss)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 789, in _recurrence
    outputs_directed, (_, states_directed) = rnn(inputs_directed, training=is_training)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 374, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 746, in __call__
    self.build(input_shapes)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 352, in build
    opaque_params_t = self._canonical_to_opaque(weights, biases)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 474, in _canonical_to_opaque
    direction=self._direction)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1251, in cudnn_rnn_canonical_to_opaque_params
    name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 642, in cudnn_rnn_canonical_to_params
    name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' with these attrs.  Registered devices: [CPU,XLA_CPU,XLA_GPU], Registered kernels:
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_HALF]

     [[node RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py:1251)  = CudnnRNNCanonicalToParams[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", num_params=8, rnn_mode="lstm", seed=426, seed2=4497](RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_layers, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_units, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/input_size, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_1, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_2, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_3, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_4, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_5, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_6, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_7, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_8, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_9, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_10, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_11, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_12, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_13, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_14, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_15)]]

I greatly appreciate your time in helping to get this working on my end!

ecvgit commented 5 years ago

Hi @murakdar -- were you able to fix this issue?

murakdar commented 5 years ago

Hello @ecvgit. No, this issue remains unresolved.

alquraishi commented 5 years ago

Hi @murakdar, can you try specifying the GPU explicitly using -g0?
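One plausible reason an explicit -g0 could matter, which this thread does not confirm, is that a launcher may copy the flag's value into CUDA_VISIBLE_DEVICES and thereby override whatever the shell exported. The sketch below illustrates that pattern only. It is not the code in protling.py: the function apply_gpu_flag, the long option name, and the empty-string default are all assumptions made for illustration.

```python
import argparse


def apply_gpu_flag(argv, environ):
    """Parse a hypothetical -g flag and export it as CUDA_VISIBLE_DEVICES.

    The default of '' is an assumption; with it, omitting -g hides every GPU
    regardless of what the calling shell had exported.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("-g", "--gpu", default="")  # hypothetical definition
    args, _unknown = parser.parse_known_args(argv)
    # This assignment has to happen before TensorFlow initializes CUDA.
    environ["CUDA_VISIBLE_DEVICES"] = args.gpu
    return environ["CUDA_VISIBLE_DEVICES"]


# A shell-level export is overwritten when the flag is omitted.
shell_env = {"CUDA_VISIBLE_DEVICES": "0"}
apply_gpu_flag([], shell_env)
```

Under these assumptions, export CUDA_VISIBLE_DEVICES=0 would have no effect while -g0 would, which matches the behaviour reported above.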

ecvgit commented 5 years ago

I was able to resolve this error. I think it happens when the installed CUDNN version is not compatible with your TensorFlow build. I was able to use TF 1.12 with CUDNN 7.9.0 and CUDA 9.
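To see which cuDNN library a process would actually load, one option is to ask the library itself through ctypes. This is a rough sketch: cudnnGetVersion is part of the cuDNN C API, but the helper names cudnn_version and split_version are hypothetical, the decoding assumes the cuDNN 7.x numbering scheme (major*1000 + minor*100 + patch), and ctypes.util.find_library may not search the same locations TensorFlow does.

```python
from __future__ import print_function
import ctypes
import ctypes.util


def cudnn_version():
    """Return the integer from cudnnGetVersion(), or None if unavailable."""
    name = ctypes.util.find_library("cudnn")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
        get_version = lib.cudnnGetVersion
    except (OSError, AttributeError):
        # Library could not be loaded, or the symbol is missing.
        return None
    get_version.restype = ctypes.c_size_t
    return int(get_version())


def split_version(v):
    """Decode a cuDNN 7.x version integer into (major, minor, patch)."""
    return v // 1000, (v % 1000) // 100, v % 100


if __name__ == "__main__":
    v = cudnn_version()
    print("cuDNN not found" if v is None else split_version(v))
```

For example, split_version(7401) gives (7, 4, 1). The result can then be compared against the cuDNN version your TensorFlow build was compiled for.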

murakdar commented 5 years ago

Hello @alquraishi; adding -g0 helped, but now the problem is that I don't get any *.tertiary or *.recurrent_states output files, and the command ends with no feedback about why.
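When a run ends silently with no output files, one cheap thing to rule out is that the data glob in the configuration matches nothing from the current working directory. The globs in the logged configuration are relative paths (for example ../models/RGN12/data/ProteinNet12Thinning90/testing/*.tfrecord), so the result depends on where the command is launched. The helper below is a hypothetical pre-flight check, not part of the repository.

```python
from __future__ import print_function
import glob
import os


def count_matches(pattern, working_dir="."):
    """Count files matched by `pattern` when resolved from `working_dir`."""
    # os.path.join leaves an absolute pattern unchanged.
    full = os.path.join(os.path.abspath(working_dir), pattern)
    return len(glob.glob(full))


if __name__ == "__main__":
    pattern = "../models/RGN12/data/ProteinNet12Thinning90/testing/*.tfrecord"
    print(count_matches(pattern), "file(s) matched from", os.getcwd())
```

A count of zero would suggest a path or working-directory problem rather than a GPU problem; a non-zero count rules that explanation out.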

Here are the commands I tried and their output:

First, with python model/protling.py ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/configuration-test -d ../models/RGN12 -p -e weighted_testing -g0, the log file shows:

<...warnings and configuration snipped; similar to first comment...>
2019-07-31 15:15:20.614840: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-31 15:15:21.465724: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-31 15:15:21.466352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
totalMemory: 14.73GiB freeMemory: 14.52GiB
2019-07-31 15:15:21.466611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-31 15:15:35.049897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 15:15:35.049960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-07-31 15:15:35.049968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-07-31 15:15:35.050107: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15079 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2019-07-31 15:15:35.856331: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 14.73G (15812263936 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/model.py:454: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.

To get rid of the resulting memory issue, I tried again with python model/protling.py ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/configuration-test -d ../models/RGN12 -p -e weighted_testing -g0 --gpu_fraction 0.9, which produced the following log:

<...warnings and configuration snipped; similar to first comment...>
2019-07-31 21:18:25.896157: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-31 21:18:26.093152: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-31 21:18:26.093743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
totalMemory: 14.73GiB freeMemory: 14.52GiB
2019-07-31 21:18:26.093764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-31 21:18:26.558373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 21:18:26.558445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-07-31 21:18:26.558455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-07-31 21:18:26.558575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13571 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/model.py:454: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
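
The memory figures in the two "Created TensorFlow device" lines are consistent with --gpu_fraction scaling the device's total memory. That is an inference from the numbers in these logs, not a statement about how protling.py implements the flag. A quick arithmetic check (reserved_mib is a hypothetical helper):

```python
TOTAL_BYTES = 15812263936  # from the "failed to allocate 14.73G" log line
MIB = 1024 ** 2


def reserved_mib(total_bytes, fraction):
    """Whole MiB covered by `fraction` of the device memory."""
    return int(total_bytes * fraction) // MIB


print(reserved_mib(TOTAL_BYTES, 1.0))  # 15079, as in the first log
print(reserved_mib(TOTAL_BYTES, 0.9))  # 13571, as in the second log
```

With the fraction at 0.9 the request falls below the 14.52GiB reported as free, which would explain why the CUDA_ERROR_OUT_OF_MEMORY line disappears in the second run.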

It stops running after ~15 seconds. The directory ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/11/outputsTesting/ gets created, but it is empty. Using a find command sorted by modification time, I confirmed that no output files are generated anywhere else. Other values of --gpu_fraction do not help.
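
A Python equivalent of that modification-time check, as a minimal sketch (files_by_mtime is a hypothetical helper; the directory passed in is the base directory from the command above):

```python
import os


def files_by_mtime(root):
    """Return (mtime, path) for every file under root, newest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                found.append((os.path.getmtime(path), path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(found, reverse=True)


if __name__ == "__main__":
    # Show the 20 most recently modified files under the base directory.
    for mtime, path in files_by_mtime("../models/RGN12")[:20]:
        print(mtime, path)
```

If no *.tertiary or *.recurrent_states file appears near the top of the list after a run, nothing was written under that tree.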

Any further ideas would be greatly appreciated.

@ecvgit: I am presently using cuDNN 7.1.4. In my first comment, I believe I was using cuDNN 7.6.1. I tried downgrading to fix the issue, but at some point got this error: "E tensorflow/stream_executor/cuda/cuda_dnn.cc:363] Loaded runtime CuDNN library: 7.0.5 but source was compiled with: 7.1.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration." So I ultimately settled on version 7.1.4 to ensure compatibility. Edited to add: no difference when using cuDNN 7.6.1.
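
The compatibility rule stated in that error message can be written down as a small check. This is a sketch with hypothetical helper names; it encodes only the wording of the message, not TensorFlow's actual implementation:

```python
def parse_version(text):
    """Turn '7.1.4' into the tuple (7, 1, 4)."""
    return tuple(int(part) for part in text.split("."))


def cudnn_runtime_ok(loaded, compiled):
    """Apply the rule quoted in the error message.

    Major versions must match. From cuDNN 7.0 on, the loaded minor version
    may equal or exceed the compiled one; before 7.0 it must match exactly.
    The patch level is ignored.
    """
    loaded_major, loaded_minor = parse_version(loaded)[:2]
    compiled_major, compiled_minor = parse_version(compiled)[:2]
    if loaded_major != compiled_major:
        return False
    if compiled_major >= 7:
        return loaded_minor >= compiled_minor
    return loaded_minor == compiled_minor


print(cudnn_runtime_ok("7.0.5", "7.1.4"))  # the failing pair from the error
print(cudnn_runtime_ok("7.6.1", "7.1.4"))  # a newer minor version
```

Under this reading, a 7.0.5 runtime is rejected by a build compiled against 7.1.4, while 7.1.4 and 7.6.1 both pass, which matches the versions that loaded without this error.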

ecvgit commented 5 years ago

Could you try running it for CASP7?

ecvgit commented 5 years ago

@alquraishi Is it possible to share the .tertiary files for the models reported in the paper? I was able to generate the .tertiary files, but the DRMSD does not match -- which makes it hard to figure out if there is something wrong in my DRMSD computation vs using the wrong .tertiary files.

murakdar commented 5 years ago

> Could you try running it for CASP7?

Tried, still the same behavior. @ecvgit, if I understand correctly, you have been able to run new predictions with the pre-trained model; could you perhaps share an example FASTA sequence file, the corresponding .tfrecord file, and a configuration file that I could drop into one of the pre-trained models?

I did some further debugging and found that I'm hitting tf.errors.OutOfRangeError in the main loop. It's being thrown from RGNModel.predict at https://github.com/aqlaboratory/rgn/blob/0133213eea9aa95900d1f16c0c6b9febbeb394cb/model/model.py#L320-L321, which is ultimately calling a tf.Session.run() on the TF ops here. The TF ops being run (i.e. self._prediction_ops) look like this:

{'num_stepss': <tf.Tensor 'RGN/evaluation_wt_testing/num_stepss:0' shape=(1,) dtype=int32>,
 'ids': <tf.Tensor 'RGN/evaluation_wt_testing/ids:0' shape=(1,) dtype=string>,
 'coordinates': <tf.Tensor 'RGN/evaluation_wt_testing/point_to_coordinate:0' shape=(?, 1, 3) dtype=float32>,
 'recurrent_states': <tf.Tensor 'RGN/evaluation_wt_testing/concat:0' shape=(?, 3200) dtype=float32>}

For what it's worth, here's the complete traceback for running an individual op:

(Pdb) session.run(ops['num_stepss'])
*** OutOfRangeError: PaddingFIFOQueue '_3_RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
     [[node RGN/evaluation_wt_testing/batching_queue (defined at /home/dariusz/structure/aqlaboratory/rgn/model/model.py:549)  = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue, RGN/evaluation_wt_testing/batching_queue/n)]]
     [[{{node RGN/evaluation_wt_testing/batching_queue/_169}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_4_RGN/evaluation_wt_testing/batching_queue", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op u'RGN/evaluation_wt_testing/batching_queue', defined at:
  File "model/protling.py", line 529, in <module>
    while loop(args): pass
  File "model/protling.py", line 337, in loop
    models.update({'eval_wt_test': RGNModel('evaluation', configs['eval_wt_test'])})
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 114, in __init__
    self._create_graph(mode, self.config)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 179, in _create_graph
    ids, primaries, evolutionaries, secondaries, tertiaries, masks, num_stepss = _dataflow(dataflow_config, max_length)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 549, in _dataflow
    inputs = read_protein(file_queue, max_length, config['num_edge_residues'], config['num_evo_entries'])
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 1074, in maybe_batch
    name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 787, in _batch
    dequeued = queue.dequeue_many(batch_size, name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 478, in dequeue_many
    self._queue_ref, n=n, component_types=self._dtypes, name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3487, in queue_dequeue_many_v2
    component_types=component_types, timeout_ms=timeout_ms, name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): PaddingFIFOQueue '_3_RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
     [[node RGN/evaluation_wt_testing/batching_queue (defined at /home/dariusz/structure/aqlaboratory/rgn/model/model.py:549)  = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue, RGN/evaluation_wt_testing/batching_queue/n)]]
     [[{{node RGN/evaluation_wt_testing/batching_queue/_169}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_4_RGN/evaluation_wt_testing/batching_queue", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

ecvgit commented 5 years ago

I was able to run the predictions on the ProteinNet test set. I didn't make any changes to the config file. I just extracted RGN7.tar.gz and used the following command: python protling.py RGN7/runs/CASP7/ProteinNet7Thinning90/configuration -d RGN7 -p -e weighted_testing -g 0

murakdar commented 5 years ago

I am now able to run predictions using the default configuration file as indicated -- thank you, @ecvgit and @alquraishi.

However, I am still unable to run predictions of a single new sequence.

The queue/range error in my last comment suggests my problem relates to the .tfrecord file output from the convert_to_tfrecord.py script.
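
One way to test that suspicion without TensorFlow is to count the records in the file by walking the TFRecord framing. This is a minimal sketch: it assumes an uncompressed file, does not verify checksums, and count_tfrecord_entries is a hypothetical helper name.

```python
import struct


def count_tfrecord_entries(path):
    """Count records by walking the TFRecord framing; CRCs are not verified.

    Each record is: uint64 length, uint32 CRC of the length, payload bytes,
    uint32 CRC of the payload (all little-endian).
    """
    count = 0
    with open(path, "rb") as handle:
        while True:
            header = handle.read(12)  # 8-byte length + 4-byte length CRC
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated header after %d records" % count)
            (length,) = struct.unpack("<Q", header[:8])
            body = handle.read(length + 4)  # payload + 4-byte payload CRC
            if len(body) < length + 4:
                raise ValueError("truncated payload after %d records" % count)
            count += 1
```

A count of 0 would be consistent with the "current size 0" in the queue error. A non-zero count would not rule out problems in the record contents, or in how the configuration locates the file.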

Shall I continue here, or open a separate issue for that? (I'm tempted to prefer the latter, since the -g0 option does enable me to load and run the model on the GPU.)

aqlaboratory / rgn

difficulty running/loading model on GPU #16