Open murakdar opened 5 years ago
Hi @murakdar -- were you able to fix this issue?
Hello @ecvgit. No, this issue remains unresolved.
Hi @murakdar, can you try specifying the GPU explicitly using -g0?
I was able to resolve this error. I think it happens because you are not using a compatible CUDNN version. I was able to use TF 12 with CUDNN 7.9.0 and CUDA 9.
Hello @alquraishi; adding -g0 helped, but now the problem is that I don't get any *.tertiary or *.recurrent_states output files, and the command ends with no feedback about why.
Here are the commands I tried and their output:
First, with python model/protling.py ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/configuration-test -d ../models/RGN12 -p -e weighted_testing -g0, the log file shows:
<...warnings and configuration snipped; similar to first comment...>
2019-07-31 15:15:20.614840: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-31 15:15:21.465724: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-31 15:15:21.466352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
totalMemory: 14.73GiB freeMemory: 14.52GiB
2019-07-31 15:15:21.466611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-31 15:15:35.049897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 15:15:35.049960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-07-31 15:15:35.049968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-07-31 15:15:35.050107: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15079 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2019-07-31 15:15:35.856331: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 14.73G (15812263936 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/model.py:454: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
To get rid of the resulting memory issue, I tried again with python model/protling.py ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/configuration-test -d ../models/RGN12 -p -e weighted_testing -g0 --gpu_fraction 0.9, which produced the following log:
<...warnings and configuration snipped; similar to first comment...>
2019-07-31 21:18:25.896157: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-31 21:18:26.093152: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-31 21:18:26.093743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
totalMemory: 14.73GiB freeMemory: 14.52GiB
2019-07-31 21:18:26.093764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-31 21:18:26.558373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 21:18:26.558445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-07-31 21:18:26.558455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-07-31 21:18:26.558575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13571 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/model.py:454: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
It stops running after ~15 seconds. The directory ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/11/outputsTesting/ gets created, but it is empty. I confirmed that no output files are generated anywhere else with a find command sorted by modification time. Other values of --gpu_fraction do not help.
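As a sanity check on whether --gpu_fraction takes effect at all, the figures in the two logs can be compared. The sketch below only redoes that arithmetic with numbers copied from the logs above. It assumes the flag acts as a plain multiplier on total device memory (as TensorFlow 1.x's per_process_gpu_memory_fraction does); that mapping is an assumption and has not been confirmed against protling.py in this thread.

```python
# Figures copied from the logs above.
TOTAL_BYTES = 15812263936   # "failed to allocate 14.73G (15812263936 bytes)"
LIMITED_DEVICE_MB = 13571   # "Created TensorFlow device" line with --gpu_fraction 0.9
MIB = 2 ** 20

def expected_device_mb(total_bytes, fraction):
    """Whole MiB TensorFlow would claim if the fraction scales total memory.

    Assumption: --gpu_fraction is a plain multiplier on total device memory.
    """
    return int(total_bytes * fraction / MIB)

predicted = expected_device_mb(TOTAL_BYTES, 0.9)
print(predicted, LIMITED_DEVICE_MB, predicted == LIMITED_DEVICE_MB)
```

The predicted and logged values agree (13571 MB), which suggests the flag is being honored and that the silent exit in the second run is probably not a memory problem.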
Any further ideas would be greatly appreciated.
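For anyone who wants to repeat the "no output files anywhere" check without find, here is a minimal Python sketch that lists files under a directory, newest first. The function name files_by_mtime is made up for illustration, and the root directory is taken from the paths used in the commands above; adjust both as needed.

```python
import os

def files_by_mtime(root):
    """Return (mtime, path) pairs for every file under root, newest first."""
    found = []
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            try:
                found.append((os.path.getmtime(path), path))
            except OSError:
                # Skip entries that vanish or cannot be read (e.g. broken symlinks).
                continue
    return sorted(found, reverse=True)

# Show the ten most recently modified files under the runs directory.
# os.walk yields nothing if the directory does not exist, so this is safe to run.
for mtime, path in files_by_mtime("../models/RGN12/runs")[:10]:
    print("%.0f %s" % (mtime, path))
```

If the run had written any *.tertiary or *.recurrent_states files, they would appear at the top of this listing.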
@ecvgit: I am presently using cuDNN 7.1.4. In my first comment, I believe I was using cuDNN 7.6.1. I tried downgrading to fix the issue but at some point got the error E tensorflow/stream_executor/cuda/cuda_dnn.cc:363] Loaded runtime CuDNN library: 7.0.5 but source was compiled with: 7.1.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration. So I ultimately settled on version 7.1.4 to ensure compatibility. Edited to add: no difference when using cuDNN 7.6.1.
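The compatibility rule in that error message can be written down as a small check. This is a sketch of the rule as the message states it for cuDNN 7.0 or later (same major version, loaded minor version equal or higher), not TensorFlow's actual implementation; the function names are made up for illustration.

```python
def parse_version(text):
    """Turn a version string such as '7.1.4' into the tuple (7, 1, 4)."""
    return tuple(int(part) for part in text.split("."))

def cudnn_runtime_ok(loaded, compiled):
    """Apply the rule quoted in the error message for cuDNN 7.0 or later.

    The major version must match, and the minor version loaded at runtime
    must be equal to or higher than the one TensorFlow was compiled against.
    """
    loaded_major, loaded_minor = parse_version(loaded)[:2]
    compiled_major, compiled_minor = parse_version(compiled)[:2]
    return loaded_major == compiled_major and loaded_minor >= compiled_minor

print(cudnn_runtime_ok("7.0.5", "7.1.4"))  # the pair from the error message
print(cudnn_runtime_ok("7.1.4", "7.1.4"))
print(cudnn_runtime_ok("7.6.1", "7.1.4"))
```

This prints False, True, True: under that rule 7.0.5 is rejected, while both 7.1.4 and 7.6.1 pass, which fits the observation that switching between those two made no difference.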
Could you try running it for CASP7?
@alquraishi Is it possible to share the .tertiary files for the models reported in the paper? I was able to generate the .tertiary files, but the DRMSD does not match -- which makes it hard to figure out if there is something wrong in my DRMSD computation vs using the wrong .tertiary files.
> Could you try running it for CASP7?
Tried, still the same behavior. @ecvgit, if I understand correctly, you have been able to run new predictions with the pre-trained model; could you perhaps share an example FASTA sequence file, corresponding .tfrecord file, and configuration file that I could drop in to one of the pre-trained models?
I did some further debugging and found that I'm hitting tf.errors.OutOfRangeError in the main loop. It's being thrown from RGNModel.predict at https://github.com/aqlaboratory/rgn/blob/0133213eea9aa95900d1f16c0c6b9febbeb394cb/model/model.py#L320-L321, which is ultimately calling a tf.Session.run() on the TF ops here. The TF ops being run (i.e. self._prediction_ops) look like this:
{'num_stepss': <tf.Tensor 'RGN/evaluation_wt_testing/num_stepss:0' shape=(1,) dtype=int32>,
'ids': <tf.Tensor 'RGN/evaluation_wt_testing/ids:0' shape=(1,) dtype=string>,
'coordinates': <tf.Tensor 'RGN/evaluation_wt_testing/point_to_coordinate:0' shape=(?, 1, 3) dtype=float32>,
'recurrent_states': <tf.Tensor 'RGN/evaluation_wt_testing/concat:0' shape=(?, 3200) dtype=float32>}
For what it's worth, here's the complete traceback for running an individual op:
(Pdb) session.run(ops['num_stepss'])
*** OutOfRangeError: PaddingFIFOQueue '_3_RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
[[node RGN/evaluation_wt_testing/batching_queue (defined at /home/dariusz/structure/aqlaboratory/rgn/model/model.py:549) = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue, RGN/evaluation_wt_testing/batching_queue/n)]]
[[{{node RGN/evaluation_wt_testing/batching_queue/_169}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_4_RGN/evaluation_wt_testing/batching_queue", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op u'RGN/evaluation_wt_testing/batching_queue', defined at:
File "model/protling.py", line 529, in <module>
while loop(args): pass
File "model/protling.py", line 337, in loop
models.update({'eval_wt_test': RGNModel('evaluation', configs['eval_wt_test'])})
File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 114, in __init__
self._create_graph(mode, self.config)
File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 179, in _create_graph
ids, primaries, evolutionaries, secondaries, tertiaries, masks, num_stepss = _dataflow(dataflow_config, max_length)
File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 549, in _dataflow
inputs = read_protein(file_queue, max_length, config['num_edge_residues'], config['num_evo_entries'])
File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 306, in new_func
return func(*args, **kwargs)
File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 1074, in maybe_batch
name=name)
File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 787, in _batch
dequeued = queue.dequeue_many(batch_size, name=name)
File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 478, in dequeue_many
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3487, in queue_dequeue_many_v2
component_types=component_types, timeout_ms=timeout_ms, name=name)
File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
OutOfRangeError (see above for traceback): PaddingFIFOQueue '_3_RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
[[node RGN/evaluation_wt_testing/batching_queue (defined at /home/dariusz/structure/aqlaboratory/rgn/model/model.py:549) = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue, RGN/evaluation_wt_testing/batching_queue/n)]]
[[{{node RGN/evaluation_wt_testing/batching_queue/_169}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_4_RGN/evaluation_wt_testing/batching_queue", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
I was able to run the predictions on the ProteinNet test set. I didn't make any changes to the config file; I just extracted RGN7.tar.gz and used the following command: python protling.py RGN7/runs/CASP7/ProteinNet7Thinning90/configuration -d RGN7 -p -e weighted_testing -g 0
I am now able to run predictions using the default configuration file as indicated -- thank you, @ecvgit and @alquraishi.
However, I am still unable to run predictions of a single new sequence.
The queue/range error in my last comment suggests my problem relates to the .tfrecord file output from the convert_to_tfrecord.py script.
Shall I continue here, or open a separate issue for that? (I'm tempted to prefer the latter, since the -g0 option does enable me to run and load on GPU.)
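A closed queue with zero elements commonly means the input pipeline produced no records at all, so one cheap check is whether the .tfrecord file actually contains any. The sketch below counts records by walking the TFRecord framing directly, without TensorFlow. It does not verify the CRC fields, so it only detects empty or truncated files, not corrupted payloads or records with the wrong features. The function name count_tfrecords and the dummy demo file are made up for illustration.

```python
import os
import struct
import tempfile

def count_tfrecords(path):
    """Count records in a TFRecord file by walking its length-prefixed framing.

    Each record is: an 8-byte little-endian length, a 4-byte CRC of the
    length, the payload, then a 4-byte CRC of the payload. The CRCs are
    skipped here rather than checked.
    """
    count = 0
    with open(path, "rb") as handle:
        while True:
            header = handle.read(12)        # length + length CRC
            if not header:
                return count                # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            body = handle.read(length + 4)  # payload + payload CRC
            if len(body) < length + 4:
                raise ValueError("truncated record payload")
            count += 1

# Demo on a throwaway file with two dummy records (CRC fields zeroed).
demo = tempfile.NamedTemporaryFile(suffix=".tfrecord", delete=False)
for payload in (b"abc", b"defgh"):
    demo.write(struct.pack("<Q", len(payload)) + b"\0" * 4 + payload + b"\0" * 4)
demo.close()
print(count_tfrecords(demo.name))
os.remove(demo.name)
```

Pointing count_tfrecords at the file written by convert_to_tfrecord.py would show whether it holds at least one record. A count of zero would be consistent with the "requested 1, current size 0" message, while a non-zero count would point toward the configuration (for example, the path or file pattern the model reads from) instead.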
I have been trying to predict the structure of a new sequence using the available pre-trained model (CASP11), but I've so far been unsuccessful in running the model. Note that I was equally unsuccessful in training a new model, with similar errors as below, but I will frame this in the context of the prediction task.
First, I successfully followed the input preparation steps provided in the README (i.e. using HMMER and convert scripts). Then, I slightly modified the configuration file to locate the .tfrecord files to be tested. From inside the rgn directory, I run python model/protling.py ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/configuration-test -d ../models/RGN12 -p -e weighted_testing. The resulting error is:
A complete log file is found at the end of this message. Training a new model based on the ProteinNet data sets also doesn't work for me, with a similar error. I suspect the underlying culprit is the following line:
However, I know that the machine does have a working GPU on which other applications can run. For example, the command python -c 'import tensorflow as tf; sess = tf.Session(); devices = sess.list_devices(); print(devices)' works as expected; the resulting output is:
I am using TensorFlow 1.12.0 with CUDA 9.0 on Python 2.7.12. Trying with or without export CUDA_VISIBLE_DEVICES=0 had no effect. I'd be happy to provide any additional information that could be useful.
Finally, I'm not sure if it's relevant to this particular issue, but I was also unable to successfully run python tests.py (from within rgn/models). (This is after extracting tests_data.zip and adjusting base_dir on line 20 accordingly.) After some deprecation warnings, here is the output from the first two unit tests:
The remaining tests all raise the same RuntimeError: Model already started; cannot create new objects. Moreover, running an individual test doesn't seem to produce any useful output:
Here is the complete output log file located in ../models/RGN12/logs/CASP12.log:
I greatly appreciate your time in helping to get this working on my end!