aqlaboratory / rgn

Recurrent Geometric Networks for end-to-end differentiable learning of protein structure
MIT License

Restoring from checkpoint failed. #3

Closed kad-ecoli closed 5 years ago

kad-ecoli commented 6 years ago

I am running TensorFlow 1.11 on 64-bit CentOS Linux 6.10. I downloaded the pre-trained model RGN7.tar.gz, untarred it to RGN7/, and ran protling.py as

python2.7 ../rgn/model/protling.py ../rgn/configurations/CASP7.config -d RGN7 -p

The prediction apparently failed with the following complaint. Is this caused by a mismatched TensorFlow version?

WARNING:tensorflow:From ~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/python/training/input.py:187: __init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From ~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/python/training/input.py:187: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From ~/end2end/rgn/model/geom_ops.py:98: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
*** training configuration ***
{'architecture': {'all_to_all_peepholes': False,
                  'all_to_recurrent_skip_connections': False,
                  'alphabet_size': 60,
                  'alphabet_trainable': True,
                  'bidirectional': True,
                  'first_residual_connection_from_nth_layer': 1,
                  'higher_order_layers': True,
                  'include_dihedrals_between_layers': False,
                  'include_evolutionary': True,
                  'include_primary': True,
                  'include_recurrent_outputs_between_layers': True,
                  'input_to_recurrent_skip_connections': False,
                  'recurrent_layer_size': [800, 800],
                  'recurrent_nonlinear_out_proj_function': 'tanh',
                  'recurrent_nonlinear_out_proj_size': None,
                  'recurrent_peepholes': True,
                  'recurrent_to_output_skip_connections': False,
                  'recurrent_unit': 'CudnnLSTM',
                  'residual_connections_every_n_layers': None,
                  'tertiary_output': 'linear_alphabet'},
 'computing': {'allow_gpu_growth': False,
               'default_device': '',
               'fill_gpu': False,
               'functions_on_devices': {'/cpu:0': ['point_to_coordinate']},
               'gpu_fraction': 1.0,
               'num_cpus': 4,
               'num_reconstruction_fragments': 6,
               'num_reconstruction_parallel_iters': 4,
               'num_recurrent_parallel_iters': 1,
               'num_recurrent_shards': 1},
 'curriculum': {'base': 100.0,
                'behavior': None,
                'change_num_iterations': 5,
                'loss_history_subgroup': 'all',
                'mode': None,
                'rate': 0.002,
                'sharpness': 20.0,
                'slope': 1.0,
                'threshold': 5.0,
                'update_loss_history': False},
 'initialization': {'alphabet_init': {'dist': 'uniform', 'range': 3.14159},
                    'alphabet_seed': None,
                    'angle_shift': [0.0, 0.0, 0.0],
                    'dropout_seed': None,
                    'evolutionary_multiplier': 1.0,
                    'graph_seed': 254,
                    'queue_seed': None,
                    'recurrent_forget_bias': 1.0,
                    'recurrent_init': {'base': {'dist': 'uniform',
                                                'range': 0.01},
                                       'bias': {'dist': 'uniform',
                                                'range': 0}},
                    'recurrent_nonlinear_out_proj_init': {'base': {},
                                                          'bias': {}},
                    'recurrent_nonlinear_out_proj_seed': None,
                    'recurrent_out_proj_init': {'base': {'dist': 'uniform',
                                                         'range': 0.01},
                                                'bias': {'dist': 'uniform',
                                                         'range': 0}},
                    'recurrent_out_proj_seed': None,
                    'recurrent_seed': None,
                    'zoneout_seed': None},
 'io': {'alphabet_file': None,
        'checkpoint_every_n_hours': 24,
        'checkpoints_directory': 'RGN7/runs/CASP7/ProteinNet7Thinning90/checkpoints/',
        'data_files': None,
        'data_files_glob': 'RGN7/data/ProteinNet7Thinning90/training/[!a-z]*',
        'detailed_logs': False,
        'evaluation_sub_groups': ['10', '20', '30', '40', '50', '70', '90'],
        'log_alphabet': False,
        'log_model_summaries': True,
        'logs_directory': 'RGN7/runs/CASP7/ProteinNet7Thinning90/logs/',
        'max_checkpoints': None,
        'name': 'training',
        'num_edge_residues': 0,
        'num_evo_entries': 42},
 'loss': {'atoms': 'c_alpha',
          'batch_dependent_normalization': True,
          'include': True,
          'tertiary_normalization': 'first',
          'tertiary_weight': 1.0},
 'optimization': {'alphabet_temperature': 1.0,
                  'batch_size': 32,
                  'beta1': 0.95,
                  'beta2': 0.99,
                  'decay': 0.9,
                  'epsilon': 1e-07,
                  'gradient_threshold': 5.0,
                  'initial_accumulator_value': 0.1,
                  'learning_rate': 0.0001,
                  'momentum': 0.0,
                  'num_epochs': 100000,
                  'num_steps': 700,
                  'optimizer': 'adam',
                  'recurrent_threshold': None,
                  'rescale_behavior': 'norm_rescaling'},
 'queueing': {'batch_queue_capacity': 10000,
              'bucket_boundaries': None,
              'file_queue_capacity': 1000,
              'min_after_dequeue': 500,
              'num_evaluation_invocations': 1,
              'shuffle': True},
 'regularization': {'alphabet_keep_probability': 1.0,
                    'alphabet_normalization': None,
                    'recurrent_input_keep_probability': [0.5, 0.5],
                    'recurrent_keep_probability': 1.0,
                    'recurrent_layer_normalization': False,
                    'recurrent_memory_zonein_probability': 1.0,
                    'recurrent_nonlinear_out_proj_normalization': None,
                    'recurrent_output_keep_probability': 1.0,
                    'recurrent_state_zonein_probability': 1.0,
                    'recurrent_variational_dropout': False}}

*** weighted validation evaluation configuration ***
{'architecture': {'all_to_all_peepholes': False,
                  'all_to_recurrent_skip_connections': False,
                  'alphabet_size': 60,
                  'alphabet_trainable': True,
                  'bidirectional': True,
                  'first_residual_connection_from_nth_layer': 1,
                  'higher_order_layers': True,
                  'include_dihedrals_between_layers': False,
                  'include_evolutionary': True,
                  'include_primary': True,
                  'include_recurrent_outputs_between_layers': True,
                  'input_to_recurrent_skip_connections': False,
                  'recurrent_layer_size': [800, 800],
                  'recurrent_nonlinear_out_proj_function': 'tanh',
                  'recurrent_nonlinear_out_proj_size': None,
                  'recurrent_peepholes': True,
                  'recurrent_to_output_skip_connections': False,
                  'recurrent_unit': 'CudnnLSTM',
                  'residual_connections_every_n_layers': None,
                  'tertiary_output': 'linear_alphabet'},
 'computing': {'allow_gpu_growth': False,
               'default_device': '',
               'fill_gpu': False,
               'functions_on_devices': {'/cpu:0': ['point_to_coordinate']},
               'gpu_fraction': 1.0,
               'num_cpus': 4,
               'num_reconstruction_fragments': 6,
               'num_reconstruction_parallel_iters': 4,
               'num_recurrent_parallel_iters': 1,
               'num_recurrent_shards': 1},
 'curriculum': {'base': 100.0,
                'behavior': None,
                'change_num_iterations': 5,
                'loss_history_subgroup': 'all',
                'mode': None,
                'rate': 0.002,
                'sharpness': 20.0,
                'slope': 1.0,
                'threshold': 5.0,
                'update_loss_history': True},
 'initialization': {'alphabet_init': {'dist': 'uniform', 'range': 3.14159},
                    'alphabet_seed': None,
                    'angle_shift': [0.0, 0.0, 0.0],
                    'dropout_seed': None,
                    'evolutionary_multiplier': 1.0,
                    'graph_seed': 254,
                    'queue_seed': None,
                    'recurrent_forget_bias': 1.0,
                    'recurrent_init': {'base': {'dist': 'uniform',
                                                'range': 0.01},
                                       'bias': {'dist': 'uniform',
                                                'range': 0}},
                    'recurrent_nonlinear_out_proj_init': {'base': {},
                                                          'bias': {}},
                    'recurrent_nonlinear_out_proj_seed': None,
                    'recurrent_out_proj_init': {'base': {'dist': 'uniform',
                                                         'range': 0.01},
                                                'bias': {'dist': 'uniform',
                                                         'range': 0}},
                    'recurrent_out_proj_seed': None,
                    'recurrent_seed': None,
                    'zoneout_seed': None},
 'io': {'alphabet_file': None,
        'checkpoint_every_n_hours': 24,
        'checkpoints_directory': None,
        'data_files': None,
        'data_files_glob': 'RGN7/data/ProteinNet7Thinning90/validation/1',
        'detailed_logs': False,
        'evaluation_sub_groups': ['10', '20', '30', '40', '50', '70', '90'],
        'log_alphabet': False,
        'log_model_summaries': True,
        'logs_directory': None,
        'max_checkpoints': None,
        'name': 'evaluation_wt_validation',
        'num_edge_residues': 0,
        'num_evo_entries': 42},
 'loss': {'atoms': 'c_alpha',
          'batch_dependent_normalization': True,
          'include': False,
          'tertiary_normalization': 'first',
          'tertiary_weight': 1.0},
 'optimization': {'alphabet_temperature': 1.0,
                  'batch_size': 1,
                  'beta1': 0.95,
                  'beta2': 0.99,
                  'decay': 0.9,
                  'epsilon': 1e-07,
                  'gradient_threshold': 5.0,
                  'initial_accumulator_value': 0.1,
                  'learning_rate': 0.0001,
                  'momentum': 0.0,
                  'num_epochs': 1,
                  'num_steps': 700,
                  'optimizer': 'adam',
                  'recurrent_threshold': None,
                  'rescale_behavior': 'norm_rescaling'},
 'queueing': {'batch_queue_capacity': 300,
              'bucket_boundaries': None,
              'file_queue_capacity': 10,
              'min_after_dequeue': 10,
              'num_evaluation_invocations': 1,
              'shuffle': False},
 'regularization': {'alphabet_keep_probability': 1.0,
                    'alphabet_normalization': None,
                    'recurrent_input_keep_probability': [0.5, 0.5],
                    'recurrent_keep_probability': 1.0,
                    'recurrent_layer_normalization': False,
                    'recurrent_memory_zonein_probability': 1.0,
                    'recurrent_nonlinear_out_proj_normalization': None,
                    'recurrent_output_keep_probability': 1.0,
                    'recurrent_state_zonein_probability': 1.0,
                    'recurrent_variational_dropout': False}}
2018-10-28 20:59:04.218884: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
Traceback (most recent call last):
  File "../rgn/model/protling.py", line 532, in <module>
    while loop(args): pass
  File "../rgn/model/protling.py", line 384, in loop
    session = models['training'].start(models.values())
  File "~/end2end/rgn/model/model.py", line 448, in _start
    self._saver.restore(session, latest_checkpoint)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1574, in restore
    err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' with these attrs.  Registered devices: [CPU], Registered kernels:
  <no registered kernels>

     [[{{node RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}} = CudnnRNNCanonicalToParams[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", num_params=8, rnn_mode="lstm", seed=254, seed2=4497](RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_layers, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_units, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/input_size, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_1, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_2, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_3, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_4, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_5, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_6, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_7, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_8, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_9, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_10, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_11, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_12, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_13, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_14, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_15)]]

Caused by op u'RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams', defined at:
  File "../rgn/model/protling.py", line 532, in <module>
    while loop(args): pass
  File "../rgn/model/protling.py", line 306, in loop
    models.update({'training': RGNModel('training', configs['training'])})
  File "~/end2end/rgn/model/model.py", line 114, in __init__
    self._create_graph(mode, self.config)
  File "~/end2end/rgn/model/model.py", line 200, in _create_graph
    recurrent_outputs, recurrent_states = _higher_recurrence(mode, recurrence_config, inputs, num_stepss, alphabet=alphabet)
  File "~/end2end/rgn/model/model.py", line 693, in _higher_recurrence
    layer_recurrent_outputs, layer_recurrent_states = _recurrence(mode, layer_config, layer_inputs, num_stepss)
  File "~/end2end/rgn/model/model.py", line 787, in _recurrence
    outputs_directed, (_, states_directed) = rnn(inputs_directed, training=is_training)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 364, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 759, in __call__
    self.build(input_shapes)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 352, in build
    opaque_params_t = self._canonical_to_opaque(weights, biases)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 474, in _canonical_to_opaque
    direction=self._direction)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1251, in cudnn_rnn_canonical_to_opaque_params
    name=name)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 642, in cudnn_rnn_canonical_to_params
    name=name)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "~/end2end/miniconda2.7/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' with these attrs.  Registered devices: [CPU], Registered kernels:
  <no registered kernels>

     [[{{node RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}} = CudnnRNNCanonicalToParams[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", num_params=8, rnn_mode="lstm", seed=254, seed2=4497](RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_layers, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_units, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/input_size, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_1, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_2, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_3, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_4, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_5, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_6, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_7, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_8, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_9, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_10, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_11, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_12, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_13, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_14, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_15)]]

alquraishi commented 6 years ago

Yes, the variable names changed between TF 1.1 and 1.4, so you will need a more recent TF to get it to work. Also note that the checkpointed models use cuDNN kernels.

kad-ecoli commented 6 years ago

There is some misunderstanding. I used TensorFlow 1.11 (>=1.4), not TensorFlow 1.1. TensorFlow 1.11 is one of the versions this repository is supposed to work with, as stated in the README.

TensorFlow 1.11 is already the most recent TensorFlow I can get from Anaconda as of this post.

alquraishi commented 6 years ago

My apologies--I misread your initial post as referring to TF 1.1. It's not a version compatibility issue then.

Judging by the error message, I'm guessing it's not finding the cuDNN kernels. Are you using an Nvidia GPU? The pre-trained models must be run on one, because training was done with the cuDNN LSTM kernels. TF does now support conversion between the cuDNN LSTMs and the vanilla TF ones, but I haven't implemented the functionality yet.
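
A quick way to confirm this diagnosis from a saved traceback is to look at the "Registered devices" list in the error text. The sketch below is a minimal, hypothetical helper (the function names are not part of rgn or TensorFlow); it only parses the message format shown in the log above and assumes device names are comma-separated inside the brackets.

```python
import re


def registered_devices(error_text):
    """Return the device types listed after 'Registered devices:' in an error message."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", error_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def missing_gpu(error_text):
    """True when the message lists registered devices and none of them is a GPU."""
    devices = registered_devices(error_text)
    return bool(devices) and "GPU" not in devices


# Error line copied from the traceback in this thread.
message = ("No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' "
           "with these attrs.  Registered devices: [CPU], Registered kernels:")
print(registered_devices(message))
print(missing_gpu(message))
```

If `missing_gpu` returns True, TensorFlow did not see a GPU in that session, which matches the explanation above that the cuDNN ops have no CPU kernel.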

kad-ecoli commented 6 years ago

So the lack of CUDA is the main reason. I guess I need to try to convert the model to make it work on a CPU.

alquraishi commented 6 years ago

Yes the cuDNN LSTM units are not currently being constructed in a way that makes them convertible between the CPU and GPU versions, but I know the latest TF supports conversion between the two. I will leave this issue open and see if I can get around to it myself, but if you make progress let me know as well!

amanchandra333 commented 6 years ago

I had the same issue and solved it by explicitly specifying the -g argument as 0. However, after the code runs to completion, where are the output files generated about the prediction?

alquraishi commented 6 years ago

That's unlikely to have worked. What do the logs say? Output should be in base/runs/runName/datasetName/...
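
To check whether a run actually wrote predictions, one option is to search the base directory for prediction files. This is a minimal sketch, assuming predictions carry the `.tertiary` extension mentioned later in this thread; `find_tertiary_files` is a hypothetical helper, not part of rgn.

```python
import os


def find_tertiary_files(base_dir):
    """Walk base_dir and return paths of files ending in '.tertiary', newest first."""
    found = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            if name.endswith(".tertiary"):
                found.append(os.path.join(root, name))
    # Sort by modification time so freshly written predictions come first.
    return sorted(found, key=os.path.getmtime, reverse=True)
```

Calling `find_tertiary_files("RGN7")` after a prediction run would list any such files under that base directory; an empty list suggests that nothing was written there.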

rowancallahan commented 5 years ago

Hi @alquraishi, I had the same error as @amanchandra333 using TensorFlow 1.10 and was able to resolve it by setting the graphics card to zero and the GPU fraction to 0.8. The code ran to completion when used on the CASP12 data set with the CASP12 configuration file. My output of nvidia-smi was

Fri Nov 16 18:02:19 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I ran protling as follows:

python rgn/model/protling.py rgn/configurations/CASP12.config -d CASP12/ -g 0 -f 0.9

The messages that were given at the end of the run from RGN12/log/CASP12.log after all of the model configuration data were

2018-11-15 21:10:02.399224: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-15 21:10:02.499942: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-15 21:10:02.500352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-15 21:10:02.500377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-15 21:10:02.811868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-15 21:10:02.811933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-11-15 21:10:02.811943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-11-15 21:10:02.812226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9152 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
WARNING:tensorflow:From /home/rlc343/rgn/model/model.py:452: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.

Unfortunately when I look in the folder RGN12/runs/CASP12/ProteinNet12Thinning90/1 there is only an error.log file which contains error and loss results, but no .tertiary files.

Any help would be appreciated! Thanks so much.

alquraishi commented 5 years ago

Hi @rowancallahan, are you trying to train a new model or just make predictions? If the latter then you need the -p option, and possibly -e depending on what set you want to make predictions for, e.g. -e weighted_testing if you want predictions for the test set.

rowancallahan commented 5 years ago

My apologies! I have been running with the -p parameter for predictions. My command should have read:

python rgn/model/protling.py rgn/configurations/CASP12.config -d CASP12/ -g 0 -f 0.9 -p

I am trying to run for predictions.

alquraishi commented 5 years ago

Hmmm. Would you mind summarizing your directory structure? Is the data directory inside of RGN12? If so then RGN12 is your base directory and not CASP12. I.e. if this is what you have:

RGN12/runs/CASP12/ProteinNet12Thinning90/...
RGN12/data/ProteinNet12Thinning90/...

then I would pass RGN12 to -d and not CASP12. Also, try using the configuration that's in RGN12/runs/CASP12/ProteinNet12Thinning90/ just in case, although that shouldn't really matter.

Are you able to train new models from scratch and only prediction is not working? Or are you unsure? Also can you include the output from error.log?

Thanks!
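
The layout described above can be checked before launching protling.py. This is a minimal sketch, assuming only what is stated in this comment (a base directory holds `runs/` and `data/` subdirectories); `looks_like_base_directory` is a hypothetical helper, not part of rgn.

```python
import os


def looks_like_base_directory(path):
    """True if path contains both a 'runs' and a 'data' subdirectory."""
    return all(os.path.isdir(os.path.join(path, sub)) for sub in ("runs", "data"))
```

For example, `looks_like_base_directory("RGN12")` should be True for the directory that is passed to -d in the layout above.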

rowancallahan commented 5 years ago

Hi @alquraishi I renamed RGN12 to CASP12 so my directory structure is

CASP12/runs/CASP12/ProteinNet12Thinning90/...
CASP12/data/ProteinNet12Thinning90/...

I tried renaming my folders back to RGN12 and rerunning with my directory structure as

RGN12/runs/CASP12/ProteinNet12Thinning90/...
RGN12/data/ProteinNet12Thinning90/...

I also tried using the configuration file that was listed in RGN12/runs/CASP12/ProteinNet12Thinning90/configuration.

Here are the sanity checks that I have performed so far.

After redownloading the RGN12 data and looking through the downloaded dataset, it seems that some predictions are already made in the RGN12/runs/CASP12/ProteinNet12Thinning90/... folder. I checked which files were already created and which folders were being predicted. Before running any predictions, all folders except the folder named "1" contain an error.log file and an OutputsValidation subfolder which contains a list of .tertiary files.

However, after changing the SampleValidationGlob and trying to run prediction for a different protein, the .tertiary files are not updated or changed. It seems like the model runs fine, and it appears to train. However, my current end goal is to take a large batch of PSSMs and MSAs and predict protein structures for visualization in PyMol.

Finally, I tried deleting all of the numbered folders in RGN12/runs/CASP12/ProteinNet12Thinning90/... After doing this and rerunning, I find that only folder 11 is recreated and that folder 11 now contains .tertiary and .recurrent_states files with no error.log file. Are these the novel predictions? Is it possible to construct a 3D structure for visualization using these files?

alquraishi commented 5 years ago

Yes, you will generally see new predictions saved in the highest-numbered folder, because that's where the checkpoint is (i.e. the training iteration of the model that is loaded when you try to make predictions). And yes, the .tertiary files contain the backbone coordinates of the newly predicted proteins. The triplets are x,y,z coordinates, and they alternate between the three backbone atoms (C_alpha, N, and C').
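
The alternation described above can be undone with simple slicing once the coordinates are in memory. The following is a minimal sketch, assuming the coordinates have already been read into a list of (x, y, z) triplets in file order; reading the .tertiary file itself is not covered, `split_backbone` is a hypothetical helper, and the atom order is taken from the comment above and is worth verifying against a known structure.

```python
# Atom order as stated in the comment above; treat it as an assumption to verify.
ATOM_ORDER = ("C_alpha", "N", "C'")


def split_backbone(coords, atom_order=ATOM_ORDER):
    """Group a flat list of (x, y, z) triplets into one list per backbone atom type."""
    step = len(atom_order)
    if len(coords) % step != 0:
        raise ValueError("number of triplets is not a multiple of %d" % step)
    # Every step-th triplet, starting at offset i, belongs to atom_order[i].
    return {atom: list(coords[i::step]) for i, atom in enumerate(atom_order)}
```

Passing a different `atom_order` tuple covers the case where the file turns out to use another ordering.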

FACEkimi commented 5 years ago

Hi~ I have a few questions:

  1. The Usage says "This predicts the structures of the dataset specified in the configuration file. By default only the validation set is predicted, but this can be changed using the -e option." So if I want to predict the test set, is it OK to write "python protling.py [configurationFilePath] -d [baseDirectory] -e TESTING_MODEL"?
  2. If I want to compute the standard dRMSD you use for reporting accuracies, how can I do that? Is it OK to use the tertiary structures in the testing set and the output predictions for this?

THANKS!

alquraishi commented 5 years ago

Hi @FACEkimi, and sorry for the delay.

  1. Yes, except you need to use -e weighted_testing and not -e TESTING_MODEL.
  2. Yes that is the set I compute the dRMSD on.

FACEkimi commented 5 years ago

Thanks for your reply~ But I have another question. I tried to use the output predictions and the PDB files to compute dRMSD, but I found that for almost all proteins the number of output values is not the same as the number in the PDB file. For example, for 1AEP, the backbone (CA+C'+N) in the PDB has 459*3 values (the 3 is for x,y,z), but the tertiary has 483*3; for 1DZL, the PDB has 1365*3 and the tertiary has 1515*3; and for 1HI9, the PDB has 4110*3 but the tertiary has 822*3. I don't know why?

alquraishi commented 5 years ago

Hi @FACEkimi, the PDB files may contain multiple domains, and in some instances may have missing residues that are predicted by the model, which would result in a lack of a one-to-one correspondence. My suggestion would be to use the structures in the ProteinNet data set, as they are already formatted to be matched to the predicted ones.
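
Once the predicted and reference coordinates are in one-to-one correspondence, dRMSD can be computed from pairwise distances alone, with no superposition step. The sketch below uses the common textbook definition (root mean square of the differences between all pairwise distances); it assumes two equal-length lists of (x, y, z) points for the same atoms, and the exact normalization and handling of missing residues used for the reported accuracies may differ.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Euclidean distances between all unordered pairs of points, in a fixed order."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
            for p, q in combinations(coords, 2)]


def drmsd(coords_a, coords_b):
    """Distance-based RMSD between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of points")
    dist_a = pairwise_distances(coords_a)
    dist_b = pairwise_distances(coords_b)
    if not dist_a:
        raise ValueError("need at least two points")
    squared = sum((x - y) ** 2 for x, y in zip(dist_a, dist_b))
    return math.sqrt(squared / len(dist_a))
```

Because only internal distances are compared, translating or rotating one structure does not change the result, which is why the length mismatch described above has to be resolved before the metric is meaningful.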

uoda commented 5 years ago

Hello, I am Rashid and I am doing my master's thesis on protein sequence to structure prediction. I followed the GitHub instructions, @alquraishi, and also read the previous problems here.

I am also trying to follow "Predict sequences in ProteinNet TFRecords format using a trained model". I ran the script as: python Machine_Learning/rgn-master/model/protling.py Machine_Learning/rgn-master/configurations/CASP7.config -d Machine_Learning/rgn-master/RGN7 -p -e weighted_testing. The script ran well and I did not get any error, but I do not understand where my output 3D structure is or how I can visualize it in Chimera. Would you please help me continue this work properly? Thanks