Closed BowieHsu closed 7 years ago
Hi, BowieHsu. Actually, you don't have to modify the code for single-GPU training; passing a single gpu_id should be OK:
./scripts/train ${gpu_id= 0 if you have only one gpu} ${batch_size_per_gpu} ${dataset}
As for the pretrained vgg model: if you have to use the vgg16 from SSD-tensorflow, you have to make sure that the variable names of the two vgg models are the same, except for the topmost model name_scope such as 'ssd_300'.
So, you need to modify two files:
init_fn = util.tf.get_init_fn(checkpoint_path = FLAGS.checkpoint_path, train_dir = FLAGS.train_dir,
ignore_missing_vars = FLAGS.ignore_missing_vars, checkpoint_exclude_scopes = FLAGS.checkpoint_exclude_scopes)
The signature of util.tf.get_init_fn is:
def get_init_fn(checkpoint_path, train_dir, ignore_missing_vars = False,
checkpoint_exclude_scopes = None, model_name = None, checkpoint_model_scope = None):
So you need to add an argument checkpoint_model_scope. If you use a name scope like ssd_300 for the whole vgg model, you need to change the line of code to:
init_fn = util.tf.get_init_fn(checkpoint_path = FLAGS.checkpoint_path, train_dir = FLAGS.train_dir,
ignore_missing_vars = FLAGS.ignore_missing_vars, checkpoint_exclude_scopes = FLAGS.checkpoint_exclude_scopes, checkpoint_model_scope = 'ssd_300')
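To make the idea behind checkpoint_model_scope concrete, here is a minimal sketch of the kind of name remapping it implies: the top-level scope of each model variable is swapped for the scope used in the checkpoint. This is not the actual code of util.tf.get_init_fn; the helper name checkpoint_name_map and the example variable names are assumptions made for illustration only.

```python
def checkpoint_name_map(model_var_names, model_name, checkpoint_model_scope):
    """Return {name_in_checkpoint: name_in_model}.

    Hypothetical helper: only the topmost scope is swapped, the rest of
    each variable name is assumed to be identical in both models.
    """
    prefix = model_name + "/"
    mapping = {}
    for name in model_var_names:
        if name.startswith(prefix):
            # Replace the leading model scope with the checkpoint scope.
            ckpt_name = checkpoint_model_scope + "/" + name[len(prefix):]
        else:
            # Variables outside the model scope keep their name.
            ckpt_name = name
        mapping[ckpt_name] = name
    return mapping


# Illustrative variable names, not taken from the real model.
names = [
    "seglink_vgg/conv1/conv1_1/weights",
    "seglink_vgg/conv1/conv1_1/biases",
    "global_step",
]
mapping = checkpoint_name_map(names, "seglink_vgg", "ssd_300")
for ckpt_name in sorted(mapping):
    print(ckpt_name, "->", mapping[ckpt_name])
```

If the names still differ below the top-level scope, a remapping like this is not enough and the variables would have to be renamed individually.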
Thanks. I am currently trying to fine-tune your model by following these steps:
I guess I should modify the tfrecords generation config or the Slim.Dataprovider config. What is your opinion?
Could you please give a more detailed error description?
Hi, dengdan, I have solved my problem and am now trying to modify seglink for my project's requirements. Many thanks, you can close the issue.
Good!
Hi, @dengdan. When converting the initialization TF file using this script https://github.com/dengdan/seglink/blob/master/caffe_to_tensorflow.py and this model VGG_coco_SSD_512x512_iter_360000.caffemodel, do I need any other file?
I ask because when I try to convert the VGG-SSD-COCO model to reproduce the training procedure for ICDAR2015, I get the error below:
Traceback (most recent call last):
File "caffe_to_tensorflow.py", line 113, in <module>
check_var(name)
File "caffe_to_tensorflow.py", line 108, in check_var
np.testing.assert_almost_equal(actual = np.mean(caffe_weights), desired = np.mean(tf_weights.eval(session)))
AttributeError: 'NoneType' object has no attribute 'eval'
It seems like this problem has something to do with this operation: # check all vgg and extra layer weights/biases have been converted in a right way.
Do you know what's wrong here? And how can I solve this problem? Thank you!
I also tried another way. As you wrote in https://github.com/dengdan/seglink/blob/master/train_seglink.py:
If there are checkpoints in train_dir, this config will be ignored.
So I put your checkpoint (model.ckpt-136750) in my train_dir, but it still does not work. I look forward to your reply. Thank you.
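For context on why copying model.ckpt-136750 alone may not be picked up: as far as I know, TensorFlow 1.x locates the latest checkpoint in a directory (for example via tf.train.latest_checkpoint) by reading a small text file named checkpoint that sits next to the model files. The sketch below writes such a file by hand. The helper name write_checkpoint_state is hypothetical, and the temporary directory only stands in for a real train_dir.

```python
import os
import tempfile


def write_checkpoint_state(train_dir, ckpt_basename):
    """Write a minimal `checkpoint` state file pointing at ckpt_basename.

    Hypothetical helper; the two fields follow the text format that
    TensorFlow 1.x savers write into this file.
    """
    path = os.path.join(train_dir, "checkpoint")
    with open(path, "w") as f:
        f.write('model_checkpoint_path: "%s"\n' % ckpt_basename)
        f.write('all_model_checkpoint_paths: "%s"\n' % ckpt_basename)
    return path


# Demo in a temporary directory standing in for train_dir.
demo_dir = tempfile.mkdtemp()
state_path = write_checkpoint_state(demo_dir, "model.ckpt-136750")
with open(state_path) as f:
    state_text = f.read()
print(state_text)
```

The model files themselves (model.ckpt-136750.index, model.ckpt-136750.data-*, and so on) still have to be present in the same directory.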
@sdsy888 I have trained the model successfully. Could you provide some figures showing the situation?
sure, thank you. Please see below:
This is what happens when I try to convert the caffe model to a TensorFlow initialization file. I do get three files, but those files are apparently not complete.
Besides, when I try to use the existing TF checkpoint file to initialize the network, this is what happens:
# =========================================================================== #
# Training flags:
# =========================================================================== #
{'batch_size': 18,
'checkpoint_exclude_scopes': None,
'checkpoint_path': '/home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink',
'dataset_dir': '/home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR',
'dataset_name': 'icdar2015',
'dataset_split_name': 'train',
'gpu_memory_fraction': -1.0,
'ignore_missing_vars': True,
'learning_rate': 0.0001,
'link_cls_loss_weight': 1.0,
'log_every_n_steps': 1,
'max_number_of_steps': 1000000,
'model_name': 'seglink_vgg',
'momentum': 0.9,
'moving_average_decay': 0.9999,
'num_gpus': 1,
'num_preprocessing_threads': 1,
'num_readers': 1,
'seg_loc_loss_weight': 1.0,
'train_dir': '/home/neo/PycharmProjects/seglink/models/seglink_Train',
'train_image_height': 384,
'train_image_width': 384,
'train_with_ignored': False,
'using_moving_average': False,
'weight_decay': 0.0005}
# =========================================================================== #
# seglink net parameters:
# =========================================================================== #
'max_neg_pos_ratio=3'
'num_links=27660'
'seg_loc_loss_weight=1.0'
'link_conf_threshold=0.5'
'num_clones=1'
"gpus=['/gpu:0']"
'prior_scaling=[0.2, 0.5, 0.2, 0.5, 20.0]'
'max_height_ratio=1.5'
'train_with_ignored=False'
'image_shape=(384, 384)'
"feat_layers=['conv4_3', 'fc7', 'conv6_2', 'conv7_2', 'conv8_2', 'conv9_2']"
'anchor_offset=0.5'
'default_anchors=[[ 4. 4. 12. 12.]\n [ 12. 4. 12. 12.]\n [ 20. 4. 12. 12.]\n ..., \n [ 288. 96. 288. 288.]\n [ 96. 288. 288. 288.]\n [ 288. 288. 288. 288.]]'
'__file__=/home/neo/PycharmProjects/seglink/config.pyc'
'batch_size=18'
'batch_size_per_gpu=18'
'link_cls_loss_weight=1.0'
'__name__=config'
'anchor_scale_gamma=1.5'
'data_format=NHWC'
"clone_scopes=['clone_0']"
'num_anchors=3073'
'seg_conf_threshold=0.5'
# =========================================================================== #
# Training | Evaluation dataset files:
# =========================================================================== #
['/home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR/icdar2015_train.tfrecord']
INFO:tensorflow:Fine-tuning from /home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink. Ignoring missing vars: True
########
2017-09-08 08:54:50.487570: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open /home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink: Failed precondition: /home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink: perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
File "train_seglink.py", line 282, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train_seglink.py", line 278, in main
train(train_op)
File "train_seglink.py", line 253, in train
ignore_missing_vars = FLAGS.ignore_missing_vars, checkpoint_exclude_scopes = FLAGS.checkpoint_exclude_scope)
File "/home/neo/pylib/src/util/tf.py", line 122, in get_init_fn
ignore_missing_vars=ignore_missing_vars)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 638, in assign_from_checkpoint_fn
reader = pywrap_tensorflow.NewCheckpointReader(model_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 110, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern), status)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink: Failed precondition: /home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink: perhaps your file is in a different file format and you need to use a different restore operator?
I'm not familiar with TensorFlow, so maybe I made some basic mistakes. Please point them out if you know what these problems are about.
Thank you so much!
BTW, I've found that Xiang Bai's team has released their code (though some parts are packed in .so format). I'm working on it.
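The DataLossError above is raised while opening checkpoint_path itself. One thing worth checking, offered as a guess rather than a diagnosis, is what that path actually points to: a directory, a single file, or a checkpoint prefix. In the default V2 format, TensorFlow 1.x savers write prefix.index plus one or more prefix.data-* files, and the prefix is what gets passed as the path. The helper below is a hypothetical, TensorFlow-free sketch of such a check; the demo creates empty placeholder files in a temporary directory.

```python
import glob
import os
import tempfile


def classify_checkpoint_path(path):
    """Describe what `path` looks like on disk (hypothetical helper)."""
    if os.path.isdir(path):
        return "directory"
    has_index = os.path.isfile(path + ".index")
    has_data = bool(glob.glob(path + ".data-*"))
    if has_index and has_data:
        # Looks like a V2-format checkpoint prefix.
        return "v2_prefix"
    if os.path.isfile(path):
        # One plain file; could be an older single-file checkpoint
        # or something that is not a checkpoint at all.
        return "single_file"
    return "missing"


# Demo with empty placeholder files; real checkpoints are not empty.
demo_dir = tempfile.mkdtemp()
prefix = os.path.join(demo_dir, "model.ckpt-136750")
for suffix in (".index", ".data-00000-of-00001"):
    open(prefix + suffix, "w").close()

kinds = (
    classify_checkpoint_path(demo_dir),
    classify_checkpoint_path(prefix),
    classify_checkpoint_path(prefix + "-typo"),
)
print(kinds)
```

A result other than "v2_prefix" for the path given as checkpoint_path would be a reason to look more closely at how the converted files were written.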
caffe_to_tensorflow
It's my fault.
The code in https://github.com/dengdan/seglink/blob/master/caffe_to_tensorflow.py#L107 is used to make sure that the conversion works well, by comparing the values from the caffemodel and the tf model. However, I abandoned the conv10_* layers in later training. So remove them from the variable layers_to_convert in https://github.com/dengdan/seglink/blob/master/caffe_to_tensorflow.py#L73.
train_dir, and create a checkpoint file for it.
Thank you for your reply. I succeeded in converting the caffemodel. But when I try to train the network using your method, it shows this:
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/neo/PycharmProjects/seglink/models/seglink_Train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, /home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR/icdar2015_train.tfrecord
[[Node: icdar2015_data_provider/parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](icdar2015_data_provider/parallel_read/TFRecordReaderV2, icdar2015_data_provider/parallel_read/filenames)]]
INFO:tensorflow:global_step/sec: 0
2017-09-08 11:20:20.108605: W tensorflow/core/framework/op_kernel.cc:1158] Out of range: FIFOQueue '_3_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
[[Node: clone_0/fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](prefetch_queue/fifo_queue)]]
2017-09-08 11:20:20.108671: W tensorflow/core/framework/op_kernel.cc:1158] Out of range: FIFOQueue '_3_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
[[Node: clone_0/fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](prefetch_queue/fifo_queue)]]
2017-09-08 11:20:20.108695: W tensorflow/core/framework/op_kernel.cc:1158] Out of range: FIFOQueue '_3_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
......
Traceback (most recent call last):
File "train_seglink.py", line 282, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train_seglink.py", line 278, in main
train(train_op)
File "train_seglink.py", line 267, in train
session_config = sess_config
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 759, in train
sv.saver.save(sess, sv.save_path, global_step=sv.global_step)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1063, in _single_operation_run
target_list_as_strings, status, None)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: /home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR/icdar2015_train.tfrecord
[[Node: icdar2015_data_provider/parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](icdar2015_data_provider/parallel_read/TFRecordReaderV2, icdar2015_data_provider/parallel_read/filenames)]]
The data files were generated using the scripts in the dir /dataset. Do you have any idea what this error means?
Thank you!
tensorflow.python.framework.errors_impl.NotFoundError: /home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR/icdar2015_train.tfrecord
Do you have icdar2015_train in this directory?
Yes, I converted the icdar2015_train images in that dir to tfrecord using your code in dataset.
Are you sure? Execute the command in your terminal to make sure of it:
ls /home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR/icdar2015_train.tfrecord
My fault.
The TF data file generating code outputs icdar2015_train_tfrecord, while the network tries to find icdar2015_train.tfrecord.
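Until the generating code is fixed, already generated files can simply be renamed. The following is a small sketch of that workaround; the helper name fix_tfrecord_suffix is hypothetical and the demo runs in a temporary directory rather than the real dataset directory.

```python
import os
import tempfile


def fix_tfrecord_suffix(dataset_dir, dry_run=True):
    """Rename files ending in `_tfrecord` to end in `.tfrecord`.

    Hypothetical helper. Returns the list of (old, new) names; with
    dry_run=True nothing on disk is changed.
    """
    renames = []
    for name in sorted(os.listdir(dataset_dir)):
        if name.endswith("_tfrecord"):
            fixed = name[:-len("_tfrecord")] + ".tfrecord"
            renames.append((name, fixed))
            if not dry_run:
                os.rename(os.path.join(dataset_dir, name),
                          os.path.join(dataset_dir, fixed))
    return renames


# Demo: an empty placeholder file with the wrong suffix.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "icdar2015_train_tfrecord"), "w").close()

planned = fix_tfrecord_suffix(demo_dir)               # only report
applied = fix_tfrecord_suffix(demo_dir, dry_run=False)  # actually rename
remaining = sorted(os.listdir(demo_dir))
print(planned, remaining)
```

Running it once with the default dry_run=True shows what would be renamed before anything is touched.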
Hi, dengdan, thank you for your hard work. I am trying to train the seglink model on my own dataset, and I ran into the following situation:
['/home/heter/dataset/SSD-tf/ICDAR/icdar2015_train.tfrecord']
INFO:tensorflow:Fine-tuning from None. Ignoring missing vars: True
……
reader = pywrap_tensorflow.NewCheckpointReader(model_path)
File "/home/heter/anaconda3/envs/pixel_link/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 110, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern), status)
File "/home/heter/anaconda3/envs/pixel_link/lib/python2.7/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got None
@dengdan
Thank you again, your work is awesome.
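The "Fine-tuning from None" line together with the TypeError suggests that the path handed to the checkpoint reader was None. That can happen if the checkpoint_path flag was not set, or if a lookup such as tf.train.latest_checkpoint found no checkpoint and returned None. A fail-fast guard before building the init function gives a clearer message. This is a minimal sketch with a hypothetical helper name, not code from the repository.

```python
def require_checkpoint_path(checkpoint_path):
    """Return checkpoint_path, or raise a readable error if it is unset.

    Hypothetical guard to call before restoring from a checkpoint.
    """
    if not checkpoint_path:
        raise ValueError(
            "checkpoint_path is not set or no checkpoint was found; "
            "pass the path of an existing checkpoint to fine-tune from."
        )
    return checkpoint_path


# Demo: a missing path is reported before any reader is created.
try:
    require_checkpoint_path(None)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```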