BowieHsu commented 7 years ago

Hi, dengdan, thank you for your hard work. I am trying to train the seglink model on my own dataset, and I have run into the following situation:

I only have one GPU card, a TITAN XP. I have rewritten your training scripts, but I get some warnings.
Which pretrained model should I prepare for training? I got ImageNet VGG16 checkpoints from the SSD-tensorflow project. Which part of the code should I rewrite to train from this pretrained model?

Thank you again, your work is awesome.

dengdan commented 7 years ago

Hi, BowieHsu. Actually, you don't have to modify the code for single-gpu training, only one gpu_id should be OK:

./scripts/train ${gpu_id= 0 if you have only one gpu} ${batch_size_per_gpu} ${dataset}

As for the pretrained VGG model: if you have to use the VGG16 from SSD-tensorflow, you have to make sure that the variable names of the two VGG models are the same, except for the topmost model name_scope like 'ssd_300'.
So, you need to modify two files:

nets/vgg16: change the variable names of my VGG model, which follow the Caffe VGG naming but differ from SSD-tensorflow's.

train_seglink.py: modify the init_fn at line 248:

init_fn = util.tf.get_init_fn(checkpoint_path = FLAGS.checkpoint_path, train_dir = FLAGS.train_dir, 
                      ignore_missing_vars = FLAGS.ignore_missing_vars, checkpoint_exclude_scopes = FLAGS.checkpoint_exclude_scopes)

The signature of util.tf.get_init_fn is:

def get_init_fn(checkpoint_path, train_dir, ignore_missing_vars = False, 
            checkpoint_exclude_scopes = None, model_name = None, checkpoint_model_scope = None):

So if you use a name scope like ssd_300 for the whole VGG model, you need to add the argument checkpoint_model_scope and change that line of code to:

init_fn = util.tf.get_init_fn(checkpoint_path = FLAGS.checkpoint_path, train_dir = FLAGS.train_dir, 
                          ignore_missing_vars = FLAGS.ignore_missing_vars, checkpoint_exclude_scopes = FLAGS.checkpoint_exclude_scopes, checkpoint_model_scope = 'ssd_300')
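
To illustrate what a scope argument like checkpoint_model_scope is for, here is a minimal sketch of the name mapping involved. This is not the actual code of util.tf.get_init_fn; the function build_restore_map and the scope name my_model are hypothetical, and only 'ssd_300' comes from the discussion above. The idea is that each variable in the model is looked up in the checkpoint under the checkpoint's top-level scope instead of the model's own.

```python
def build_restore_map(model_var_names, model_scope, checkpoint_scope):
    """Map checkpoint variable names to model variable names.

    Only the top-level scope is swapped; the rest of each name must
    already match between the two models. Hypothetical helper.
    """
    prefix = model_scope + "/"
    restore_map = {}
    for name in model_var_names:
        if not name.startswith(prefix):
            # Variables outside the model scope (e.g. global_step) are skipped.
            continue
        suffix = name[len(prefix):]
        restore_map[checkpoint_scope + "/" + suffix] = name
    return restore_map


# 'my_model' is an assumed scope name used only for this example.
model_vars = [
    "my_model/conv1/conv1_1/weights",
    "my_model/conv1/conv1_1/biases",
    "global_step",
]
restore_map = build_restore_map(model_vars, "my_model", "ssd_300")
for ckpt_name, model_name in sorted(restore_map.items()):
    print(ckpt_name, "->", model_name)
```

If the names below the top-level scope differ as well, swapping the scope alone is not enough, which is why the variable names in nets/vgg16 also have to be changed.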

BowieHsu commented 7 years ago

Thanks. Currently I am trying to fine-tune your model with these steps:

1. Downloaded the ICDAR 2015 dataset (1000 image files and text files) and converted it into a tfrecords file using icdar2015_to_tfrecords.py.
2. Ran train.sh with the tfrecords produced in step 1, and got the error "Out of range, FIFOQueue is closed and has insufficient elements (requested 1, current size 0)".

I guess I should modify the tfrecords generation config or the Slim.Dataprovider config. What is your opinion?

dengdan commented 7 years ago

Could you please give a more detailed error description?

BowieHsu commented 7 years ago

Hi, dengdan, I have solved my problem, and now I am trying to modify seglink for my project's requirements. Many thanks, you can close the issue.

dengdan commented 7 years ago

Good!

sdsy888 commented 7 years ago

Hi, @dengdan. When converting the initialization TF file using this script https://github.com/dengdan/seglink/blob/master/caffe_to_tensorflow.py and this model VGG_coco_SSD_512x512_iter_360000.caffemodel, do I need any other files?

Because when I try to convert the VGG-SSD-COCO model to reproduce the ICDAR2015 training procedure, I get the error below:

Traceback (most recent call last):
  File "caffe_to_tensorflow.py", line 113, in <module>
    check_var(name)
  File "caffe_to_tensorflow.py", line 108, in check_var
    np.testing.assert_almost_equal(actual = np.mean(caffe_weights), desired = np.mean(tf_weights.eval(session)))
AttributeError: 'NoneType' object has no attribute 'eval'

It seems this problem has something to do with this step in the script: # check all vgg and extra layer weights/biases have been converted in a right way.

Do you know what's wrong here? And how can I solve this problem? Thank you!

sdsy888 commented 7 years ago

Also, I tried another way, following what you wrote in https://github.com/dengdan/seglink/blob/master/train_seglink.py:

If there are checkpoints in train_dir, this config will be ignored.

So I put your checkpoint (model.ckpt-136750) in my train_dir, but it still doesn't work. Looking forward to your reply. Thank you.

BowieHsu commented 7 years ago

@sdsy888 I have trained the model successfully. Could you supply some figures showing the situation?

sdsy888 commented 7 years ago

Sure, thank you. Please see below:

[Screenshot: 2017-09-08 08-34-12]

This is what happens when I try to convert the Caffe model to a TensorFlow initialization file. I do get three files, but those files are apparently not complete.

[Screenshot: 2017-09-08 08-39-01]

Besides, when I try to use the existing TF checkpoint file to initialize the network, this is what happens:

# =========================================================================== #
# Training flags:
# =========================================================================== #
{'batch_size': 18,
 'checkpoint_exclude_scopes': None,
 'checkpoint_path': '/home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink',
 'dataset_dir': '/home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR',
 'dataset_name': 'icdar2015',
 'dataset_split_name': 'train',
 'gpu_memory_fraction': -1.0,
 'ignore_missing_vars': True,
 'learning_rate': 0.0001,
 'link_cls_loss_weight': 1.0,
 'log_every_n_steps': 1,
 'max_number_of_steps': 1000000,
 'model_name': 'seglink_vgg',
 'momentum': 0.9,
 'moving_average_decay': 0.9999,
 'num_gpus': 1,
 'num_preprocessing_threads': 1,
 'num_readers': 1,
 'seg_loc_loss_weight': 1.0,
 'train_dir': '/home/neo/PycharmProjects/seglink/models/seglink_Train',
 'train_image_height': 384,
 'train_image_width': 384,
 'train_with_ignored': False,
 'using_moving_average': False,
 'weight_decay': 0.0005}

# =========================================================================== #
# seglink net parameters:
# =========================================================================== #
'max_neg_pos_ratio=3'
'num_links=27660'
'seg_loc_loss_weight=1.0'
'link_conf_threshold=0.5'
'num_clones=1'
"gpus=['/gpu:0']"
'prior_scaling=[0.2, 0.5, 0.2, 0.5, 20.0]'
'max_height_ratio=1.5'
'train_with_ignored=False'
'image_shape=(384, 384)'
"feat_layers=['conv4_3', 'fc7', 'conv6_2', 'conv7_2', 'conv8_2', 'conv9_2']"
'anchor_offset=0.5'
'default_anchors=[[   4.    4.   12.   12.]\n [  12.    4.   12.   12.]\n [  20.    4.   12.   12.]\n ..., \n [ 288.   96.  288.  288.]\n [  96.  288.  288.  288.]\n [ 288.  288.  288.  288.]]'
'__file__=/home/neo/PycharmProjects/seglink/config.pyc'
'batch_size=18'
'batch_size_per_gpu=18'
'link_cls_loss_weight=1.0'
'__name__=config'
'anchor_scale_gamma=1.5'
'data_format=NHWC'
"clone_scopes=['clone_0']"
'num_anchors=3073'
'seg_conf_threshold=0.5'

# =========================================================================== #
# Training | Evaluation dataset files:
# =========================================================================== #
['/home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR/icdar2015_train.tfrecord']

INFO:tensorflow:Fine-tuning from /home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink. Ignoring missing vars: True
########

2017-09-08 08:54:50.487570: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open /home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink: Failed precondition: /home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink: perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "train_seglink.py", line 282, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train_seglink.py", line 278, in main
    train(train_op)
  File "train_seglink.py", line 253, in train
    ignore_missing_vars = FLAGS.ignore_missing_vars, checkpoint_exclude_scopes = FLAGS.checkpoint_exclude_scope)
  File "/home/neo/pylib/src/util/tf.py", line 122, in get_init_fn
    ignore_missing_vars=ignore_missing_vars)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 638, in assign_from_checkpoint_fn
    reader = pywrap_tensorflow.NewCheckpointReader(model_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 110, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink: Failed precondition: /home/neo/PycharmProjects/seglink/models/coco/SSD_512x512/seglink: perhaps your file is in a different file format and you need to use a different restore operator?

I'm not familiar with TensorFlow, so maybe I have made some basic mistakes. Please point them out if you know what these problems are about.

Thank you so much!

BTW, I've found that Xiang Bai's team has released their code (though some parts are packed in .so format). I'm working on it.

dengdan commented 7 years ago

caffe_to_tensorflow: it's my fault. The code at https://github.com/dengdan/seglink/blob/master/caffe_to_tensorflow.py#L107 is used to make sure that the conversion works, by comparing the values from the caffemodel and the TF model. However, I abandoned the conv10_* layers in later training. So remove them from the variable layers_to_convert at https://github.com/dengdan/seglink/blob/master/caffe_to_tensorflow.py#L73 .
If you want to train from an existing checkpoint, put it in your train_dir, and create a checkpoint file for it.
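
For the "create a checkpoint file" step, here is a minimal sketch, assuming the usual text format of the state file that TensorFlow's Saver writes next to its checkpoints (a file literally named checkpoint with a model_checkpoint_path entry). The helper write_checkpoint_state is hypothetical and not part of this repo; compare the result with a checkpoint file written by your own TensorFlow version before relying on it.

```python
import os
import tempfile


def write_checkpoint_state(train_dir, checkpoint_name):
    """Write a minimal 'checkpoint' state file into train_dir.

    checkpoint_name is the checkpoint prefix, e.g. "model.ckpt-136750",
    not the name of one of the .index/.meta/.data files.
    Hypothetical helper, assuming the plain-text state file format.
    """
    state_path = os.path.join(train_dir, "checkpoint")
    with open(state_path, "w") as f:
        f.write('model_checkpoint_path: "%s"\n' % checkpoint_name)
        f.write('all_model_checkpoint_paths: "%s"\n' % checkpoint_name)
    return state_path


# Demo in a temporary directory; in practice train_dir is your real train_dir.
demo_dir = tempfile.mkdtemp()
state_file = write_checkpoint_state(demo_dir, "model.ckpt-136750")
with open(state_file) as f:
    contents = f.read()
print(contents)
```

The checkpoint data files themselves still have to be copied into the same directory; the state file only tells TensorFlow which prefix to restore.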

sdsy888 commented 7 years ago

Thank you for your reply. I succeeded in converting the caffemodel. But when I try to train the network using your method, it shows this:

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/neo/PycharmProjects/seglink/models/seglink_Train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, /home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR/icdar2015_train.tfrecord
     [[Node: icdar2015_data_provider/parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](icdar2015_data_provider/parallel_read/TFRecordReaderV2, icdar2015_data_provider/parallel_read/filenames)]]
INFO:tensorflow:global_step/sec: 0
2017-09-08 11:20:20.108605: W tensorflow/core/framework/op_kernel.cc:1158] Out of range: FIFOQueue '_3_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
     [[Node: clone_0/fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](prefetch_queue/fifo_queue)]]
2017-09-08 11:20:20.108671: W tensorflow/core/framework/op_kernel.cc:1158] Out of range: FIFOQueue '_3_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
     [[Node: clone_0/fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](prefetch_queue/fifo_queue)]]
2017-09-08 11:20:20.108695: W tensorflow/core/framework/op_kernel.cc:1158] Out of range: FIFOQueue '_3_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)

......

Traceback (most recent call last):
  File "train_seglink.py", line 282, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train_seglink.py", line 278, in main
    train(train_op)
  File "train_seglink.py", line 267, in train
    session_config = sess_config
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 759, in train
    sv.saver.save(sess, sv.save_path, global_step=sv.global_step)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1063, in _single_operation_run
    target_list_as_strings, status, None)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: /home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR/icdar2015_train.tfrecord
     [[Node: icdar2015_data_provider/parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](icdar2015_data_provider/parallel_read/TFRecordReaderV2, icdar2015_data_provider/parallel_read/filenames)]]

The data files were generated using the scripts in the dir /dataset. Do you have any idea what this error means?

Thank you!

dengdan commented 7 years ago

tensorflow.python.framework.errors_impl.NotFoundError: /home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR/icdar2015_train.tfrecord

Do you have icdar2015_train in this directory?

sdsy888 commented 7 years ago

Yes, I converted the icdar2015_train images in that dir to tfrecord using your code in dataset.

dengdan commented 7 years ago

Are you sure? Execute the command in your terminal to make sure of it:

ls /home/neo/Dataset/ICDAR2015/TextLocalization/ICDAR/icdar2015_train.tfrecord

sdsy888 commented 7 years ago

My fault. The TF data file generation code outputs icdar2015_train_tfrecord, while the network tries to find icdar2015_train.tfrecord.
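
A small pre-flight check along these lines would catch this kind of filename mismatch before the queue runners start and fail with the FIFOQueue error. It is only a sketch: find_tfrecord is a hypothetical helper, not part of this repo, and the demo runs against a temporary directory rather than the real dataset path.

```python
import os
import tempfile


def find_tfrecord(dataset_dir, expected_name):
    """Return (path, []) if the expected file exists, else (None, near_misses).

    near_misses lists files in dataset_dir whose names start with the same
    stem, which makes typos such as '_tfrecord' vs '.tfrecord' easy to spot.
    """
    expected_path = os.path.join(dataset_dir, expected_name)
    if os.path.isfile(expected_path):
        return expected_path, []
    stem = os.path.splitext(expected_name)[0]
    near_misses = sorted(
        os.path.join(dataset_dir, entry)
        for entry in os.listdir(dataset_dir)
        if entry.startswith(stem)
    )
    return None, near_misses


# Demo: reproduce the mismatch from this thread in a temporary directory.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "icdar2015_train_tfrecord"), "w").close()

found, near_misses = find_tfrecord(demo_dir, "icdar2015_train.tfrecord")
if found is None:
    print("Expected file is missing; similar files:", near_misses)
```

Renaming the generated file to the name the training script expects resolves the mismatch.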

Heterfire commented 6 years ago

# =========================================================================== #
# Training | Evaluation dataset files:
# =========================================================================== #
['/home/heter/dataset/SSD-tf/ICDAR/icdar2015_train.tfrecord']

INFO:tensorflow:Fine-tuning from None. Ignoring missing vars: True

……
    reader = pywrap_tensorflow.NewCheckpointReader(model_path)
  File "/home/heter/anaconda3/envs/pixel_link/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 110, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/home/heter/anaconda3/envs/pixel_link/lib/python2.7/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
    (bytes_or_text,))
TypeError: Expected binary or unicode string, got None

@dengdan

dengdan / seglink

How to train on my own datasets? #3
