Training gets stuck in the generator - Githubissues

argman / EAST

A tensorflow implementation of EAST text detector

GNU General Public License v3.0

Training gets stuck in the generator #303

Open luhgit opened 5 years ago

luhgit commented 5 years ago

Hi,

I am training the EAST model using the following command on my own images:

python multigpu_train.py --gpu_list=0 --input_size=512 --batch_size_per_gpu=14 --checkpoint_path=tmp/east_icdar2015_resnet_v1_50_rbox/ --text_scale=512 --training_data_path=data/train/ --geometry=RBOX --learning_rate=0.0001 --num_readers=24 --pretrained_model_path=tmp/resnet_v1_50.ckpt

The problem is that it never reaches the training stage: it gets stuck in the generator, or more precisely in the get_batch() function.

Here is the output from console:

Use standard file APIs to check for files with this prefix.
step 0
Generator use 10 batches for buffering, this may take a while, you can tune this yourself.
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/

It does not move forward from here. I checked the code of the get_batch() function and found that it stays in the else branch (marked with a comment in the code below) forever.

def get_batch(num_workers, **kwargs):
    try:
        enqueuer = GeneratorEnqueuer(generator(**kwargs), use_multiprocessing=True)
        print('Generator use 10 batches for buffering, this may take a while, you can tune this yourself.')
        enqueuer.start(max_queue_size=10, workers=num_workers)
        generator_output = None
        while True:
            while enqueuer.is_running():
                if not enqueuer.queue.empty():
                    generator_output = enqueuer.queue.get()
                    break
                else:
                    # The control comes here but never get out of here!
                    time.sleep(0.01)
            yield generator_output
            generator_output = None
    finally:
        if enqueuer is not None:
            enqueuer.stop()

My CPU is almost idle:

Processes: 496 total, 3 running, 1 stuck, 492 sleeping, 2512 threads                            17:24:51
Load Avg: 1.49, 1.96, 2.31  CPU usage: 8.81% user, 9.74% sys, 81.43% idle

I am using TensorFlow 1.13.2 and OpenCV 4 on a MacBook Pro.

Has anyone else faced the same problem? If so, how did you fix it?

Thanks!
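
A minimal sketch for diagnosing a hang like this one: the polling loop in get_batch() waits indefinitely while the queue stays empty, so wrapping the wait in a deadline at least turns a silent stall into an error. The helper name get_with_timeout and its parameters are hypothetical and not part of the EAST repository; the sketch assumes only that the queue raises queue.Empty from get_nowait(), as the standard library queues do.

```python
import queue
import time


def get_with_timeout(q, is_running, timeout_s=60.0, poll_s=0.01):
    """Poll q the way get_batch() does, but give up after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while is_running():
        try:
            return q.get_nowait()
        except queue.Empty:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    "no batch arrived within %.2f s; workers may be stuck" % timeout_s
                )
            time.sleep(poll_s)
    raise RuntimeError("producer stopped before yielding a batch")


# Stand-in for the enqueuer: a queue that never receives any data.
empty_q = queue.Queue(maxsize=10)
try:
    get_with_timeout(empty_q, is_running=lambda: True, timeout_s=0.05)
except TimeoutError as exc:
    print("diagnosed:", exc)
```

In get_batch() the same idea would replace the inner while loop, passing enqueuer.queue and enqueuer.is_running. This only reports the stall; it does not explain why the workers produce nothing.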

ghost commented 5 years ago

3 training images are not enough. Use 10+ images, because you use 10 batches for buffering.

luhgit commented 5 years ago

I am now using 16 images and the problem still persists.

Generator use 10 batches for buffering, this may take a while, you can tune this yourself.
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/

ghost commented 5 years ago

Are you using a GPU or a CPU?

luhgit commented 5 years ago

I am using only the CPU, because I have a built-in Intel graphics card, which I guess is not supported by TensorFlow.

ghost commented 5 years ago

Training even one epoch on a CPU will take a very long time. Use Google Colab with a GPU.

luhgit commented 5 years ago

How would you suggest running this GitHub project in Google Colab? Right now I am running it from the terminal, since it takes command-line arguments.

ghost commented 5 years ago

%cd /content
!git clone https://github.com/argman/EAST

%cd /content/EAST
!python eval.py --test_data_path=/training_samples/ --gpu_list=0 --checkpoint_path=MYPATH \
  --output_dir=/tmp/

ghost commented 5 years ago

There's no problem running this code in Colab.

ghost commented 5 years ago

When you want to freeze the model to a .pb file after training, ask me and I will explain how to do this in Colab.

luhgit commented 5 years ago

Oh, perfect! Thank you very much for the hint. I will try training it there; hopefully it will not have the problem I had on my local machine. I will get back to you after I run it there!

ghost commented 5 years ago

OK, don't forget to change the runtime type to GPU in the menu.
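
A rough way to confirm that the runtime really has a GPU before starting a long training run is sketched below. It assumes an NVIDIA GPU whose driver tools put nvidia-smi on the PATH, which is an assumption about the environment rather than something stated in this thread; the function name nvidia_gpu_visible is made up for illustration.

```python
import shutil
import subprocess


def nvidia_gpu_visible():
    """Return True if nvidia-smi is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # no NVIDIA tooling found, so probably a CPU-only runtime
    try:
        result = subprocess.run(
            [exe],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# With TensorFlow 1.x installed, tf.test.is_gpu_available() is a more
# direct check of what TensorFlow itself can see.
print("GPU visible:", nvidia_gpu_visible())
```

If this prints False after switching the runtime type, the change most likely did not take effect.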

luhgit commented 5 years ago

Oh yeah, I almost forgot! I am now running the training in Colab, and the problem has disappeared! You were right, it was a matter of speed on CPU vs. GPU. Once the training is complete, how do I preserve the model for later prediction?

ghost commented 5 years ago

1. First you need to save your trained checkpoint files

To do this, replace your eval.py with this file: https://yadi.sk/d/B2qL9iYpDvDoBA. Change line 154 of the new eval.py as needed.

1.1 Run something like this in Colab:

!python eval.py --test_data_path="/PATH TO .jpg IMAGES" --gpu_list=0 --checkpoint_path="/PATH TO CHECKPOINT FILES/" \
  --output_dir="/content/EAST/test_result"

2. To freeze the saved files (see line 154 in eval.py), use this file, https://yadi.sk/d/FAALJEEk6tQWpQ, like this:

!python "/FULL PATH TO FILE.freeze.py" --model_dir="/content/EAST/saved" --output_node_names="feature_fusion/Conv_7/Sigmoid,feature_fusion/concat_3"

ghost commented 5 years ago

3. See https://github.com/spmallick/learnopencv/tree/master/TextDetectionEAST

ghost commented 5 years ago

Or download TextDetection.py (https://yadi.sk/d/72iA8zmoX8Ffvw) and run:

%cd /content/
!python "/content/drive/My Drive/TextDetection.py" --input "/content/test.jpg" \
  --thr=0.5 \
  --nms=0.5 \
  --model "/content/EAST/saved/frozen_model.pb" \
  --width=512 \
  --height=512

ghost commented 5 years ago

You'll get out.jpg in the same folder as test.jpg; just press refresh.

ghost commented 5 years ago

And lastly, some of my training images:

https://yadi.sk/i/4iLMlOMXonW9Pg https://yadi.sk/i/vK2-03MOx2iuYQ https://yadi.sk/i/uwCIjoQZF2HgkQ

https://yadi.sk/d/o38Voy7qNAxMkw (154 MB). Good luck!

luhgit commented 5 years ago

Thank you very much for helping me with this. I am looking forward to the training finishing so that I can try your suggestions!

ghost commented 5 years ago

You're welcome

ghost commented 5 years ago

Oh my God I can't stop posting, somebody kill me

SpringRainLu commented 4 years ago

System information (version)
OpenCV => 4.12
Operating System / Platform => Windows 64 Bit
Compiler => PyCharm 2018 CE

Detailed description
I tried to run text detection.py with my own EAST model, but it failed at "outs = net.forward(outNames)":

cv2.error: OpenCV(4.1.1) .\opencv-python\opencv\modules\dnn\src\dnn.cpp:525: error: (-2:Unspecified error) Can't create layer "resnet_v1_50/conv1/BatchNorm/FusedBatchNormV3" of type "FusedBatchNormV3" in function 'cv::dnn::dnn4_v20190621::LayerData::getLayerInstance'

I saved my model like this:

output_graph = "frozen_east_model_02.pb"
output_graph_def = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ["feature_fusion/Conv_7/Sigmoid", "feature_fusion/concat_3"])
tf.train.write_graph(output_graph_def, ".", output_graph, as_text=False)

I tried modifying model.py (https://github.com/argman/EAST/blob/master/model.py, line 150), but it did not work:

c1_1 = slim.conv2d(tf.concat([g[i-1], f[i]], axis=3), num_outputs[i], 1)
pi2 = 0.5 * np.pi
angle_map = (slim.conv2d(g[3], 1, 1, activation_fn=tf.nn.sigmoid, normalizer_fn=None) - 0.5) * pi2 # angle is between [-45, 45]
F_geometry = tf.concat([geo_map, angle_map], axis=3)

@SmallDonkey

ghost commented 4 years ago

Use tensorflow==1.14

zzcqinag commented 4 years ago

I followed step 2, but this error happened: AssertionError: feature_fusion/Conv_7/Sigmoid is not in graph. How can I solve the problem? Thank you! @SmallDonkey tensorflow==1.14.0

zzcqinag commented 4 years ago

I have the same problem as SpringRainLu above. Did you solve it?

hmen97 commented 3 years ago

For the step 2 error reported above (AssertionError: feature_fusion/Conv_7/Sigmoid is not in graph), see:

https://github.com/argman/EAST/issues/277#issuecomment-507749717
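
When a freeze script reports that an output node "is not in graph", one way to investigate is to list the node names the graph actually contains and look for close matches. The sketch below is only an illustration: the helper find_output_candidates and the example names (including the model_0/ prefix) are hypothetical, and the real names have to be read from your own graph.

```python
def find_output_candidates(node_names, keywords=("Sigmoid", "concat")):
    """Return node names whose last path component contains any keyword."""
    hits = []
    for name in node_names:
        leaf = name.rsplit("/", 1)[-1]
        if any(k in leaf for k in keywords):
            hits.append(name)
    return hits


# In TensorFlow 1.x the names would come from the loaded graph, e.g.:
#   node_names = [n.name for n in sess.graph_def.node]
# Here a made-up list stands in for a real graph.
example_names = [
    "model_0/feature_fusion/Conv_7/Sigmoid",
    "model_0/feature_fusion/concat_3",
    "model_0/resnet_v1_50/conv1/Relu",
]
print(find_output_candidates(example_names))
```

If the matches carry an extra scope prefix, or a different numeric suffix than the names passed to --output_node_names, that mismatch would explain the assertion.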

LANDDKPLA commented 5 months ago

Solved by replacing these lines in icdar.py.

        # enqueuer = GeneratorEnqueuer(generator(**kwargs), use_multiprocessing=True)
        # enqueuer.start(max_queue_size=10, workers=num_workers)
        enqueuer = GeneratorEnqueuer(generator(**kwargs), use_multiprocessing=False)
        enqueuer.start(max_queue_size=10, workers=1)
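
The workaround above swaps the worker processes for a single background thread. As a rough illustration of that pattern (not the repository's GeneratorEnqueuer, whose internals are not shown in this thread), here is a standard-library sketch in which one daemon thread fills a bounded queue and the consumer blocks on it. The names prefetch and fake_batches are made up for the example.

```python
import queue
import threading


def prefetch(generator, max_queue_size=10):
    """Yield items from generator while one daemon thread fills a bounded queue."""
    q = queue.Queue(maxsize=max_queue_size)
    done = object()  # sentinel marking the end of the stream

    def worker():
        try:
            for item in generator:
                q.put(item)  # blocks while the queue is full
        finally:
            q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()  # blocks until the worker produces something
        if item is done:
            return
        yield item


def fake_batches(n):
    """Stand-in for the real data generator."""
    for i in range(n):
        yield {"step": i}


batches = list(prefetch(fake_batches(3)))
print(batches)
```

Because the consumer blocks on q.get() instead of polling, there is no sleep loop to get stuck in. The trade-off is that a single thread prepares batches serially, so data loading may become the bottleneck on a fast GPU.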