Speed up! - Githubissues

ChulhoonJang commented 7 years ago

I use an NVIDIA Quadro M2000M GPU to run multi-class inference with the FCN model.

The input image is 320 x 160 and the output is the same size, but with 4 channels. (sample video)

In my computing environment, inference takes 200 ms per image, but I want to speed it up by at least a factor of 4.

Are there any ideas for accelerating the inference process?

MarvinTeichmann commented 7 years ago

Yes, you get a factor of two if you discard the layers fc6 and fc7 and build the FCN on top of pool5. This reduces the field of view, but for road segmentation the performance (maxF1) impact is negligible.

Using pool5 as the fcn input is already implemented; you just need to use KittiSeg_VGG.json. The option ['arch']['fcn_in'] defines the layer to be used as fcn input. The current default is the faster pool5 layer. It gives the best speed / performance trade-off in my experiments.
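As a rough illustration of the option mentioned above, the sketch below writes a minimal hypes-style JSON file and switches ['arch']['fcn_in']. Only the key path comes from the comment; the file name hypes_example.json and the string values 'fc7' and 'pool5' are assumptions and may not match the actual KittiSeg_VGG.json.

```python
import json
import os
import tempfile


def set_fcn_input(hypes_path, layer):
    """Set hypes['arch']['fcn_in'] in a JSON hypes file and return the dict."""
    with open(hypes_path) as f:
        hypes = json.load(f)
    hypes.setdefault('arch', {})['fcn_in'] = layer
    with open(hypes_path, 'w') as f:
        json.dump(hypes, f, indent=2)
    return hypes


# Hypothetical minimal hypes file; the real config has many more keys.
example_path = os.path.join(tempfile.mkdtemp(), 'hypes_example.json')
with open(example_path, 'w') as f:
    json.dump({'arch': {'fcn_in': 'fc7'}}, f)

updated = set_fcn_input(example_path, 'pool5')
print(updated['arch']['fcn_in'])
```

Changing the option only selects the architecture; as discussed further down the thread, a model still has to be trained with that setting.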

For other speed-ups, follow the TensorFlow performance guide. Honorable mentions go to the following things:

Use the newest CUDA / cuDNN version
Compile from source with SSE4.2/AVX/FMA optimizations enabled
Use input queues
Use NCHW data processing

All of this together can give you another factor of 1.5 - 2. So an overall improvement of just under a factor of 4 can be achieved.
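The arithmetic behind that estimate can be sketched as follows, assuming the individual factors simply multiply (which real systems only approximate). The 200 ms baseline is the figure from the question; the function name is made up for this example.

```python
def combined_speedup(baseline_ms, factors):
    """Multiply independent speed-up factors and return (total, new time in ms)."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total, baseline_ms / total


baseline_ms = 200.0  # per-image time reported in the question

# Factor 2 from using pool5, times 1.5 to 2 from the TensorFlow-level tweaks.
low, low_ms = combined_speedup(baseline_ms, [2.0, 1.5])
high, high_ms = combined_speedup(baseline_ms, [2.0, 2.0])

print("speed-up between {:.1f}x and {:.1f}x".format(low, high))
print("per-image time between {:.1f} ms and {:.1f} ms".format(high_ms, low_ms))
```

Under these assumptions the 200 ms baseline lands somewhere between 50 ms and about 67 ms per image.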

Lastly, if it is an option, you can use batched input (even for inference). Batching can give you a very large speed increase. (This depends on the GPU model: the more stream processors it has, the more data can be processed in parallel.) I don't use batched inference, as it is not a fair benchmark for robotics applications. In real-time apps it can be assumed that input images become available one at a time (e.g. from a camera), so batching is not an option there.

ChulhoonJang commented 7 years ago

@MarvinTeichmann, thank you for your insightful advice.

I have some questions

For discarding fc6 and fc7 layers, should I train a new FCN model again?
Is there any performance degradation due to discarding the layers?

As you said, in my case, I use a video camera in real-time apps, so batching is not an option unfortunately.

MarvinTeichmann commented 7 years ago

For discarding fc6 and fc7 layers, should I train a new FCN model again?

Yes, you need to train a new model without these layers.

Is there any performance degradation due to discarding the layers?

As I said, removing these layers reduces the field of view of the model. On road data the performance impact is negligible. You will need to try it yourself on your data to see how it behaves.

I should probably make pool5 the default fcn input. In my experiments it offers the best performance / speed trade-off. So I highly recommend trying it.

ChulhoonJang commented 7 years ago

@MarvinTeichmann. Thanks again. I will try and share the result later!

obendidi commented 7 years ago

I'm using pool5 as the fcn input and got it working at 3 fps (I didn't do any optimization or use input queues). The results were not that much different from using fc6/fc7 as fcn input (sometimes even better).

villanuevab commented 7 years ago

Using pool5, batch size of 1, TF from sources with XLA enabled, and default input resolution of (384, 1248), I am getting an inference speed of approximately 700ms, < 2fps, on an NVIDIA Jetson device. I am using a python script to run inference (versus compiling my inference script into an executable).

obendidi commented 7 years ago

I'm using a GTX 1080 for inference and running from a Python script too. Averaged over 100 predictions, I got around 2.7 fps (using TF 1.0.1 with CUDA and cuDNN). Image size was (600, 1200).

MarvinTeichmann commented 7 years ago

Those numbers sound rather low to me. Are you guys only measuring inference (e.g. sess.run), or also the creation of the overlay (seg.make_overlay(image, output_image))? The latter is cosmetic post-processing done on the CPU.
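One way to keep the two measurements apart is to time each callable on its own. The sketch below is only an illustration: fake_inference and fake_overlay are hypothetical stand-ins (plain sleeps) for sess.run and seg.make_overlay, and the warm-up loop reflects the common observation that the first few calls to a session often include one-time setup cost.

```python
import time
import timeit


def mean_call_time(fn, repeats=100, warmup=5):
    """Return the mean wall-clock time of fn() in seconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    start = timeit.default_timer()
    for _ in range(repeats):
        fn()
    return (timeit.default_timer() - start) / repeats


def fake_inference():
    # Hypothetical stand-in for sess.run(...); sleeps instead of computing.
    time.sleep(0.002)


def fake_overlay():
    # Hypothetical stand-in for the CPU-side overlay creation.
    time.sleep(0.001)


inference_s = mean_call_time(fake_inference, repeats=20)
pipeline_s = mean_call_time(lambda: (fake_inference(), fake_overlay()), repeats=20)

print("inference only: {:.1f} ms".format(1000 * inference_s))
print("inference + overlay: {:.1f} ms".format(1000 * pipeline_s))
```

Reporting both numbers makes it clear how much of a low frame rate comes from the network and how much from post-processing.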

ChulhoonJang commented 7 years ago

I finished the training for the new model with pool5 input and the config json file was hypes.txt. (json extension is not supported, so I modified it).

The evaluation results are quite good: MaxF1 and average precision are almost the same as for the old model (fc7 model). Here are the output logs: fc7_model.txt vs. pool5_model.txt

I checked out the execution time as below.

with GPU (NVIDIA 680)
Speed (msec): 189.79954719543457
Speed (fps): 5.268716468381827

with CPU
Speed (msec) (smooth) : 312.7150
Speed (fps) (smooth) : 3.1978

Unfortunately, there is no speed improvement compared with the fc7 model. What is wrong with it?

villanuevab commented 7 years ago

@MarvinTeichmann, here is the relevant portion of inference code. As far as I can tell, we are only measuring inference time:

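import timeit

import cv2
import numpy as np

# sess, softmax_tensor, shape, CLASS_COLORS and the helper functions
# (_grab_video_feed, _blend_non_transparent) are defined elsewhere in the script
while True: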
    frame = _grab_video_feed()
    if frame is None:
        raise SystemError('Issue grabbing the frame')
    # resize to default KittiSeg input for now
    frame = cv2.resize(
        frame, (shape[1], shape[0]), interpolation=cv2.INTER_CUBIC)
    numpy_final = np.asarray(frame)
    numpy_final = np.expand_dims(numpy_final, axis=0)

    start_time = timeit.default_timer() # start timing inference
    predictions = sess.run(
        softmax_tensor, {'Inputs/fifo_queue_DequeueMany:0': numpy_final})
    time_taken = (timeit.default_timer() - start_time) # end timing inference 
    print('Took {} secs to perform inference'.format(time_taken))

    # the rest of this script concerns cosmetic post-processing
    output_image = predictions.reshape(shape[0], shape[1], -1)
    x = np.argmax(output_image, axis=2)
    segmented_img = np.zeros((shape[0], shape[1], 3), dtype=np.uint8)
    # convert output to color scheme defined by CLASS_COLORS
    for i, _ in enumerate(x):
        for j, _ in enumerate(x[i]):
            value = x[i][j]
            color_code = CLASS_COLORS[value]
            segmented_img[i][j] = color_code
    # overlay segmentation onto original image
    final_img = _blend_non_transparent(frame, segmented_img)
    # show overlayed image
    cv2.imshow('Prediction', final_img)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        sess.close()
        break
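The per-pixel colouring loop above sits outside the timed section, so it does not affect the reported inference time, but it can limit the displayed frame rate. A possible alternative, sketched under the assumption that CLASS_COLORS can be expressed as a NumPy array with one RGB row per class id (the palette below is made up), is to let NumPy index the whole class map at once:

```python
import numpy as np

# Hypothetical palette: one RGB row per class id (4 classes, as in this thread).
CLASS_COLORS = np.array(
    [[0, 0, 0], [0, 255, 0], [255, 0, 0], [0, 0, 255]], dtype=np.uint8)


def colorize_loop(class_map):
    """Per-pixel version, equivalent to the nested loops in the script above."""
    height, width = class_map.shape
    out = np.zeros((height, width, 3), dtype=np.uint8)
    for i in range(height):
        for j in range(width):
            out[i, j] = CLASS_COLORS[class_map[i, j]]
    return out


def colorize_vectorized(class_map):
    """Index the palette with the whole class map in one NumPy operation."""
    return CLASS_COLORS[class_map]


# Small random class map standing in for np.argmax(output_image, axis=2).
class_map = np.random.RandomState(0).randint(0, 4, size=(6, 8))
segmented_img = colorize_vectorized(class_map)
print(segmented_img.shape)
```

Both functions produce the same image; the vectorized one avoids running a Python-level loop over every pixel.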

obendidi commented 7 years ago

I've redone the test, measuring only inference (sess.run). In the last tests I timed the whole process, from reading the image and preprocessing to doing the predictions. I got better results this time. Here is some info:

GPU : NVIDIA GTX 1080
IMAGE SIZE : (700,1200,3)
NUM CLASSES : 4
BATCH SIZE : 1
FCN input : Pool5

Code snippet for the speed test :

import logging
import time

# sess, softmax, image_pl and image come from the surrounding script
logging.info("Testing network speed on {} images".format(100))

start_time = time.time()
for _ in range(100):
    sess.run([softmax], feed_dict={image_pl: image})
dt = (time.time() - start_time) / 100

logging.info("Network speed during Inference is :")
logging.info("\tSpeed (sec): {}".format(dt))
logging.info("\tSpeed (msec): {}".format(1000*dt))
logging.info("\tSpeed (fps): {}".format(1/dt))

results :

2017-07-24 11:01:16,179 INFO Weights loaded successfully.
2017-07-24 11:01:16,734 INFO Testing network speed on 100 images
I tensorflow/core/kernels/logging_ops.cc:79] Shape of Validation/pool5:0[1 22 38 512]
I tensorflow/core/kernels/logging_ops.cc:79] Shape of upscore2[1 44 75 4]
I tensorflow/core/kernels/logging_ops.cc:79] Shape of upscore4[1 88 150 4]
I tensorflow/core/kernels/logging_ops.cc:79] Shape of upscore32[1 700 1200 4]
2017-07-24 11:01:28,033 INFO Network speed during Inference is :
2017-07-24 11:01:28,033 INFO    Speed (sec): 0.112987201214
2017-07-24 11:01:28,033 INFO    Speed (msec): 112.987201214
2017-07-24 11:01:28,033 INFO    Speed (fps): 8.85055996836

so it's close to 9 fps

villanuevab commented 7 years ago

@bendidi I am also attempting to train and run inference using multiple classes. Can you share how you saved your TF graph for inference i.e., what graph optimizations you ran, if any? Freezing my graph yields:

Converted 35 variables to const ops.
194 ops in the final graph.

My current best guess for why my model is running so slowly is that there are excess nodes in the graph that are not needed for inference.

My changes to the source code are almost identical to @shivam-kotwalia's, here: https://github.com/shivam-kotwalia/KittiSeg/

[UPDATE: I tested the same script and .pb on a 1080Ti as well.] Avg. speed of 100 consecutive calls to sess.run() on 1080 Ti:

Inference speed (sec): 0.054193212054669856
Inference speed (fps): 18.452495471041736

Avg. speed of 100 consecutive calls to sess.run() on TX2 in NV Power Mode: MAXN:

Inference speed (sec): 0.7519816180500039
Inference speed (fps): 1.3298197402659169

Perhaps I am just reaching the limits of the TX2's performance.

obendidi commented 7 years ago

I didn't run any graph optimization, nor do I have XLA enabled (it's on my to-do list ^^).

MarvinTeichmann commented 7 years ago

@villanuevab With a Titan X Pascal I was able to get almost 25 fps on my latest run. Around 18.45 fps on a 1080 Ti sounds reasonable, especially if you did not push too hard to get the last bit of improvement out of TensorFlow.

@bendidi Your running time can largely be explained by the larger image size. The runtime is roughly linear with respect to the number of pixels in the input image. So 9 fps at your size will translate to 18 fps on kitti size input, which sounds competitive for a 1080.
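The scaling argument can be checked with a few lines of arithmetic. This sketch assumes runtime is strictly proportional to pixel count and takes the (384, 1248) default resolution mentioned earlier in the thread as the kitti size; the function name is made up for the example.

```python
def scale_fps(fps, from_hw, to_hw):
    """Rescale a frame rate assuming runtime is proportional to pixel count."""
    from_pixels = from_hw[0] * from_hw[1]
    to_pixels = to_hw[0] * to_hw[1]
    return fps * from_pixels / to_pixels


measured_fps = 8.85        # reported above for (700, 1200) input
measured_size = (700, 1200)
kitti_size = (384, 1248)   # default input resolution mentioned in this thread

pixel_ratio = (measured_size[0] * measured_size[1]) / float(kitti_size[0] * kitti_size[1])
estimate = scale_fps(measured_fps, measured_size, kitti_size)

print("pixel ratio: {:.2f}".format(pixel_ratio))
print("estimated fps at kitti size: {:.1f}".format(estimate))
```

With these inputs the pixel ratio is about 1.75, so a strictly linear model gives roughly 15.5 fps, somewhat below the round figure quoted above. Fixed per-call overhead does not scale with image size, so the real number may differ in either direction.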

obendidi commented 7 years ago

@villanuevab I've also frozen the graph and got the same output as you :

Converted 35 variables to const ops.
194 ops in the final graph.

using Validation/Validation/decoder/Softmax as output nodes

The thing is I've got slightly worse results in terms of speed using the .pb file :

2017-07-31 18:44:47,900 INFO Network speed during Inference is :
2017-07-31 18:44:47,901 INFO    Speed (sec): 0.118378000259
2017-07-31 18:44:47,901 INFO    Speed (msec): 118.378000259
2017-07-31 18:44:47,901 INFO    Speed (fps): 8.44751556716

compared to using the .data + .meta files :

2017-07-24 11:01:28,033 INFO Network speed during Inference is :
2017-07-24 11:01:28,033 INFO    Speed (sec): 0.112987201214
2017-07-24 11:01:28,033 INFO    Speed (msec): 112.987201214
2017-07-24 11:01:28,033 INFO    Speed (fps): 8.85055996836

(Tested multiple times.) Is this normal, or am I doing something wrong? Thank you

ChulhoonJang commented 7 years ago

@MarvinTeichmann I found the reason why the pool5 model was not fast.

In #L38, it seems that this code is not configurable depending on 'fcn_in' in hypes.

I found the configurable code in /decoder/fcn.py

However, I trained the pool5 model with kitti_multiloss.py, so the result did not change compared with the fc7 model.

Now I am retraining after applying the configurable code to 'kitti_multiloss.py'. I will share the result again.

Could you tell me what the difference is between fcn.py and kitti_multiloss.py?

Thanks a lot.

villanuevab commented 7 years ago

@bendidi, unfortunately, I have not done any systematic comparisons between loading the graph via .pb and (.data + .meta) files. I do not think I'll be able to any time soon, but please keep me posted on your progress and let me know if there's any other information I can provide.

MarvinTeichmann / KittiSeg

Speed up! #94