Open ChulhoonJang opened 7 years ago
Yes, you get a factor of two if you discard the layers fc6 and fc7 and build the fcn on top of pool5. This reduces the field of view, but for road segmentation the performance (maxF1) impact is negligible.
Using pool5 as fcn input is already implemented, you just need to use the KittiSeg_VGG.json. The option ['arch']['fcn_in'] defines the layer to be used as fcn input. The current default is the faster pool5 layer. It gives the best speed / performance trade-off in my experiments.
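As a minimal sketch of the option described above, the snippet below loads a hypes-style JSON document and overrides ['arch']['fcn_in']. The inline JSON string and the helper name select_fcn_input are illustrative assumptions; only the key path and the layer names pool5 and fc7 come from this thread, and the real KittiSeg_VGG.json contains many more fields.

```python
import json

# Hypothetical, heavily trimmed stand-in for a hypes file.
HYPES_TEXT = '{"arch": {"fcn_in": "fc7"}}'


def select_fcn_input(hypes, layer):
    """Return a copy of hypes whose ['arch']['fcn_in'] is set to layer."""
    updated = json.loads(json.dumps(hypes))  # cheap deep copy for JSON-only data
    updated.setdefault("arch", {})["fcn_in"] = layer
    return updated


hypes = json.loads(HYPES_TEXT)
pool5_hypes = select_fcn_input(hypes, "pool5")
print(hypes["arch"]["fcn_in"], "->", pool5_hypes["arch"]["fcn_in"])
```

Changing the value only selects which layer the decoder is built on; a model trained with one setting would presumably not be usable with the other without retraining.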
For other speed-ups, follow the TensorFlow guide. Honorable mentions go to the following things:
All of those together can give you another factor of 1.5 to 2. So an overall improvement of just under a factor of 4 can be achieved.
Lastly, if it is an option, you can use batched input (even for inference). Batching input can give you a very large speed increase. (This depends on the GPU model: the more stream processors it has, the more data can be processed in parallel.) I don't use batched inference as this is not a fair benchmark for robotics applications. In real-time apps it can be assumed that input images become available one at a time (e.g. from a camera). Batching is not an option there.
@MarvinTeichmann, thank you for your insightful advice.
I have some questions
For discarding fc6 and fc7 layers, should I train a new FCN model again?
Is there any performance degradation due to discarding the layers?
As you said, in my case, I use a video camera in real-time apps, so batching is not an option unfortunately.
For discarding fc6 and fc7 layers, should I train a new FCN model again?
Yes, you need to train a new model without these layers.
Is there any performance degradation due to discarding the layers?
As I said, removing these layers reduces the field of view of the model. On road data the performance impact is negligible. You will need to try it yourself on your data to see how it behaves.
I should probably make pool5 the default fcn input. In my experiments it offers the best performance / speed trade-off. So I highly recommend trying it.
@MarvinTeichmann. Thanks again. I will try and share the result later!
I'm using pool5 as the fcn input and got it working at 3 fps (I didn't do any optimization or use input queues). The results were not that much different from using fc6/fc7 as fcn input (sometimes even better).
Using pool5, batch size of 1, TF from sources with XLA enabled, and default input resolution of (384, 1248), I am getting an inference speed of approximately 700ms, < 2fps, on an NVIDIA Jetson device. I am using a python script to run inference (versus compiling my inference script into an executable).
I'm using a GTX 1080 for inference and running from a python script too. I averaged over 100 predictions and got around 2.7 fps (using tf 1.0.1 with cuda and cudnn). Image size was (600, 1200).
Those numbers sound rather low to me. Are you guys only measuring inference (e.g. sess.run) or also the creation of the overlay (seg.make_overlay(image, output_image))? The latter is cosmetic post-processing done on the CPU.
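To make the distinction above concrete, here is a small sketch of a timing helper that wraps only the call being measured and runs a few untimed warm-up calls first. The names mean_call_time and fake_inference are hypothetical; fake_inference stands in for the real sess.run call, and the warm-up count is an arbitrary assumption.

```python
import time


def mean_call_time(fn, n_runs=100, n_warmup=5):
    """Average wall-clock seconds per call of fn, excluding warm-up calls."""
    for _ in range(n_warmup):
        fn()  # warm-up: first calls often include one-off setup cost
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs


def fake_inference():
    # Hypothetical stand-in for sess.run(...); replace with the real call.
    sum(range(10000))


dt = mean_call_time(fake_inference, n_runs=20, n_warmup=2)
print("Speed (msec): {:.3f}".format(1000 * dt))
print("Speed (fps): {:.1f}".format(1 / dt))
```

Image loading, resizing, and overlay creation stay outside fn, so they do not count towards the reported number.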
I finished the training for the new model with pool5 input and the config json file was hypes.txt. (json extension is not supported, so I modified it).
The evaluation results are quite good: MaxF1 and average precision are almost the same as for the old model (fc7 model). Here are the output logs: fc7_model.txt vs. pool5_model.txt
I checked the execution time as below.
with GPU (NVIDIA 680)
Speed (msec): 189.79954719543457
Speed (fps): 5.268716468381827
with CPU
Speed (msec) (smooth) : 312.7150
Speed (fps) (smooth) : 3.1978
Unfortunately, there is no time improvement compared to the fc7 model. What is wrong with it?
@MarvinTeichmann, here is the relevant portion of inference code. As far as I can tell, we are only measuring inference time:
while True:
    frame = _grab_video_feed()
    if frame is None:
        raise SystemError('Issue grabbing the frame')
    # resize to default KittiSeg input for now
    frame = cv2.resize(
        frame, (shape[1], shape[0]), interpolation=cv2.INTER_CUBIC)
    numpy_final = np.asarray(frame)
    numpy_final = np.expand_dims(numpy_final, axis=0)
    start_time = timeit.default_timer()  # start timing inference
    predictions = sess.run(
        softmax_tensor, {'Inputs/fifo_queue_DequeueMany:0': numpy_final})
    time_taken = (timeit.default_timer() - start_time)  # end timing inference
    print('Took {} secs to perform inference'.format(time_taken))
    # the rest of this script concerns cosmetic post-processing
    output_image = predictions.reshape(shape[0], shape[1], -1)
    x = np.argmax(output_image, axis=2)
    segmented_img = np.zeros((shape[0], shape[1], 3), dtype=np.uint8)
    # convert output to color scheme defined by CLASS_COLORS
    for i, _ in enumerate(x):
        for j, _ in enumerate(x[i]):
            value = x[i][j]
            color_code = CLASS_COLORS[value]
            segmented_img[i][j] = color_code
    # overlay segmentation onto original image
    final_img = _blend_non_transparent(frame, segmented_img)
    # show overlayed image
    cv2.imshow('Prediction', final_img)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        sess.close()
        break
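The per-pixel coloring loop in the script above sits outside the timed region, so it does not affect the reported inference time, but it does limit the end-to-end frame rate. A possible alternative, sketched here under the assumption that CLASS_COLORS can be written as an array with one color row per class index (the palette values and the tiny label map are made up for illustration), is to index the palette with the label map directly:

```python
import numpy as np

# Hypothetical palette: one 3-channel color row per class index.
CLASS_COLORS = np.array(
    [[0, 0, 0], [0, 255, 0], [255, 0, 0], [0, 0, 255]], dtype=np.uint8)

# Stand-in for x = np.argmax(output_image, axis=2): a small label map.
x = np.array([[0, 1], [3, 2]])

# Indexing the palette with the label map colors every pixel in one step.
segmented_img = CLASS_COLORS[x]
print(segmented_img.shape)
```

The result has shape (height, width, 3) and dtype uint8, the same layout the nested loops fill in.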
I've redone the test measuring only inference (sess.run). In the last tests I timed the whole process, from reading the image and preprocessing to doing the predictions. I got better results. Here is some info:
GPU: NVIDIA GTX 1080
IMAGE SIZE: (700,1200,3)
NUM CLASSES: 4
BATCH SIZE: 1
FCN input: pool5
Code snippet for the speed test:
logging.info("Testing network speed on {} images".format(100))
start_time = time.time()
for i in xrange(100):
    sess.run([softmax], feed_dict={image_pl: image})
dt = (time.time() - start_time)/100
logging.info("Network speed during Inference is :")
logging.info("\tSpeed (sec): {}".format(dt))
logging.info("\tSpeed (msec): {}".format(1000*dt))
logging.info("\tSpeed (fps): {}".format(1/dt))
results :
2017-07-24 11:01:16,179 INFO Weights loaded successfully.
2017-07-24 11:01:16,734 INFO Testing network speed on 100 images
I tensorflow/core/kernels/logging_ops.cc:79] Shape of Validation/pool5:0[1 22 38 512]
I tensorflow/core/kernels/logging_ops.cc:79] Shape of upscore2[1 44 75 4]
I tensorflow/core/kernels/logging_ops.cc:79] Shape of upscore4[1 88 150 4]
I tensorflow/core/kernels/logging_ops.cc:79] Shape of upscore32[1 700 1200 4]
2017-07-24 11:01:28,033 INFO Network speed during Inference is :
2017-07-24 11:01:28,033 INFO Speed (sec): 0.112987201214
2017-07-24 11:01:28,033 INFO Speed (msec): 112.987201214
2017-07-24 11:01:28,033 INFO Speed (fps): 8.85055996836
so it's close to 9 fps
@bendidi I am also attempting to train and run inference using multiple classes. Can you share how you saved your TF graph for inference i.e., what graph optimizations you ran, if any? Freezing my graph yields:
Converted 35 variables to const ops.
194 ops in the final graph.
My current best guess for why my model is running so slowly is that there are excess nodes in the graph that are not needed for inference.
My changes to the source code are almost identical to @shivam-kotwalia's, here: https://github.com/shivam-kotwalia/KittiSeg/
[UPDATE: I tested the same script and .pb on a 1080Ti as well.]
Avg. speed of 100 consecutive calls to sess.run() on 1080 Ti:
Inference speed (sec): 0.054193212054669856
Inference speed (fps): 18.452495471041736
Avg. speed of 100 consecutive calls to sess.run() on TX2 in NV Power Mode: MAXN:
Inference speed (sec): 0.7519816180500039
Inference speed (fps): 1.3298197402659169
Perhaps I am just reaching the limits of the TX2's performance.
I didn't run any graph optimization, nor do I have XLA enabled (it's on my to-do list ^^).
@villanuevab With a Titan X Pascal I was able to get almost 25 fps on my latest run. Around 18.45 fps on a 1080 Ti sounds reasonable, especially if you did not push too hard to get the last bit of improvement out of TensorFlow.
@bendidi Your running time can largely be explained by the larger image size. The runtime is roughly linear with respect to the number of pixels of the input image. So 9 fps on your size will translate to 18 fps on kitti size input, which sounds competitive for a 1080.
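As a worked example of the scaling argument, the sketch below rescales a measured frame rate by the ratio of pixel counts. It assumes runtime is strictly proportional to the number of input pixels, which is only a rough model; scale_fps is a hypothetical helper, and the shapes are the (700, 1200) input and the default (384, 1248) resolution mentioned earlier in the thread.

```python
def scale_fps(fps, from_shape, to_shape):
    """Rescale fps assuming runtime is proportional to the pixel count."""
    from_pixels = from_shape[0] * from_shape[1]
    to_pixels = to_shape[0] * to_shape[1]
    return fps * from_pixels / to_pixels


measured_fps = 8.85  # reported above for a (700, 1200) input
kitti_fps = scale_fps(measured_fps, (700, 1200), (384, 1248))
print("Estimated fps at (384, 1248): {:.1f}".format(kitti_fps))
```

Under this strictly linear model the estimate comes out at about 15.5 fps, in the same ballpark as the rounded figure quoted above.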
@villanuevab I've also frozen the graph and got the same output as you:
Converted 35 variables to const ops.
194 ops in the final graph.
using Validation/Validation/decoder/Softmax as output nodes
The thing is I've got slightly worse results in terms of speed using the .pb file :
2017-07-31 18:44:47,900 INFO Network speed during Inference is :
2017-07-31 18:44:47,901 INFO Speed (sec): 0.118378000259
2017-07-31 18:44:47,901 INFO Speed (msec): 118.378000259
2017-07-31 18:44:47,901 INFO Speed (fps): 8.44751556716
compared to using the .data + .meta files :
2017-07-24 11:01:28,033 INFO Network speed during Inference is :
2017-07-24 11:01:28,033 INFO Speed (sec): 0.112987201214
2017-07-24 11:01:28,033 INFO Speed (msec): 112.987201214
2017-07-24 11:01:28,033 INFO Speed (fps): 8.85055996836
(tested multiple times). Is this normal, or am I doing something wrong? Thank you
@MarvinTeichmann I found the reason why the pool5 model was not fast.
In #L38, it seems that this code is not configurable depending on 'fcn_in' in hypes.
I found the configurable code in /decoder/fcn.py
However, I trained the pool5 model with kitti_multiloss.py, so the result did not change compared with the fc7 model.
Now I am retraining after applying the configurable code to 'kitti_multiloss.py'. I will share the result again.
Could you tell me what the difference is between fcn.py and kitti_multiloss.py?
Thanks a lot.
@bendidi, unfortunately, I have not done any systematic comparisons between loading the graph via .pb and (.data + .meta) files. I do not think I'll be able to any time soon, but please keep me posted on your progress and let me know if there's any other information I can provide.
I use an NVIDIA Quadro M2000M GPU for multi-class inference based on the FCN model.
The input image is 320 x 160 and the output is the same size, but with 4 channels. sample video
In my computing environment, the execution takes 200 ms per image, but I want to speed it up by at least a factor of 4.
Are there any ideas for accelerating the inference process?