Implementation of Single Shot MultiBox Detector (SSD) in TensorFlow, to detect and classify traffic signs. This implementation was able to achieve 40-45 fps on a GTX 1080 with an Intel Core i7-6700K.
Note that this project is still a work in progress. The main issue right now is model overfitting; I am currently working on pre-training the model on VOC2012 first, then performing transfer learning to traffic sign detection.
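For reference, below is a rough sketch of what restoring pre-trained base-network weights could look like in TensorFlow 1.x-style code. The variable scope name `base_net` and the checkpoint path are placeholders for illustration only, not names used in this repo.

```python
import tensorflow as tf

def restore_pretrained_base(sess, checkpoint_path, base_scope='base_net'):
    """Restore only the shared feature-extractor weights from a pre-trained
    checkpoint, leaving the detection/classification head randomly initialized.

    `base_scope` and `checkpoint_path` are placeholders; the actual variable
    names depend on how the model graph is defined.
    """
    base_vars = [v for v in tf.global_variables()
                 if v.name.startswith(base_scope + '/')]
    saver = tf.train.Saver(var_list=base_vars)
    saver.restore(sess, checkpoint_path)
```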
Currently only stop signs and pedestrian crossing signs are detected. Example detection images are below.
The model was trained on the LISA Traffic Sign Dataset, a dataset of US traffic signs.
Clone this repository somewhere; let's refer to that location as $ROOT
Training the model from scratch:
Download the LISA Traffic Sign Dataset, and store it in a directory $LISA_DATA
cd $LISA_DATA
cp $ROOT/data_gathering/create_pickle.py $LISA_DATA
python create_pickle.py
cd $ROOT
ln -s $LISA_DATA/resized_images_* .
ln -s $LISA_DATA/data_raw_*.p .
python data_prep.py
python train.py
python inference.py -m demo
To run predictions on your own images and/or videos, use the -i flag in inference.py (see the code for more details)
Obviously, we are only detecting certain traffic signs in this implementation, whereas the original SSD implementation detected a greater number of object classes in the PASCAL VOC and MS COCO datasets. Other notable differences are:
As mentioned above, this SSD implementation was able to achieve 40-45 fps on a GTX 1080 with an Intel Core i7-6700K.
The inference time is the sum of the neural network inference time and the Non-Maximum Suppression (NMS) time. The neural network inference time is significantly less than the NMS time: the network generally takes 7-8 ms, whereas NMS takes 15-16 ms. The NMS algorithm implemented here has not been optimized and runs on the CPU only, so there is room for further performance improvement there.
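For illustration, here is a simple greedy NMS written in NumPy, similar in spirit to the CPU-side post-processing described above (this is a sketch, not the exact code used in this repo):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression on boxes in (x1, y1, x2, y2) format."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep
```

Vectorizing this further, or replacing it with a built-in op such as tf.image.non_max_suppression, is one possible way to reduce the NMS bottleneck.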
The entire LISA Traffic Sign Dataset consists of 47 distinct traffic sign classes. Since we are only concerned with a subset of those classes, we only use a subset of the LISA dataset. We also ignore all training samples where we do not find a matching default box, further reducing the dataset's size. As a result, we end up with very little data to work with.
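As an illustration of that matching step, the sketch below checks whether a training sample has any ground-truth box that overlaps some default box by at least a given IoU; the 0.5 threshold is an assumption, and this is not the repo's actual matching code:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def has_match(gt_boxes, default_boxes, iou_thresh=0.5):
    """True if any ground-truth box overlaps some default box above the threshold;
    samples where this is False would be discarded."""
    return any(iou(gt, db) >= iou_thresh for gt in gt_boxes for db in default_boxes)
```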
To mitigate the lack of data, we can perform image data augmentation and/or pre-train the model on a larger dataset (e.g. VOC2012, ILSVRC).
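For example, a minimal photometric augmentation (brightness and contrast jitter) leaves the bounding-box labels untouched, so no label bookkeeping is needed; the jitter ranges below are arbitrary assumptions, not values used in this repo:

```python
import numpy as np

def augment_image(image, rng=np.random):
    """Random brightness and contrast jitter on a uint8 image.

    Geometric augmentations (crops, translations, flips) would also require
    adjusting the ground-truth boxes accordingly.
    """
    img = image.astype(np.float32)
    img += rng.uniform(-32, 32)   # brightness shift
    img *= rng.uniform(0.8, 1.2)  # contrast scaling
    return np.clip(img, 0, 255).astype(np.uint8)
```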
Given the small size of our pruned dataset, I chose a train/validation split of 95/5. The model was trained with the Adadelta optimizer, using the default parameters provided by TensorFlow, over 200 epochs with a batch size of 32.
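A rough sketch of that training configuration in TensorFlow 1.x-style code is shown below; `model_loss` is a stand-in for the SSD loss defined elsewhere in the graph, and this is not the repo's actual train.py:

```python
import tensorflow as tf

# Training hyperparameters described above.
NUM_EPOCHS = 200
BATCH_SIZE = 32
TRAIN_FRACTION = 0.95  # 95/5 train/validation split

def build_train_op(model_loss):
    # Adadelta with TensorFlow's default hyperparameters.
    optimizer = tf.train.AdadeltaOptimizer()
    return optimizer.minimize(model_loss)
```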
There are multiple potential areas of improvement in this project: