A novel architectural design for stitching video streams in real-time on an FPGA.
The designed architecture generates a video with a wider field of view by stitching two input video streams based on features and keypoints. In simple terms, the output is a panorama, but as video. The architecture is optimized so that the output can be produced in real time.
The figure below illustrates the block diagram of the system depicting each step of the algorithm.
The system can be broadly divided into three subsystems:
The input video stream for the system is in 8-bit RGB format, as shown in the figure. Each frame of the video stream has three channels corresponding to red, green and blue. The colour information in the video frames does not improve feature detection, and computation on a three-channel 8-bit image takes longer than on a single-channel 8-bit image. Therefore, each RGB video frame is converted to an 8-bit grayscale image, also shown in the figure. The grayscale images have less noise, retain more detail in the shadows and are cheaper to process.
Input image | Grayscale image |
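The conversion step can be sketched in Python as below. This is a minimal sketch, not the repository's implementation: the README does not state which channel weights the design uses, so the integer weights 77, 150 and 29 (an 8-bit fixed-point approximation of the common 0.299, 0.587, 0.114 luma weights) and the function name `rgb_to_gray` are assumptions chosen because they need only multiplies, adds and a shift.

```python
import numpy as np


def rgb_to_gray(rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 RGB frame to an HxW uint8 grayscale frame.

    Assumed weights: (77, 150, 29) / 256, which sum to exactly 1.0.
    """
    r = rgb[..., 0].astype(np.uint32)
    g = rgb[..., 1].astype(np.uint32)
    b = rgb[..., 2].astype(np.uint32)
    # Weighted sum followed by a right shift by 8 (division by 256).
    gray = (77 * r + 150 * g + 29 * b) >> 8
    return gray.astype(np.uint8)


if __name__ == "__main__":
    frame = np.zeros((2, 2, 3), dtype=np.uint8)
    frame[0, 0] = (255, 255, 255)  # white pixel
    print(rgb_to_gray(frame))
```

Because the weights sum to 256, a white pixel maps to 255 and a black pixel to 0, so the full 8-bit range is preserved.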
Feature extraction from the grayscale images is done using the SIFT algorithm. The SIFT algorithm can be separated into two main steps:
Keypoint Detection
SIFT begins with discrete convolution of the input image with different Gaussian filters. A Gaussian filter is a widely used image smoothing filter defined as:

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$$
In the above equation, G is the Gaussian kernel at the point (x, y) and σ is the Gaussian parameter. A larger value of σ produces a stronger smoothing effect on the image. Discrete convolution of the image with a Gaussian kernel generates an image with less noise and less detail. In this design, discrete convolution with a Gaussian kernel is done with four different values of σ. Progressively higher values of σ are used to generate a set of blurred images, called an octave.
Input image | Sigma = 1.6 | Sigma = 2.26 | Sigma = 3.2 | Sigma = 4.5 |
For a given value of σ, the sum of all coefficients in the convolution kernel should be equal to unity. A wider Gaussian needs more taps before its tails become negligible, so the size of the kernel increases as the value of σ increases.
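The octave generation described above can be sketched with NumPy as follows. The four σ values come from the figure captions; everything else is an assumption made for illustration: the kernel radius of ceil(3σ), the edge-replicating border handling, and the names `gaussian_kernel`, `convolve2d` and `build_octave`. The FPGA design may size and quantize its kernels differently.

```python
import numpy as np

SIGMAS = (1.6, 2.26, 3.2, 4.5)  # values taken from the figure captions


def gaussian_kernel(sigma: float) -> np.ndarray:
    """Sampled 2-D Gaussian, normalised so the coefficients sum to 1."""
    radius = int(np.ceil(3 * sigma))  # assumed truncation at 3 sigma
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Same-size discrete convolution with replicated borders."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    h, w = image.shape
    padded = np.pad(image.astype(np.float64), ((ph, ph), (pw, pw)), mode="edge")
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(kh):
        for dx in range(kw):
            out += flipped[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def build_octave(gray: np.ndarray, sigmas=SIGMAS) -> list:
    """Blur the grayscale frame once per sigma value."""
    return [convolve2d(gray, gaussian_kernel(s)) for s in sigmas]


if __name__ == "__main__":
    for s in SIGMAS:
        k = gaussian_kernel(s)
        print(s, k.shape, round(float(k.sum()), 6))
```

With the 3σ truncation assumed here, the kernel grows from 11x11 at σ = 1.6 to 29x29 at σ = 4.5, which illustrates why larger σ values cost more hardware resources.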
Once the octave is generated, a DoG space is built from the four images in the octave. DoG stands for difference of Gaussians, a computationally efficient approximation of the Laplacian of Gaussian (LoG). The DoG space is built by computing the difference between two adjacent Gaussian scale images, pixel by pixel. The DoG space of the four images in the octave therefore has three levels.
Top level DoG | Middle level DoG | Bottom level DoG |
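The pixel-by-pixel subtraction can be sketched as below. The function name `build_dog` and the sign convention (the more blurred image minus the less blurred one) are assumptions; the README does not specify the order of subtraction.

```python
import numpy as np


def build_dog(octave: list) -> list:
    """Difference of adjacent Gaussian images: N images give N - 1 levels."""
    levels = []
    for lower, higher in zip(octave[:-1], octave[1:]):
        # Cast to float so the difference can be negative.
        levels.append(higher.astype(np.float64) - lower.astype(np.float64))
    return levels


if __name__ == "__main__":
    octave = [np.full((4, 4), v, dtype=np.uint8) for v in (10, 12, 15, 19)]
    dog = build_dog(octave)
    print(len(dog), [float(level[0, 0]) for level in dog])
```

The cast before subtraction matters for 8-bit inputs: subtracting unsigned values directly would wrap around instead of producing negative differences.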
Keypoints are extracted from the DoG space by finding local maxima or minima. A pixel is considered a keypoint if it is a local maximum or minimum within a 26-pixel neighbourhood consisting of 9 pixels in the top level, the 8 surrounding pixels in the middle level and 9 pixels in the bottom level.
Keypoints
Keypoints using OpenCV SIFT function | Keypoints using SIFT implementation in Python | Keypoints generated by the FPGA design |
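The 26-neighbour extremum test described above can be sketched as follows. This is an illustrative software version only: the function name `find_keypoints`, the optional `threshold` parameter for rejecting low-contrast responses, and the requirement of a strict maximum or minimum are assumptions, not details confirmed by the README.

```python
import numpy as np


def find_keypoints(dog: list, threshold: float = 0.0) -> list:
    """Return (row, col) of middle-level pixels that are 3x3x3 extrema.

    `dog` is a list of three equally sized 2-D arrays
    (bottom, middle, top DoG levels).
    """
    stack = np.stack(dog).astype(np.float64)  # shape (3, H, W)
    _, h, w = stack.shape
    keypoints = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            cube = stack[:, y - 1:y + 2, x - 1:x + 2]
            centre = cube[1, 1, 1]
            if abs(centre) < threshold:
                continue  # assumed low-contrast rejection
            # Flattened index 13 is the centre; the rest are the 26 neighbours.
            neighbours = np.delete(cube.ravel(), 13)
            if centre > neighbours.max() or centre < neighbours.min():
                keypoints.append((y, x))
    return keypoints


if __name__ == "__main__":
    levels = [np.zeros((5, 5)) for _ in range(3)]
    levels[1][2, 2] = 5.0  # single peak in the middle level
    print(find_keypoints(levels))
```

Border pixels are skipped because they do not have a complete 3x3 neighbourhood in each level.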
A keypoint descriptor is a distinctive identifier for a particular keypoint. SIFT uses the gradient magnitude and direction around the keypoint as the basis for the descriptor. The gradient magnitude and direction at a point can be calculated by discrete convolution of the image with Sobel filters.
Sobel convolution output
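A minimal sketch of the gradient computation with 3x3 Sobel filters is given below. The kernels are applied as a correlation (no kernel flip), so a positive horizontal response means intensity increases to the right; a true convolution would only flip the sign. The function name `sobel_gradients`, the replicated borders and the use of degrees in [0, 360) are assumptions for illustration.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def sobel_gradients(gray: np.ndarray):
    """Return (magnitude, direction in degrees) for a 2-D grayscale image."""
    h, w = gray.shape
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    gx = np.zeros((h, w), dtype=np.float64)
    gy = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            window = padded[dy:dy + h, dx:dx + w]
            gx += SOBEL_X[dy, dx] * window
            gy += SOBEL_Y[dy, dx] * window
    magnitude = np.hypot(gx, gy)
    direction = np.degrees(np.arctan2(gy, gx)) % 360.0
    return magnitude, direction


if __name__ == "__main__":
    ramp = np.tile(np.arange(5, dtype=np.float64), (5, 1))  # brighter to the right
    mag, ang = sobel_gradients(ramp)
    print(mag[2, 2], ang[2, 2])
```

In hardware, the square root and arctangent are usually replaced by cheaper approximations; the floating-point versions here are only meant to show what is being computed.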
To generate the keypoint descriptor, the gradient magnitude and direction of every point inside a 16x16 window around each keypoint are calculated. The gradient magnitudes of the 16x16 window are convolved with a Gaussian kernel. The window is then divided into a 4x4 grid of 16 cells, each covering 4x4 pixels, and the gradient magnitudes in every cell are combined. Finally, the gradient directions in each of the 16 cells are accumulated into eight bins. The result is a 16 x 8 = 128 element vector, which acts as the keypoint descriptor.
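The descriptor layout can be sketched as below. Several choices here are assumptions rather than the design's confirmed behaviour: the Gaussian is applied as an element-wise weighting centred on the window with σ = 8 (as in the usual SIFT formulation) instead of a convolution, no rotation normalisation or bin interpolation is done, the final vector is scaled to unit length, and the name `sift_descriptor` is hypothetical.

```python
import numpy as np


def sift_descriptor(magnitude: np.ndarray, direction: np.ndarray,
                    y: int, x: int) -> np.ndarray:
    """Build a 128-element descriptor from a 16x16 window around (y, x).

    `direction` is expected in degrees in [0, 360).
    """
    h, w = magnitude.shape
    if y < 8 or x < 8 or y + 8 > h or x + 8 > w:
        raise ValueError("keypoint too close to the image border")
    mag = magnitude[y - 8:y + 8, x - 8:x + 8].astype(np.float64)
    ang = direction[y - 8:y + 8, x - 8:x + 8].astype(np.float64)

    # Assumed Gaussian weighting centred on the window (sigma = 8).
    ax = np.arange(16) - 7.5
    xx, yy = np.meshgrid(ax, ax)
    mag = mag * np.exp(-(xx ** 2 + yy ** 2) / (2.0 * 8.0 ** 2))

    bins = (ang // 45).astype(int) % 8       # eight 45-degree bins
    hist = np.zeros((4, 4, 8), dtype=np.float64)
    for r in range(16):
        for c in range(16):
            hist[r // 4, c // 4, bins[r, c]] += mag[r, c]

    vec = hist.ravel()                        # 4 * 4 * 8 = 128 elements
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


if __name__ == "__main__":
    mag = np.ones((32, 32))
    ang = np.zeros((32, 32))
    d = sift_descriptor(mag, ang, 16, 16)
    print(d.shape, int(np.count_nonzero(d)))
```

With all directions equal to 0 degrees, only the first bin of each of the 16 cells is populated, which makes the 16 x 8 structure of the vector easy to see.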
Frame stitching is the process of combining two frames into a single image. Frame stitching is done in two steps:
Keypoint Matching
The keypoint descriptors of keypoints in the video frames from both camera sensors are compared. If the difference between the keypoint descriptors of two keypoints, one from each camera sensor, is below an error threshold, they are considered a keypoint pair. The keypoint pair with the smallest difference between its keypoint descriptors is taken as the reference keypoint pair.
Input image from left camera | Input image from right camera |
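The matching rule above can be sketched as follows. The README does not say how the difference between two descriptors is measured, so the sum of absolute differences used here is an assumption (it avoids multipliers, which suits hardware), as are the function name `match_keypoints` and the choice to pair each left keypoint only with its closest right keypoint.

```python
import numpy as np


def match_keypoints(desc_left: np.ndarray, desc_right: np.ndarray,
                    threshold: float):
    """Pair descriptors whose difference is below `threshold`.

    desc_left has shape (N, D) and desc_right has shape (M, D).
    Returns (pairs, reference), where each pair is
    (left_index, right_index, difference) and `reference` is the pair
    with the smallest difference, or None if nothing matched.
    """
    left = desc_left.astype(np.float64)[:, None, :]
    right = desc_right.astype(np.float64)[None, :, :]
    dist = np.abs(left - right).sum(axis=2)  # (N, M) sum of absolute differences
    pairs = []
    for i in range(dist.shape[0]):
        j = int(np.argmin(dist[i]))          # closest right descriptor
        if dist[i, j] < threshold:
            pairs.append((i, j, float(dist[i, j])))
    reference = min(pairs, key=lambda p: p[2]) if pairs else None
    return pairs, reference


if __name__ == "__main__":
    left = np.array([[0.0, 0.0], [10.0, 10.0], [50.0, 50.0]])
    right = np.array([[10.0, 11.0], [0.0, 3.0]])
    print(match_keypoints(left, right, threshold=5.0))
```

The toy descriptors in the usage example have two elements instead of 128 only to keep the output readable; the function works for any descriptor length.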
Image Blending
A weighted average method is used to blend the two frames into a single image. The values of pixels in the overlapped region are equal to the weighted average of the pixel values of both frames. The weights are chosen based on the distance between the overlapped pixel and the border of the corresponding frame.
Stitched image
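A minimal sketch of distance-weighted blending is shown below. It assumes the simplest possible alignment, which the README does not confirm: two grayscale frames of equal height that differ only by a horizontal shift, with a known number of overlapping columns. The linear weight ramp and the name `blend_horizontal` are likewise illustrative.

```python
import numpy as np


def blend_horizontal(left: np.ndarray, right: np.ndarray,
                     overlap: int) -> np.ndarray:
    """Stitch two grayscale frames that share `overlap` columns."""
    h, wl = left.shape
    hr, wr = right.shape
    if h != hr or not 1 <= overlap <= min(wl, wr):
        raise ValueError("frames must share height and a valid overlap")
    out = np.zeros((h, wl + wr - overlap), dtype=np.float64)
    out[:, :wl - overlap] = left[:, :wl - overlap]   # left-only region
    out[:, wl:] = right[:, overlap:]                 # right-only region

    # Weight of the left frame falls linearly as the pixel approaches
    # the left frame's right border; the right frame gets the remainder.
    w_left = np.linspace(1.0, 0.0, overlap + 2)[1:-1]
    out[:, wl - overlap:wl] = (w_left * left[:, wl - overlap:]
                               + (1.0 - w_left) * right[:, :overlap])
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    a = np.full((4, 6), 100, dtype=np.uint8)
    b = np.full((4, 6), 200, dtype=np.uint8)
    print(blend_horizontal(a, b, overlap=2)[0])
```

Because the two weights always sum to one, regions where both frames agree are left unchanged, and brightness differences between the cameras fade gradually across the overlap instead of producing a visible seam.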
The top-level block schematic of the architecture is shown in the figure below.
Block Schematic
The top level design is divided into five stages:
The following packages need to be installed on the Linux system before executing the source code.
Icarus Verilog
apt-get install iverilog
Python
apt-get install python3
OpenCV
pip3 install opencv-contrib-python
numpy
pip3 install numpy
PIL (Python Imaging Library)
pip3 install pillow
git clone https://github.com/AugustinJose1221/FPGA-Build.git
cd FPGA-Build/make
make create
make simulate
python3 hexToImage.py
See the open issues for a list of proposed features (and known issues).
Any contributions you make are greatly appreciated.
git checkout -b feature/AmazingFeature
git commit -m 'Add some AmazingFeature'
git push origin feature/AmazingFeature
Distributed under the MIT License. See LICENSE for more information.
Twitter: @augustinjose121
Gmail: augustinjose1221@gmail.com
Discuss: GitHub Discussions