
This project, developed as a 2024 graduation project at Hanyang University, uses YOLOv7-tiny on a Jetson Nano to recognize handwritten notes in lecture videos and summarize them into a single PDF file.

GRIK

Hanyang University Graduation Project in 2024.

License

"This project is licensed under the terms of the GNU GPL-3.0 License."
Please make sure this license when you fork our project.

Motivation and Objectives

Motivation

In engineering lectures, handwritten notes such as graphs and diagrams are just as important as the spoken explanation. Existing lecture summarization services convert the audio into text and then summarize it with LLMs, so their summaries do not reflect the handwritten notes.

Objectives

The goal is to perform object detection on the handwritten notes and then process those sections to summarize them into a PDF.

Technical contributions

1. Baseline

We established a baseline for the project by running object detection with YOLOv7-tiny, trained for 100 epochs with a batch size of 8. The Structural Similarity Index (SSIM) from scikit-image was used to compare frames, and background removal from the handwritten notes was implemented with a nested for loop over pixels.
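For illustration, the baseline frame comparison with scikit-image's SSIM can be sketched roughly as follows; the file names and the threshold are illustrative assumptions, not the project's actual code:

```python
# Minimal sketch of the baseline frame comparison using scikit-image's SSIM.
# The file names and threshold are illustrative assumptions.
import cv2
from skimage.metrics import structural_similarity as ssim

prev = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_0002.png", cv2.IMREAD_GRAYSCALE)

score = ssim(prev, curr)   # 1.0 means the frames are identical
if score < 0.95:           # assumed threshold for "new handwriting appeared"
    print(f"Frames differ (SSIM = {score:.3f}); keep this frame")
```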

2. Skill/Knowledge

2.1 Dataset

Frames containing handwritten notes were extracted from selected lecture videos and club seminars. The handwritten sections were then annotated with bounding boxes to build a custom dataset, which was created with Roboflow.

2.2 Model

Transfer learning was performed on the custom dataset starting from the pre-trained YOLOv7-tiny weights. To achieve results comparable to those of the full YOLOv7, we adjusted the number of epochs and the batch size, and found that 400 epochs with a batch size of 32 were the optimal settings.

3. Novelty

By using an OpenCV-based SSIM implementation instead of scikit-image, we achieved a twofold improvement in performance. The YOLOv7-tiny model was optimized with TensorRT, reducing processing time by more than tenfold. In the background removal step, rather than merely taking the absolute difference between the annotated frame and the original frame, we used the difference image as a mask, further reducing the processing time per frame.

Implementation

The project leverages the computational power of the NVIDIA Jetson Nano board and its dedicated GPU to run deep learning models for object detection and speech recognition efficiently. The system is implemented using Python and various deep learning frameworks, such as TensorFlow and PyTorch.

Key Features

Object Detection

The system utilizes advanced object detection algorithms to identify and extract relevant visual information from the lecture videos, such as slides, diagrams, and handwritten notes.
This visual data is then incorporated into the generated PDF notes for better comprehension and context.

Summarization and Formatting

The transcribed text and extracted visual information are intelligently processed and summarized to create concise and well-structured PDF notes. The notes are formatted with appropriate headings, bullet points, and visual aids, ensuring a clear and organized presentation of the lecture content.

Potential Applications

Educational Institutions

Lecture Summarizer can be a valuable tool for students and educators, enabling them to quickly review and revisit lecture content in a condensed and organized format.

Online Learning Platforms

The system can be integrated into online learning platforms to provide learners with comprehensive and easily digestible summaries of video lectures.

Corporate Training

Businesses can utilize Lecture Summarizer to create concise training materials from recorded sessions, facilitating knowledge transfer and employee development.

About Our Project

Why this project is named "GRIK"

We often find that project code names are inspired by food, such as Eclair, Gingerbread, and Honeycomb.
In our case, we named our project "GRIK" because of our fondness for Greek yogurt.
The name not only reflects our culinary preferences but also pays homage to the rich cultural heritage of Greece.

Dependencies

How to run our project

Step 1. Download our project file.
Step 2. Open your terminal.
Step 3. Run python3 app.py

Implementation Mechanism

[ Fig. 01. Program Architecture Diagram ]

This project processes video files through a total of four stages. Initially, the video file is received in app.py, which passes the video storage path as a parameter to Imagehash.py. Subsequently, images are extracted using the pHash-based Hamming Distance to differentiate between the pre-annotated and post-annotated frames. The image paths are then forwarded to detection.py, where YOLOv7-tiny is employed to extract the annotated sections. The extracted image paths are passed to extract.py for background removal. Finally, all processed images are compiled into a single PDF using makePDF.py.
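The flow can be sketched roughly as follows; the module names match the repository, but the function names and signatures below are assumptions made for illustration:

```python
# Rough sketch of the four-stage pipeline described above. The module names
# (Imagehash.py, detection.py, extract.py, makePDF.py) come from the repository,
# but the function names and signatures below are assumptions for illustration.
import Imagehash   # Step 1: pHash-based key-frame extraction
import detection   # Step 2: YOLOv7-tiny annotation detection (TensorRT engine)
import extract     # Step 3: background removal
import makePDF     # Step 4: PDF generation

def run(video_path: str, out_pdf: str = "summary.pdf") -> None:
    frame_paths = Imagehash.extract_keyframes(video_path)       # pre-/post-annotation frames
    detected_paths = detection.detect_annotations(frame_paths)  # cropped annotated regions
    cleaned_paths = extract.remove_background(detected_paths)   # annotation-only images
    makePDF.build_pdf(cleaned_paths, out_pdf)                   # compile into one PDF

if __name__ == "__main__":
    run("lecture.mp4")
```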

Step 1. Imagehash

Initially, the algorithm employed in this project was the Structural Similarity Index (SSIM). However, due to the excessive processing time encountered [1] during implementation on the Jetson Nano, we transitioned to ImageHash for improved performance.

[ Fig. 02. Runtime Comparison of Similarity Algorithms ]

Comparing the results, the runtime of scikit-image's SSIM was 411 seconds for an 11-minute 720p video at 24 fps, while the OpenCV-based SSIM took 200 seconds. In contrast, ImageHash reduced the runtime to 124 seconds. The substantial decrease after switching to ImageHash shows that the algorithmic change was meaningful and beneficial.

[ Fig. 03. pHash Computation Process and Method for Calculating Hamming Distance ]

By employing this method, the project utilized dynamic programming (DP) to calculate the Hamming distance, thereby reducing the time complexity compared to SSIM.

[ Fig. 04. Algorithm Improvement through the Application of Dynamic Programming (DP) ]
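For reference, pHash-based key-frame selection with the ImageHash library can be sketched as below; the video path and the Hamming-distance threshold are illustrative assumptions:

```python
# Sketch of pHash-based key-frame selection. The Hamming distance between two
# perceptual hashes is given by the "-" operator of the ImageHash library.
# The video path and distance threshold are illustrative assumptions.
import cv2
import imagehash
from PIL import Image

cap = cv2.VideoCapture("lecture.mp4")
prev_hash, keyframes = None, []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    cur_hash = imagehash.phash(Image.fromarray(rgb))         # 64-bit perceptual hash
    if prev_hash is not None and cur_hash - prev_hash > 10:  # Hamming distance threshold
        keyframes.append(frame)                              # content changed noticeably
    prev_hash = cur_hash

cap.release()
print(f"Selected {len(keyframes)} key frames")
```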

Step 2. Object Detection

To accurately recognize the annotations and extract the annotated sections, we determined that an object detection approach was appropriate. Consequently, we opted for YOLOv7-tiny, whose feasibility on the Jetson Nano had already been verified. However, a challenge arose because the pre-trained YOLOv7 weights were not trained on annotation data. To address this, we used the Roboflow platform to create a custom dataset by extracting frames from videos of seminars and external lectures. We then trained the model in the Colab environment while varying the hyperparameters.

[ Fig. 05. Training Results by Hyperparameters ]

The training results showed a trend of increasing accuracy with larger values for both the number of epochs and the batch size. In the final version, we used a model trained for 400 epochs with a batch size of 32.
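For reference, transfer learning with these settings roughly follows the official YOLOv7 repository's train.py; the dataset file (data/handwriting.yaml) and the run name below are assumptions, not the project's actual configuration:

```
python train.py --device 0 --batch-size 32 --epochs 400 \
    --img-size 640 640 --data data/handwriting.yaml \
    --cfg cfg/training/yolov7-tiny.yaml --weights yolov7-tiny.pt \
    --hyp data/hyp.scratch.tiny.yaml --name grik-yolov7-tiny
```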

[ Fig. 06. Inference Time per Image Before Optimization ]

However, due to performance constraints on the Jetson Nano, the inference time per image was 28 seconds. This led us to conclude that GPU optimization was necessary. Consequently, we developed the following optimization algorithm.

[ Fig. 07. Quantization and TensorRT Optimization Algorithm ]

We therefore chose to convert the model to ONNX format and apply TensorRT optimization. However, we encountered an issue with NMS layer processing: this layer sits just before YOLOv7's output, and TensorRT does not support optimizing it. We resolved this by adapting the export method described in the official YOLOv7 GitHub repository.

[ Fig. 08. Results After TensorRT Optimization ]

As a result, we were able to reduce the inference time from the original 28 seconds per image to 2.5 seconds. Notably, we attribute a significant portion of this improvement to the FP16 quantization implemented during the conversion to TensorRT.
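As an illustration of this conversion step, an FP16 engine can be built from the exported ONNX model with TensorRT's Python API roughly as follows; the file names and workspace size are assumptions, and exact API names vary slightly between TensorRT versions (this follows the TensorRT 8.x API shipped with JetPack):

```python
# Minimal sketch: build an FP16 TensorRT engine from the exported ONNX model.
# File names and the workspace size are assumptions; exact API names differ
# slightly between TensorRT versions.
import tensorrt as trt

ONNX_PATH = "yolov7-tiny-nms.onnx"
ENGINE_PATH = "yolov7-tiny-fp16.trt"

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # FP16 quantization
config.max_workspace_size = 1 << 28     # 256 MiB, small enough for the Jetson Nano

engine = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(engine)
```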

Step 3. Background Removal

Initially, we aimed to extract the annotations by taking the absolute difference between the post-annotation and pre-annotation images. However, the original annotation colors were distorted in areas overlapping the background. To address this, we devised a method that uses the absolute-difference image as a mask applied back to the annotated image. During this process, we found that vectorized NumPy indexing significantly reduced the processing time per image, so we adopted this approach.

[ Fig. 09. Background Removal Algorithm ]
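A simplified sketch of this masking idea, using cv2.absdiff and NumPy boolean indexing instead of a nested pixel loop (the file names and threshold value are assumptions):

```python
# Sketch of mask-based background removal: the absolute difference between the
# pre- and post-annotation frames is thresholded into a mask, and the mask is
# applied back to the annotated frame with NumPy indexing (no per-pixel loop).
# File names and the threshold value are illustrative assumptions.
import cv2
import numpy as np

before = cv2.imread("frame_before.png")   # frame without handwriting
after = cv2.imread("frame_after.png")     # same frame with handwriting

diff = cv2.absdiff(after, before)                           # highlights the handwriting
gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
_, mask = cv2.threshold(gray, 20, 255, cv2.THRESH_BINARY)   # binary handwriting mask

result = np.full_like(after, 255)          # start from a white background
result[mask > 0] = after[mask > 0]         # keep the original annotation colors only
cv2.imwrite("annotation_only.png", result)
```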

Step 4. Make PDF File

[ Fig. 10. PDF Generation Method ]

In this step, we used the images generated in steps 1 to 3 to build the PDF pages, each containing the annotated frame, the extracted annotation itself, and the frame without annotations.
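One simple way to merge the processed images into a single PDF is with Pillow; the library choice and file names here are assumptions and not necessarily what makePDF.py uses:

```python
# Sketch: merge the processed page images into one PDF with Pillow.
# The image list and output file name are illustrative assumptions.
from PIL import Image

page_files = ["page_01.png", "page_02.png", "page_03.png"]
pages = [Image.open(p).convert("RGB") for p in page_files]

# Pillow writes a multi-page PDF when save_all and append_images are given.
pages[0].save("lecture_summary.pdf", save_all=True, append_images=pages[1:])
```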

Discussion and Conclusion

There are various metrics for measuring similarity. Between PSNR and SSIM, SSIM better reflects the differences perceived by the human eye. Although SSIM is known to require less computation than YOLOv7, we found that its computational load grows steeply as image resolution increases. Resizing images with OpenCV can reduce this load, but we expected the resizing step itself to add latency. In this respect, choosing hashing to quantify structural differences was a good decision. We also found that optimizing the PyTorch model with TensorRT and using a masking approach instead of nested loops greatly improves computational speed.

Citation

[1] Fitri N. Rahayu, Ulrich Reiter, et al., "Analysis of SSIM Performance for Digital Cinema Applications," IEEE, 978-1-4244-4370-3/09, 2009.

BUILD History

Demo

You can see our demo video here.