Student Chatbot

What is it

A chatbot that summarizes uploaded PDF files.

A detailed report on the development process, along with some lessons learned, can be found here.

Project Scope

The project builds a pipeline that accepts user input and uploaded files, processes the files, and searches them for the user's input. The following image describes the process schematically:

An intuitive, interactive user interface is built with Streamlit.
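
The "process the files, then search for the user input" step described above can be illustrated with a minimal sketch. It assumes plain keyword overlap as the search criterion, which is a stand-in and not necessarily what the project uses; the function names `tokenize`, `split_into_chunks`, and `rank_chunks` are hypothetical and chosen only for this example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def rank_chunks(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return up to top_k chunks that mention the query terms most often.

    Assumption: relevance is approximated by counting query-term
    occurrences; a real system would likely use embeddings instead.
    """
    query_terms = set(tokenize(query))
    scored = []
    for index, chunk in enumerate(chunks):
        counts = Counter(tokenize(chunk))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, index, chunk))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in scored[:top_k]]
```

In this sketch, the extracted document text would first go through `split_into_chunks`, and the user's question would then be passed to `rank_chunks` to pick the passages handed to the model.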

Text Extraction

Python library vs. OCR
- Python library: structure-based text extraction -> no format context
- OCR: ML-based approach that additionally detects document structure for more complete information
- Custom OCR model based on LayoutParser
- Annotations done via Label Studio
- Training done on a local RTX 2070S GPU (~2 h)
- Shows promise, but not enough to provide a reliable benefit over simple Python extraction
- Ended up with structure-based extraction due to the time and effort required to build a fully functional custom OCR model
- Further instructions and an implementation example here

Semantic Chunking

To improve how the text is split, we implemented a different approach that tries to identify chunk boundaries based on semantics. Further explanation and code can be found here.
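
As a rough illustration of the idea, the sketch below starts a new chunk whenever two neighbouring sentences are semantically dissimilar. It is not the project's implementation: it assumes bag-of-words cosine similarity as a stand-in for real sentence embeddings, a naive punctuation-based sentence splitter, and an arbitrary `threshold` of 0.2. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence: str) -> Counter:
    """Stand-in for a sentence embedding (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two word-count vectors; 0.0 if either is empty."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences, splitting where similarity drops below threshold."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        similarity = cosine_similarity(bag_of_words(previous), bag_of_words(current))
        if similarity < threshold:
            chunks.append([current])      # topic shift: start a new chunk
        else:
            chunks[-1].append(current)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]
```

Swapping `bag_of_words` for an embedding model would keep the same structure; only the similarity measure and a suitable threshold would change.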

Models

We tested three models using an open-source lecture set from MIT -> Lecture Notes
The models are:
Models are compared and evaluated based on the following categories
Evaluation and test results can be found here

Collaborators

Lars Kurschilgen
Nicholas Link
Alexander Paul
Adrian Setz
Lucas Wätzig
Jan Wolter

Requirements

Clone the repository
Create a virtual Python environment -> Link
Install the required packages listed in requirements.txt
The application is built using Streamlit. To run the app, execute the following command in the project's directory: streamlit run app.py
The app will open in a new browser tab. If it does not, follow the link displayed in your terminal
