Student Chatbot
What is it
A chatbot that summarizes uploaded PDF files.
A detailed report on the development process, along with some lessons learned, can be found here.
Project Scope
The goal is a pipeline that accepts user input and uploaded files, processes the files, and searches them for content matching the user input.
The following image describes the process schematically:
The intuitive, interactive user interface is built with Streamlit.
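The upload-and-question flow can be sketched roughly as follows. This is a minimal illustration, not the project's actual app.py: `build_status` and `main` are hypothetical names, and the extraction, chunking and model steps are left as a comment. Only `st.title`, `st.file_uploader`, `st.text_input`, `st.info` and `st.write` are Streamlit calls.

```python
def build_status(file_name: str, size_bytes: int, question: str) -> str:
    """Summarize what the app received (pure function, easy to test)."""
    size_kb = size_bytes / 1024
    return f"Searching {file_name} ({size_kb:.1f} KB) for: {question.strip()}"


def main() -> None:
    # Imported inside main() so build_status can be used without Streamlit.
    import streamlit as st

    st.title("Student Chatbot")
    uploaded = st.file_uploader("Upload a PDF file", type=["pdf"])
    question = st.text_input("What would you like to know?")

    if uploaded is None:
        st.info("Upload a PDF file to get started.")
    elif not question.strip():
        st.info("Enter a question about the uploaded file.")
    else:
        # Hypothetical hook: text extraction, chunking and the model call
        # would run here before anything is shown to the user.
        st.write(build_status(uploaded.name, uploaded.size, question))
```

To try the sketch, save it under a file name of your choice, add a call to `main()` at the end of the file, and start it with `streamlit run`.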
Text Extraction
- Python library vs OCR
  - Python library: structure-based text extraction -> no formatting context
  - OCR: ML-based approach that additionally detects document structure for more complete information
- Custom OCR model based on LayoutParser
  - Annotations done via LabelStudio
  - Training done on a local RTX 2070S GPU (~2h)
  - Shows promise, but not enough to provide a reliable benefit over simple Python extraction
- Ended up with structure-based extraction due to the time and effort required to build a fully functional custom OCR model
- Further instructions and an implementation example can be found here
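Structure-based extraction with a Python library could look roughly like the sketch below. It assumes the pypdf package, which may not be the library this project uses, and the cleanup step in `clean_page_text` (a hypothetical helper) is only an illustration of typical post-processing.

```python
import re


def clean_page_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def extract_pdf_text(path: str) -> list[str]:
    """Return one cleaned text string per page of the PDF at `path`."""
    # Assumption: pypdf is installed; the project may use another library.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Image-only pages may yield no text, so fall back to an empty string.
    return [clean_page_text(page.extract_text() or "") for page in reader.pages]
```

Because this approach reads the text layer directly, headings, columns and tables arrive as plain text without layout information, which is the trade-off against OCR noted in the list above.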
Semantic Chunking
To improve how the text is split, we implemented a different approach that tries to identify chunk boundaries based on semantics. Further explanation and code can be found here
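The general idea can be illustrated with a small sketch: compare each sentence with the next one and start a new chunk where the similarity drops. This is not the project's implementation. The bag-of-words vectors are a toy stand-in for sentence embeddings from a model, and the function names and the `threshold` value are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Split text where adjacent sentences are less similar than threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < threshold:
            chunks.append([cur])  # similarity dropped: start a new chunk
        else:
            chunks[-1].append(cur)
    return [" ".join(chunk) for chunk in chunks]


text = (
    "A heap is a tree. A heap keeps the smallest key at the root. "
    "Photosynthesis converts light into sugar. "
    "Plants use light for photosynthesis."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

With real embeddings the same loop applies; only `bag_of_words` and the threshold would need to change.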
Models
- We tested three models using an open-source lecture set from MIT -> Lecture Notes
- The models are:
- The models are compared and evaluated based on the following categories
- Evaluation and test results can be found here
Collaborators
- Lars Kurschilgen
- Nicholas Link
- Alexander Paul
- Adrian Setz
- Lucas Wätzig
- Jan Wolter
Requirements
- Clone the repository
- Create a virtual Python environment -> Link
- Install the required packages listed in requirements.txt
- The application is built using Streamlit. To run the app, execute the following command in the project's directory
streamlit run app.py
- The app will open in a new browser tab. If it does not, follow the link displayed in your terminal