J-Gann / medical-rag-chatbot

Project Repository for the class "Natural Language Processing with Transformers" 2023 at Heidelberg University

Medical Chatbot using finetuned LLM and RAG on Pubmed dataset

Name               GitHub Handle   E-Mail                          Course of Study             Matriculation Number
Jonas Gann         @J-Gann         gann@stud.uni-heidelberg.de     Data and Computer Science   3367576
Christian Teutsch  @chTeut         christian.teutsch@outlook.de    Data and Computer Science   3729420
Saif Mandour       @saifmandour    saifmandour@gmail.com           Computer Science            4189231

Advisor: Robin Khanna

This repository contains a medical chatbot using a finetuned LLM and RAG system on a Pubmed dataset.

See the Documentation for more information.

Installation and Running

Requirements

Setup

Starting

Access

Repository Overview

The main components of the repository are:

User Interface: ChatUI^1

Here you can find the user interface for the chatbot, which is based on the open-source ChatUI project. We extended the project with a RAG system that inserts the content of scientific papers retrieved from the Pubmed dataset into the prompt.

RAG System: RAG

Here you can find the RAG system used by the chatbot. It provides an endpoint that the chat-ui queries for papers relevant to the user's question. The retrieved papers are inserted into the user prompt in buildPrompt.ts.
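The insertion step can be illustrated with a minimal Python sketch. This is not the repository's code (the actual logic lives in the TypeScript file buildPrompt.ts). The function name build_prompt_with_papers, the Paper fields, and the prompt wording are all hypothetical; the sketch only shows the general idea of placing retrieved paper content in front of the user's question.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Paper:
    """A retrieved paper; the field names are assumptions for this sketch."""
    title: str
    abstract: str


def build_prompt_with_papers(question: str, papers: list[Paper]) -> str:
    """Prepend the content of retrieved papers to the user's question."""
    if not papers:
        # Nothing was retrieved: pass the question through unchanged.
        return question
    # Number each paper so the model can refer to individual sources.
    context = "\n\n".join(
        f"[{i}] {p.title}\n{p.abstract}" for i, p in enumerate(papers, start=1)
    )
    return (
        "Use the following papers to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    papers = [Paper("Example title", "Example abstract text.")]
    print(build_prompt_with_papers("What is hypertension?", papers))
```

Returning the question unchanged when no papers are found keeps the chatbot usable even if retrieval yields nothing.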

Data Preprocessing: Preprocessing

This notebook contains the code we used to retrieve and process the Pubmed dataset and to upload embeddings of the papers to the Pinecone vector store.
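As a rough illustration of the upload step, the sketch below groups (id, embedding, metadata) records into fixed-size batches, a common way to feed a vector store. It is not taken from the notebook: the helper name batched, the record layout, and the batch sizes are assumptions, and the actual call to Pinecone is only indicated in a comment.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

# (id, embedding vector, metadata); this layout is an assumption for the sketch.
Record = Tuple[str, List[float], Dict[str, str]]


def batched(records: Iterable[Record], batch_size: int = 100) -> Iterator[List[Record]]:
    """Yield successive lists of at most `batch_size` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Toy 3-dimensional vectors stand in for real paper embeddings.
    records = [
        (f"paper-{i}", [0.0, 0.1 * i, 1.0], {"title": f"Paper {i}"})
        for i in range(5)
    ]
    for batch in batched(records, batch_size=2):
        # A real upload would pass each batch to the vector store client here,
        # for example an upsert call on a Pinecone index object.
        print([record_id for record_id, _, _ in batch])
```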

System Evaluation: Evaluation

This folder contains the code and results of the evaluation of the chatbot system.

Meetings: Meetings

This folder contains the notes of the meetings we had during the project.

Notes: Notes

This folder contains the notes we took during the project.

Opensearch: Opensearch

The OpenSearch vector database runs on localhost. The pubmed_preprocessing.ipynb notebook preprocesses the PubMed data, creates an index, and bulk loads the data into the database. It also provides an index mapping that enables k-NN search; the search can be tested in the last code section of the notebook. The same code is used in opensearchEndpoint.py. To use OpenSearch instead of the Pinecone vector database, modify the .env file: in the Rag attribute of the MODELS variable, set "vectorStoreType": "opensearch" and "url": "http://127.0.0.1:9300".
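For orientation, here is a minimal sketch of what a k-NN index mapping, a k-NN search body, and the two configuration settings could look like. It builds plain dictionaries only and does not contact a server. The field names ("embedding", "title", "abstract"), the vector dimension of 768, and the enclosing key name "rag" are assumptions for illustration; the notebook and opensearchEndpoint.py may differ.

```python
import json
from typing import Any, Dict, List


def knn_index_body(dimension: int, field: str = "embedding") -> Dict[str, Any]:
    """Index settings and mapping that enable k-NN search on one vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # Vector field searched by the k-NN query.
                field: {"type": "knn_vector", "dimension": dimension},
                # Text fields returned with each hit (assumed names).
                "title": {"type": "text"},
                "abstract": {"type": "text"},
            }
        },
    }


def knn_query_body(vector: List[float], k: int = 3, field: str = "embedding") -> Dict[str, Any]:
    """Search body that asks for the k nearest neighbours of `vector`."""
    return {"size": k, "query": {"knn": {field: {"vector": vector, "k": k}}}}


# The two settings named in the text; the key name "rag" is an assumption.
RAG_FRAGMENT = {
    "rag": {"vectorStoreType": "opensearch", "url": "http://127.0.0.1:9300"}
}

if __name__ == "__main__":
    print(json.dumps(knn_index_body(dimension=768), indent=2))
    print(json.dumps(knn_query_body([0.1, 0.2, 0.3], k=3)))
    print(json.dumps(RAG_FRAGMENT))
```

The query vector must have the same dimension as the one declared in the mapping, so both values should come from the embedding model in use.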