The BytesIO class from the standard io module and the PdfReader class from the PyPDF2 library are imported to handle the PDF file.
CharacterTextSplitter, OpenAIEmbeddings, FAISS, load_qa_chain, OpenAI, and get_openai_callback are imported from the langchain library. These classes and helper functions are used to build the question-answering system.
The OpenAI API key must be available in the OPENAI_API_KEY environment variable. It can be set with the os module; the listing below does not set it itself, so set the variable before launching the app.
The st.set_page_config and st.header functions from the streamlit library are used to set the title and header of the web app.
The bytes of the PDF file are stored in a BytesIO object so that PdfReader can read them. The listing below downloads the file with requests.get; a local file would instead be opened with the open function in binary mode using the rb flag.
The text content of the PDF file is extracted by calling the extract_text method on each page in pdf_reader.pages. The text of all pages is concatenated into a single string.
The CharacterTextSplitter class is used to split the text into smaller chunks. These chunks are used to build a knowledge base for the question-answering system.
The OpenAIEmbeddings class is used to generate embeddings for the text chunks. FAISS.from_texts stores these embeddings in a vector index (the knowledge_base variable), which is used to perform similarity searches when answering questions.
The st.text_input function is used to prompt the user to ask a question about the PDF file.
If the user enters a question, the similarity search is performed using the knowledge_base.similarity_search method, which returns the chunks most relevant to the question. The load_qa_chain function is then called with the OpenAI language model to create a question-answering chain.
The run method of the question-answering chain is called with the input documents and user question as arguments. The result is stored in the response variable.
The result is displayed using the st.write function.
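To make the chunk_size and chunk_overlap settings concrete, here is a minimal sketch of overlapping chunks using a plain sliding window over characters. This is an illustration only: the function name split_into_chunks is hypothetical, and CharacterTextSplitter itself splits on the separator first and then merges pieces, so its chunk boundaries will generally differ.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Illustrative sliding-window splitter (not LangChain's algorithm).

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.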
The code: app.py
from io import BytesIO

import requests
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback


def main():
    st.set_page_config(page_title="Ask your PDF")
    st.header("Ask your PDF 💬")

    # load the PDF file (placeholder URL; replace it with a real PDF)
    url = 'https://www.example.com/example.pdf'
    pdf_response = requests.get(url)
    pdf_response.raise_for_status()  # stop early if the download failed
    pdf = BytesIO(pdf_response.content)

    # extract the text
    pdf_reader = PdfReader(pdf)
    text = ""
    for page in pdf_reader.pages:
        # guard against pages that yield no text
        text += page.extract_text() or ""

    # split into chunks
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)

    # create embeddings and index them in a FAISS vector store
    embeddings = OpenAIEmbeddings()
    knowledge_base = FAISS.from_texts(chunks, embeddings)

    # show user input
    user_question = st.text_input("Ask a question about your PDF:")
    if user_question:
        # retrieve the chunks most similar to the question
        docs = knowledge_base.similarity_search(user_question)

        llm = OpenAI()
        chain = load_qa_chain(llm, chain_type="stuff")
        with get_openai_callback() as cb:
            response = chain.run(input_documents=docs, question=user_question)
            print(cb)  # token usage and cost, printed to the terminal

        st.write(response)


if __name__ == '__main__':
    main()
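The similarity search step can be pictured with a small, dependency-free sketch: each chunk and the question are represented as vectors, and the chunks whose vectors are closest to the question's vector are returned. The function names and the toy two-dimensional vectors below are hypothetical, and cosine similarity is used only for illustration; the metric FAISS actually applies depends on how the index is configured.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: Sequence[float],
                 chunk_vecs: Sequence[Sequence[float]],
                 k: int = 2) -> List[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

In the app, real embeddings have many more dimensions and the ranking is done by the vector store rather than by a Python sort, but the idea of "nearest vectors win" is the same.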
Autoload the PDF from a local file
With this feature, we remove the file input and load the PDF automatically.
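As a rough sketch of what loading from a local file might look like, the helper below opens a file in binary mode and wraps its bytes in a BytesIO object. The name load_pdf_bytes and any path passed to it are hypothetical and not taken from the original app.

```python
from io import BytesIO


def load_pdf_bytes(path: str) -> BytesIO:
    """Read a local file in binary mode and return its contents as BytesIO."""
    with open(path, "rb") as pdf_file:
        return BytesIO(pdf_file.read())
```

The returned object could then be passed to PdfReader in place of the downloaded bytes, for example PdfReader(load_pdf_bytes("example.pdf")), where example.pdf stands in for a real file next to app.py.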
Explanation:
The code: app.py
Run & Test
streamlit run .\app.py