Closed andyzhangwp closed 3 weeks ago
Hey @andyzhangwp! 👋 I'm here to help you with any bugs, questions, or contributions you have while we wait for a human maintainer. Let's squash those bugs together!
Yes, the program can automatically skip the download step and avoid the exception if the ms-marco-TinyBERT-L-2-v2.zip file is manually downloaded and copied to the server. The file needs to be copied to the directory where the program expects to find it, which is the current working directory of the script.
The relevant code snippet indicates that the program uses the FlashrankRerank model with ms-marco-TinyBERT-L-2-v2:
    compressor = FlashrankRerank(model="ms-marco-TinyBERT-L-2-v2", top_n=20)
Ensure that the ms-marco-TinyBERT-L-2-v2.zip file is placed in the current working directory of the script to avoid the download step and the associated exception [1].
To continue talking to Dosu, mention @dosu.
I tried the following steps, but the problem persists:
Step 1: Copy ms-marco-TinyBERT-L-2-v2.zip to these directories: /usr/local/Quivr/backend/models and /usr/local/Quivr/backend/modules/brain/rags.
Step 2: Restart the application with docker-compose up --build.
Which directory should I copy the file to?
To ensure the application recognizes the manually downloaded ms-marco-TinyBERT-L-2-v2.zip file and bypasses the download step, you should place the file in the directory where the application expects to find it. Based on the provided information, the current working directory of the script is likely the most appropriate location.
Here are the steps you should follow:
1. Copy the ms-marco-TinyBERT-L-2-v2.zip file to the current working directory of the script. This is typically the root directory of the backend service in your Docker container.
2. Modify the _create_llm method in the QuivrRAG class to ensure it uses the manually placed model file. You can adjust the api_base or the model parameter to point to the local path of the manually placed model file.
Here is the relevant part of the code that needs to be modified:
    def _create_llm(
        self,
        callbacks,
        model,
        streaming=False,
        temperature=0,
    ) -> BaseLLM:
        """
        Create a LLM with the given parameters
        """
        if streaming and callbacks is None:
            raise ValueError(
                "Callbacks must be provided when using streaming language models"
            )

        api_base = None
        if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
            api_base = (
                self.brain_settings.ollama_api_base_url  # pyright: ignore reportPrivateUsage=none
            )

        return ChatLiteLLM(
            temperature=temperature,
            max_tokens=self.max_tokens,
            model=model,
            streaming=streaming,
            verbose=False,
            callbacks=callbacks,
            api_base=api_base,
        )  # pyright: ignore reportPrivateUsage=none
Then rebuild and restart the application:
    docker-compose up --build
By following these steps, you should be able to bypass the download step and avoid the exception related to the connection timeout [1].
I modified the code as follows, but the issue persists:
    def _create_llm(
        self,
        callbacks,
        model="/code/ms-marco-TinyBERT-L-2-v2.zip",
        streaming=False,
        temperature=0,
    ) -> BaseLLM:
        """
        Create a LLM with the given parameters
        """
        if streaming and callbacks is None:
            raise ValueError(
                "Callbacks must be provided when using streaming language models"
            )

        api_base = None
        if self.brain_settings.ollama_api_base_url:  # and model.startswith("ollama"):
            api_base = self.brain_settings.ollama_api_base_url
The issue might be due to the way the model path is being handled. The model parameter in the _create_llm method is typically expected to be a model name rather than a file path. To ensure the application uses the local model file, you need to ensure that the model loading mechanism is correctly configured to recognize and use the local file.
Here are the steps to correctly configure the local model file:
1. Place the Model File in the Correct Directory: Ensure that the ms-marco-TinyBERT-L-2-v2.zip file is placed in a directory that the application can access. A common practice is to place it in a directory like /code/models/.
2. Modify the Model Loading Mechanism: Instead of modifying the _create_llm method directly, you should ensure that the model loading mechanism within the application is configured to use the local file. This might involve modifying the part of the code where the model is loaded.
3. Update the Environment Configuration: Ensure that the environment variables and configuration files are set up to point to the local model file.
Here is an example of how you might modify the code to load the model from a local file:
    import os  # needed for the os.path.exists check below

    def _create_llm(
        self,
        callbacks,
        model="ms-marco-TinyBERT-L-2-v2",
        streaming=False,
        temperature=0,
    ) -> BaseLLM:
        """
        Create a LLM with the given parameters
        """
        if streaming and callbacks is None:
            raise ValueError(
                "Callbacks must be provided when using streaming language models"
            )

        api_base = None
        if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
            api_base = self.brain_settings.ollama_api_base_url

        # Check if the model file exists locally
        local_model_path = "/code/models/ms-marco-TinyBERT-L-2-v2.zip"
        if os.path.exists(local_model_path):
            model = local_model_path

        return ChatLiteLLM(
            temperature=temperature,
            max_tokens=self.max_tokens,
            model=model,
            streaming=streaming,
            verbose=False,
            callbacks=callbacks,
            api_base=api_base,
        )
Then rebuild and restart the application:
    docker-compose up --build
By ensuring that the model file is correctly placed and the application is configured to use the local file, you should be able to bypass the download step and avoid the connection timeout issue.
We don't need to modify the code. We just need to copy the downloaded model file to an accessible location, specifically /tmp in the Docker application container, and decompress it there. This allows chatting with the brain. However, this seems to be very resource intensive, with high CPU, IOPS, and BPS when running on a 2-core 8 GB server, and a single chat takes more than 10 minutes. Is there any way to optimize it?
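The copy-and-decompress fix reported above can be scripted. This is only a minimal sketch: it assumes the archive unpacks to a directory named after the model and that /tmp is the directory the reranker checks, as reported in this thread. The function install_local_model and its parameters are hypothetical names, not part of Quivr or FlashRank.

```python
import zipfile
from pathlib import Path


def install_local_model(zip_path: str, cache_dir: str, model_name: str) -> Path:
    """Unpack a manually downloaded model archive into cache_dir.

    Returns the directory expected to hold the model files. Extraction is
    skipped when that directory already exists.
    """
    model_dir = Path(cache_dir) / model_name
    if model_dir.exists():
        return model_dir  # already unpacked, nothing to do

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(cache_dir)

    # Assumption: the archive contains a top-level folder named after the model.
    if not model_dir.is_dir():
        raise FileNotFoundError(
            f"{zip_path} did not unpack to {model_dir}; check the archive layout"
        )
    return model_dir
```

For the setup in this thread the call would be install_local_model("/tmp/ms-marco-TinyBERT-L-2-v2.zip", "/tmp", "ms-marco-TinyBERT-L-2-v2"), run inside the backend container before the first chat.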
To optimize the performance of the Ollama model running on a local server, you can follow these strategies:
Enable Streaming: Streaming can help reduce response times by sending partial results as they are generated. Ensure that the streaming parameter is set to True.
    def _determine_streaming(self, streaming: bool) -> bool:
        """If the model name allows for streaming and streaming is declared, set streaming to True."""
        return streaming

    def _determine_callback_array(
        self, streaming
    ) -> List[AsyncIteratorCallbackHandler]:
        """If streaming is set, set the AsyncIteratorCallbackHandler as the only callback."""
        if streaming:
            return [AsyncIteratorCallbackHandler()]
        else:
            return []
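To make the streaming idea concrete, here is a small sketch using only the standard library. It does not use Quivr or LangChain classes; generate_tokens and consume are hypothetical stand-ins. Note that streaming changes when the first tokens reach the client, not the total amount of computation.

```python
import asyncio
from typing import AsyncIterator, List


async def generate_tokens(words: List[str]) -> AsyncIterator[str]:
    """Stand-in for a streaming model: yields one token at a time."""
    for word in words:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word


async def consume(words: List[str]) -> List[str]:
    """Collect tokens as they arrive instead of waiting for the full answer."""
    received = []
    async for token in generate_tokens(words):
        received.append(token)  # a real handler would forward this to the client
    return received


if __name__ == "__main__":
    print(asyncio.run(consume(["Streaming", "sends", "partial", "results"])))
```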
Adjust Model Parameters: Tuning parameters such as temperature and max_tokens can help optimize performance. Lowering the max_tokens can reduce the computational load.
    class HeadlessQA(BaseModel, QAInterface):
        brain_settings = BrainSettings()
        model: str
        temperature: float = 0.0
        max_tokens: int = 2000
        streaming: bool = False
        chat_id: str
        callbacks: Optional[List[AsyncIteratorCallbackHandler]] = None
        prompt_id: Optional[UUID] = None
Use Efficient Models: Ensure that the model being used is optimized for performance. ChatLiteLLM is only a thin client wrapper, so the speed depends on the model selected through the model parameter.
    def _create_llm(
        self,
        model,
        temperature=0,
        streaming=False,
        callbacks=None,
    ) -> BaseChatModel:
        """
        Determine the language model to be used.
        :param model: Language model name to be used.
        :param streaming: Whether to enable streaming of the model
        :param callbacks: Callbacks to be used for streaming
        :return: Language model instance
        """
        api_base = None
        if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
            api_base = self.brain_settings.ollama_api_base_url

        return ChatLiteLLM(
            temperature=temperature,
            model=model,
            streaming=streaming,
            verbose=True,
            callbacks=callbacks,
            max_tokens=self.max_tokens,
            api_base=api_base,
        )
Optimize Chat History Handling: Efficiently format and handle chat history to reduce overhead.
    def generate_answer(
        self, chat_id: UUID, question: ChatQuestion, save_answer: bool = True
    ) -> GetChatHistoryOutput:
        transformed_history = format_chat_history(
            chat_service.get_chat_history(self.chat_id)
        )
        prompt_content = (
            self.prompt_to_use.content if self.prompt_to_use else SYSTEM_MESSAGE
        )

        messages = format_history_to_openai_mesages(
            transformed_history, prompt_content, question.question
        )
        answering_llm = self._create_llm(
            model=self.model,
            streaming=False,
            callbacks=self.callbacks,
        )
        model_prediction = answering_llm.predict_messages(messages)
        answer = model_prediction.content

        if save_answer:
            new_chat = chat_service.update_chat_history(
                CreateChatHistory(
                    **{
                        "chat_id": chat_id,
                        "user_message": question.question,
                        "assistant": answer,
                        "brain_id": None,
                        "prompt_id": self.prompt_to_use_id,
                    }
                )
            )

            return GetChatHistoryOutput(
                **{
                    "chat_id": chat_id,
                    "user_message": question.question,
                    "assistant": answer,
                    "message_time": new_chat.message_time,
                    "prompt_title": (
                        self.prompt_to_use.title if self.prompt_to_use else None
                    ),
                    "brain_name": None,
                    "message_id": new_chat.message_id,
                }
            )
Docker Resource Limits: Ensure that your Docker container has appropriate resource limits set. You can specify CPU and memory limits in your Docker Compose file.
    services:
      quivr-backend:
        deploy:
          resources:
            limits:
              cpus: '1.0'
              memory: 2700M
Parallel Processing: Utilize parallel processing and multi-threading to optimize resource usage and response times. The quivr repository uses Celery and Uvicorn for this purpose.
    services:
      quivr-backend:
        command: uvicorn main:app --reload --host 0.0.0.0 --port 5050 --workers 6 --log-level info
      quivr-worker:
        command: celery -A celery_worker worker -l info
      quivr-beat:
        command: celery -A celery_worker beat -l info
By implementing these strategies, you can optimize the performance of the Ollama model running on a local server [1][2].
Hi, after optimizing using the above methods, there hasn't been much improvement in performance. Perhaps a more powerful server is needed to run smoothly. I think that if we switch from Llama 7B to a smaller model such as Qwen, performance may be better. So I ran the qwen:1.8b model locally with ollama run qwen:1.8b and changed the brain model from ollama/llama2 to ollama/qwen:1.8b (changing it to ollama/qwen reports that the model cannot be found). After that, I @-mentioned the brain and had a conversation, but found that the answer content did not come from the PDF document uploaded to the brain, which is not the expected behavior. How should we solve this problem?
To resolve the issue of the answers not being derived from the uploaded PDF document after switching to the qwen:1.8b model, you should ensure that the similarity_search function in the CustomSupabaseVectorStore is correctly configured to work with the new model. The issue might be due to the embeddings or the vector store configuration not being compatible with the new model.
Here is the relevant part of the code that performs the similarity search:
    vector_store = CustomSupabaseVectorStore(
        supabase_client,
        embeddings,
        table_name="vectors",
        brain_id=str(brain_id),
        number_docs=20,
    )
    documents = vector_store.similarity_search(question, k=20, threshold=0.8)
Make sure that:
- The embeddings are generated using the qwen:1.8b model.
- The CustomSupabaseVectorStore is properly initialized with these embeddings.
- The similarity_search function is compatible with the new model's embeddings.
Additionally, ensure that the PDF processing is correctly handled by the UnstructuredPDFLoader as shown in the process_pdf function:
    from langchain_community.document_loaders import UnstructuredPDFLoader
    from models import File

    from .common import process_file


    def process_pdf(
        file: File, brain_id, original_file_name, integration=None, integration_link=None
    ):
        return process_file(
            file=file,
            loader_class=UnstructuredPDFLoader,
            brain_id=brain_id,
            original_file_name=original_file_name,
            integration=integration,
            integration_link=integration_link,
        )
Ensure that the PDF documents are being correctly processed and loaded into the vector store using the UnstructuredPDFLoader. This will help in making sure that the documents are available for the similarity search and the answers are derived correctly.
Additionally, you should verify that the retrieval and answer generation process is correctly set up in the QuivrRAG class. Here is an example of how the documents are retrieved and the final answer is generated:
    compressor = None
    if os.getenv("COHERE_API_KEY"):
        compressor = CohereRerank(top_n=20)
    else:
        compressor = FlashrankRerank(model="ms-marco-TinyBERT-L-2-v2", top_n=20)

    retriever_doc = self.get_retriever()
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=retriever_doc
    )

    loaded_memory = RunnablePassthrough.assign(
        chat_history=RunnableLambda(
            lambda x: self.filter_history(x["chat_history"]),
        ),
        question=lambda x: x["question"],
    )

    api_base = None
    if self.brain_settings.ollama_api_base_url and self.model.startswith("ollama"):
        api_base = self.brain_settings.ollama_api_base_url

    standalone_question = {
        "standalone_question": {
            "question": lambda x: x["question"],
            "chat_history": itemgetter("chat_history"),
        }
        | CONDENSE_QUESTION_PROMPT
        | ChatLiteLLM(temperature=0, model=self.model, api_base=api_base)
        | StrOutputParser(),
    }

    prompt_custom_user = self.prompt_to_use()
    prompt_to_use = "None"
    if prompt_custom_user:
        prompt_to_use = prompt_custom_user.content

    # Now we retrieve the documents
    retrieved_documents = {
        "docs": itemgetter("standalone_question") | compression_retriever,
        "question": lambda x: x["standalone_question"],
        "custom_instructions": lambda x: prompt_to_use,
    }

    final_inputs = {
        "context": lambda x: self._combine_documents(x["docs"]),
        "question": itemgetter("question"),
        "custom_instructions": itemgetter("custom_instructions"),
        "files": lambda x: list_files,
    }
    llm = ChatLiteLLM(
        max_tokens=self.max_tokens,
        model=self.model,
        temperature=self.temperature,
        api_base=api_base,
    )  # pyright: ignore reportPrivateUsage=none
    if self.model_compatible_with_function_calling():
        # And finally, we do the part that returns the answers
        llm_function = ChatOpenAI(
            max_tokens=self.max_tokens,
            model=self.model,
            temperature=self.temperature,
        )
        llm = llm_function.bind_tools(
            [cited_answer],
            tool_choice="cited_answer",
        )

    answer = {
        "answer": final_inputs | ANSWER_PROMPT | llm,
        "docs": itemgetter("docs"),
    }

    return loaded_memory | standalone_question | retrieved_documents | answer
Ensure that the retrieval and answer generation process is correctly set up and that the documents are being correctly retrieved and combined to form the context for the final answer. This will help in making sure that the answers are derived correctly from the uploaded PDF documents [1][2][3].
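A quick way to check the "combined to form the context" step is to look at what the answer prompt would actually receive. The following is a minimal sketch with hypothetical stand-ins (Doc, combine_documents, diagnose_context); it is not the Quivr implementation of _combine_documents, and the separator is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str


def combine_documents(docs: List[Doc], separator: str = "\n\n") -> str:
    """Join retrieved chunks into the context string placed in the prompt."""
    return separator.join(doc.page_content for doc in docs)


def diagnose_context(docs: List[Doc]) -> str:
    """Report whether the answer step will see any document text."""
    context = combine_documents(docs)
    if not context.strip():
        return "empty context: the model can only answer from its own knowledge"
    return f"context has {len(docs)} chunk(s), {len(context)} characters"
```

If the equivalent check in the real chain reports an empty context, the problem lies in retrieval or ingestion rather than in the chat model.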
Does this mean that the uploaded PDF file was not processed correctly, even before switching models, given that no records were written to the vectors or brains_vectors tables? How should this situation be resolved?
Yes, if no records are written to the vectors and brains_vectors tables, it likely means that the uploaded PDF file has not been processed correctly. To resolve this situation, you should ensure that the PDF file is processed correctly and the vectors are recorded in the respective tables.
Here are the steps to ensure proper processing:
Use the process_pdf Function:
The process_pdf function in backend/packages/files/parsers/pdf.py should be used to process the PDF file. This function utilizes the UnstructuredPDFLoader to load the PDF and then processes it using the process_file function.
    from langchain_community.document_loaders import UnstructuredPDFLoader
    from models import File

    from .common import process_file


    def process_pdf(
        file: File, brain_id, original_file_name, integration=None, integration_link=None
    ):
        return process_file(
            file=file,
            loader_class=UnstructuredPDFLoader,
            brain_id=brain_id,
            original_file_name=original_file_name,
            integration=integration,
            integration_link=integration_link,
        )
Ensure Proper Implementation of process_file:
The process_file function in backend/packages/files/parsers/common.py handles the detailed processing of the file, including splitting the document into chunks, encoding the text, and creating vectors.
    import asyncio
    import os
    import tempfile
    import time

    import nest_asyncio
    import tiktoken
    import uvloop
    from langchain.schema import Document
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from llama_parse import LlamaParse
    from logger import get_logger
    from models import File
    from modules.brain.service.brain_vector_service import BrainVectorService
    from modules.upload.service.upload_file import DocumentSerializable
    from packages.embeddings.vectors import Neurons

    if not isinstance(asyncio.get_event_loop(), uvloop.Loop):
        nest_asyncio.apply()

    logger = get_logger(__name__)


    async def process_file(
        file: File,
        loader_class,
        brain_id,
        original_file_name,
        integration=None,
        integration_link=None,
    ):
        dateshort = time.strftime("%Y%m%d")
        neurons = Neurons()

        if os.getenv("LLAMA_CLOUD_API_KEY"):
            doc = file.file
            document_ext = os.path.splitext(doc.filename)[1]
            if document_ext in [".pdf", ".docx", ".doc"]:
                document_tmp = tempfile.NamedTemporaryFile(
                    suffix=document_ext, delete=False
                )
                doc.file.seek(0)
                document_tmp.write(doc.file.read())

                parser = LlamaParse(
                    result_type="markdown",
                    parsing_instruction="Extract the tables and transform checkboxes into text. Transform tables to key = value. You can duplicates Keys if needed. For example: Productions Fonts = 300 productions Fonts Company Desktop License = Yes for Maximum of 60 Licensed Desktop users For example checkboxes should be: Premium Activated = Yes License Premier = No If a checkbox is present for a table with multiple options. Say Yes for the one activated and no for the one not activated. Format using headers.",
                    gpt4o_mode=True,
                    gpt4o_api_key=os.getenv("OPENAI_API_KEY"),
                )

                document_llama_parsed = parser.load_data(document_tmp.name)
                document_tmp.close()
                document_to_langchain = document_llama_parsed[0].to_langchain_format()
                text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
                    chunk_size=file.chunk_size, chunk_overlap=file.chunk_overlap
                )
                document_to_langchain = Document(
                    page_content=document_to_langchain.page_content
                )
                file.documents = text_splitter.split_documents([document_to_langchain])
        else:
            file.compute_documents(loader_class)

        metadata = {
            "file_sha1": file.file_sha1,
            "file_size": file.file_size,
            "file_name": file.file_name,
            "chunk_size": file.chunk_size,
            "chunk_overlap": file.chunk_overlap,
            "date": dateshort,
            "original_file_name": original_file_name or file.file_name,
            "integration": integration or "",
            "integration_link": integration_link or "",
        }
        docs = []

        enc = tiktoken.get_encoding("cl100k_base")

        if file.documents is not None:
            for index, doc in enumerate(file.documents, start=1):
                new_metadata = metadata.copy()
                doc.page_content = f"Filename: {new_metadata['original_file_name']} Content: {doc.page_content}"
                doc.page_content = doc.page_content.replace("\u0000", "")
                len_chunk = len(enc.encode(doc.page_content))
                doc.page_content = doc.page_content.encode("utf-8", "replace").decode("utf-8")
                new_metadata["chunk_size"] = len_chunk
                new_metadata["index"] = index
                doc_with_metadata = DocumentSerializable(
                    page_content=doc.page_content, metadata=new_metadata
                )
                docs.append(doc_with_metadata)

        created_vector = neurons.create_vector(docs)

        brain_vector_service = BrainVectorService(brain_id)
        for created_vector_id in created_vector:
            result = brain_vector_service.create_brain_vector(
                created_vector_id, metadata["file_sha1"]
            )
            logger.debug(f"Brain vector created: {result}")

        if created_vector:
            return len(created_vector)
        else:
            return 0
Check the Neurons Class:
Ensure that the Neurons class in backend/packages/embeddings/vectors.py is correctly creating vectors. The create_vector method should add documents to the vector store and return the resulting IDs.
    class Neurons(BaseModel):
        def create_vector(self, docs):
            documents_vector_store = get_documents_vector_store()

            try:
                sids = documents_vector_store.add_documents(docs)
                if sids and len(sids) > 0:
                    return sids
            except Exception as e:
                logger.error(f"Error creating vector for document {e}")
Verify Database Interactions:
Ensure that the BrainsVectors class in backend/modules/brain/repository/brains_vectors.py is correctly interacting with the vectors and brains_vectors tables. The create_brain_vector method should correctly insert records into the brains_vectors table.
    class BrainsVectors(BrainsVectorsInterface):
        def __init__(self):
            supabase_client = get_supabase_client()
            self.db = supabase_client

        def create_brain_vector(self, brain_id, vector_id, file_sha1):
            response = (
                self.db.table("brains_vectors")
                .insert(
                    {
                        "brain_id": str(brain_id),
                        "vector_id": str(vector_id),
                        "file_sha1": file_sha1,
                    }
                )
                .execute()
            )
            return response.data
By following these steps, you can ensure that the PDF file is processed correctly and the vectors are recorded in the vectors and brains_vectors tables [1][2][3][4].
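The chunk-and-annotate step that process_file performs can be tried in isolation. This is a simplified sketch under stated assumptions: it splits by characters rather than by tiktoken tokens, and split_with_overlap and attach_metadata are hypothetical helpers, not functions from Quivr or LangChain.

```python
from typing import Dict, List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character chunks where neighbours share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def attach_metadata(chunks: List[str], file_name: str) -> List[Dict]:
    """Pair each chunk with a 1-based index and its length."""
    return [
        {
            "page_content": f"Filename: {file_name} Content: {chunk}",
            "metadata": {
                "index": index,
                "chunk_size": len(chunk),
                "original_file_name": file_name,
            },
        }
        for index, chunk in enumerate(chunks, start=1)
    ]
```

An empty list from the splitting step means nothing reaches the vector store, which matches the symptom of empty tables.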
The process_file function in backend/packages/files/parsers/common.py:
The issue with the process_file function not entering the condition if file.documents is not None is indeed related to the environment variable LLAMA_CLOUD_API_KEY. The function checks for the presence of this environment variable to decide whether to process the file using the LlamaParse parser or to use the compute_documents method.
To ensure the documents are processed and vectorized correctly, you need to set the LLAMA_CLOUD_API_KEY environment variable. You can set this environment variable in your operating system or within your application. Here is an example of how to set it in a Unix-based system:
    export LLAMA_CLOUD_API_KEY=your_api_key_here
Or, if you are using a Python script, you can set it at the beginning of your script:
    import os

    os.environ['LLAMA_CLOUD_API_KEY'] = 'your_api_key_here'
If the issue persists even after setting the environment variable, you should ensure that the file.documents attribute is being correctly assigned within the process_file function. This involves verifying that the text_splitter.split_documents method is functioning as expected and that the file.documents attribute is being populated correctly [1].
After configuring LLAMA_CLOUD_API_KEY, the vectors and brains_vectors tables now have data, but the answer still doesn't come from the brain's PDF file. I printed the log of the QuivrRAG answers, and it shows that the switched model (ollama/qwen:1.8b) is being used; I found no other configuration issues. I am not clear about the specific processing flow of the answer. How can I investigate further and find the real reason why the answer does not come from the brain's PDF file?
Code:
    ......
    print(f"========xxxxxxxxxxxx=====get_chain, loaded_memory:{loaded_memory} , standalone_question: {standalone_question}, retrieved_documents:{retrieved_documents}, answer:{answer}")
    return loaded_memory | standalone_question | retrieved_documents | answer
The log is as follows:
========QuivrRAG.get_chain, loaded_memory: mapper={
backend-core | chat_history: RunnableLambda(lambda x: self.filter_history(x['chat_history'])),
backend-core | question: RunnableLambda(lambda x: x['question'])
backend-core | }, standalone_question: {'standalone_question': {
backend-core | question: RunnableLambda(...),
backend-core | chat_history: RunnableLambda(itemgetter('chat_history'))
backend-core | }
backend-core | | PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language. Keep as many details as possible from previous messages. Keep entity names and all.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:')
backend-core | | ChatLiteLLM(client=<module 'litellm' from '/usr/local/lib/python3.11/site-packages/litellm/__init__.py'>, model='ollama/qwen:1.8b', openai_api_key='tk-aabbccddAABBCCDDEeFfGgHhIiJKLmnopjklMNOPqQqQqQqQ', azure_api_key='', anthropic_api_key='null', replicate_api_key='', cohere_api_key='', openrouter_api_key='', api_base='http://172.17.0.1:11434', temperature=0.0, huggingface_api_key='', together_ai_api_key='')
backend-core | | StrOutputParser()}, retrieved_documents: {'docs': RunnableLambda(itemgetter('standalone_question'))
backend-core | | ContextualCompressionRetriever(base_compressor=FlashrankRerank(client=<flashrank.Ranker.Ranker object at 0x7f23b931af90>, top_n=10, model='ms-marco-TinyBERT-L-2-v2'), base_retriever=VectorStoreRetriever(tags=['CustomSupabaseVectorStore', 'OllamaEmbeddings'], vectorstore=<vectorstore.supabase.CustomSupabaseVectorStore object at 0x7f23d86a1310>), 'question': <function QuivrRAG.get_chain.
To further investigate the problem and find out the real reason why the answer is not coming from the brain's PDF file, you should start by checking the process_file function in the common module. The process_pdf function relies on process_file to handle the PDF processing using the UnstructuredPDFLoader.
Here is the relevant code snippet from backend/packages/files/parsers/pdf.py:
    from langchain_community.document_loaders import UnstructuredPDFLoader
    from models import File

    from .common import process_file


    def process_pdf(
        file: File,
        brain_id,
        original_file_name,
        integration=None,
        integration_link=None,
    ):
        return process_file(
            file=file,
            loader_class=UnstructuredPDFLoader,
            brain_id=brain_id,
            original_file_name=original_file_name,
            integration=integration,
            integration_link=integration_link,
        )
You should verify the following:
- process_file function: Ensure that it correctly processes the file and handles the UnstructuredPDFLoader.
- UnstructuredPDFLoader: Make sure that this loader is correctly implemented and capable of extracting the necessary information from the PDF.
- Logging: Add logging to the process_file function to trace the flow of data and identify where it might be failing.
Additionally, you should inspect the CompositeBrainQA class in backend/modules/brain/composite_brain_qa.py to ensure that the document retrieval and question answering processes are functioning correctly. Here is a relevant snippet:
    class CompositeBrainQA(
        KnowledgeBrainQA,
    ):
        user_id: UUID

        def __init__(
            self,
            model: str,
            brain_id: str,
            chat_id: str,
            streaming: bool = False,
            prompt_id: Optional[UUID] = None,
            **kwargs,
        ):
            user_id = kwargs.get("user_id")
            if not user_id:
                raise HTTPException(status_code=400, detail="Cannot find user id")

            super().__init__(
                model=model,
                brain_id=brain_id,
                chat_id=chat_id,
                streaming=streaming,
                prompt_id=prompt_id,
                **kwargs,
            )
            self.user_id = user_id

        def get_answer_generator_from_brain_type(self, brain: BrainEntity):
            if brain.brain_type == BrainType.COMPOSITE:
                return self.generate_answer
            elif brain.brain_type == BrainType.API:
                return APIBrainQA(
                    brain_id=str(brain.id),
                    chat_id=self.chat_id,
                    model=self.model,
                    max_tokens=self.max_tokens,
                    temperature=self.temperature,
                    streaming=self.streaming,
                    prompt_id=self.prompt_id,
                    user_id=str(self.user_id),
                    raw=brain.raw,
                    jq_instructions=brain.jq_instructions,
                ).generate_answer
            elif brain.brain_type == BrainType.DOC:
                return KnowledgeBrainQA(
                    brain_id=str(brain.id),
                    chat_id=self.chat_id,
                    max_tokens=self.max_tokens,
                    temperature=self.temperature,
                    streaming=self.streaming,
                    prompt_id=self.prompt_id,
                ).generate_answer

        def generate_answer(
            self, chat_id: UUID, question: ChatQuestion, save_answer: bool
        ) -> str:
            brain = brain_service.get_brain_by_id(question.brain_id)

            connected_brains = brain_service.get_connected_brains(self.brain_id)

            if not connected_brains:
                response = HeadlessQA(
                    chat_id=chat_id,
                    model=self.model,
                    max_tokens=self.max_tokens,
                    temperature=self.temperature,
                    streaming=self.streaming,
                    prompt_id=self.prompt_id,
                ).generate_answer(chat_id, question, save_answer=False)
                if save_answer:
                    new_chat = chat_service.update_chat_history(
                        CreateChatHistory(
                            **{
                                "chat_id": chat_id,
                                "user_message": question.question,
                                "assistant": response.assistant,
                                "brain_id": question.brain_id,
                                "prompt_id": self.prompt_to_use_id,
                            }
                        )
                    )
                    return GetChatHistoryOutput(
                        **{
                            "chat_id": chat_id,
                            "user_message": question.question,
                            "assistant": response.assistant,
                            "message_time": new_chat.message_time,
                            "prompt_title": (
                                self.prompt_to_use.title if self.prompt_to_use else None
                            ),
                            "brain_name": brain.name,
                            "message_id": new_chat.message_id,
                            "brain_id": str(brain.id),
                        }
                    )
                return GetChatHistoryOutput(
                    **{
                        "chat_id": chat_id,
                        "user_message": question.question,
                        "assistant": response.assistant,
                        "message_time": None,
                        "prompt_title": (
                            self.prompt_to_use.title if self.prompt_to_use else None
                        ),
                        "brain_name": brain.name,
                        "message_id": None,
                        "brain_id": str(brain.id),
                    }
                )

            tools = []
            available_functions = {}

            connected_brains_details = {}
            for connected_brain_id in connected_brains:
                connected_brain = brain_service.get_brain_by_id(connected_brain_id)
                if connected_brain is None:
                    continue

                tools.append(format_brain_to_tool(connected_brain))

                available_functions[connected_brain_id] = (
                    self.get_answer_generator_from_brain_type(connected_brain)
                )

                connected_brains_details[str(connected_brain.id)] = connected_brain

            CHOOSE_BRAIN_FROM_TOOLS_PROMPT = (
                "Based on the provided user content, find the most appropriate tools to answer"
                + "If you can't find any tool to answer and only then, and if you can answer without using any tool. In that case, let the user know that you are not using any particular brain (i.e tool) "
            )

            messages = [{"role": "system", "content": CHOOSE_BRAIN_FROM_TOOLS_PROMPT}]

            history = chat_service.get_chat_history(self.chat_id)

            for message in history:
                formatted_message = [
                    {"role": "user", "content": message.user_message},
                    {"role": "assistant", "content": message.assistant},
                ]
                messages.extend(formatted_message)

            messages.append({"role": "user", "content": question.question})

            response = completion(
                model="gpt-3.5-turbo-0125",
                messages=messages,
                tools=tools,
                tool_choice="auto",
            )

            brain_completion_output = self.make_recursive_tool_calls(
                messages,
                question,
                chat_id,
                tools,
                available_functions,
                recursive_count=0,
                last_completion_response=response.choices[0],
            )

            if brain_completion_output:
                answer = brain_completion_output.response.message.content
                new_chat = None
                if save_answer:
                    new_chat = chat_service.update_chat_history(
                        CreateChatHistory(
                            **{
                                "chat_id": chat_id,
                                "user_message": question.question,
                                "assistant": answer,
                                "brain_id": question.brain_id,
                                "prompt_id": self.prompt_to_use_id,
                            }
                        )
                    )
                return GetChatHistoryOutput(
                    **{
                        "chat_id": chat_id,
                        "user_message": question.question,
                        "assistant": brain_completion_output.response.message.content,
                        "message_time": new_chat.message_time if new_chat else None,
                        "prompt_title": (
                            self.prompt_to_use.title if self.prompt_to_use else None
                        ),
                        "brain_name": brain.name if brain else None,
                        "message_id": new_chat.message_id if new_chat else None,
                        "brain_id": str(brain.id) if brain else None,
                    }
                )
Ensure that the generate_answer method correctly retrieves and processes the documents from the brain's knowledge base. Verify that the brain_service is correctly fetching the PDF files and that the make_recursive_tool_calls method is functioning as expected.
Additionally, you should inspect the get_question_context_from_brain function in backend/modules/brain/service/get_question_context_from_brain.py to ensure that the vector store is correctly searching for and retrieving relevant documents based on the question. Here is the relevant snippet:
    from uuid import UUID

    from attr import dataclass
    from logger import get_logger
    from models.settings import get_embeddings, get_supabase_client
    from modules.upload.service.generate_file_signed_url import generate_file_signed_url
    from vectorstore.supabase import CustomSupabaseVectorStore

    logger = get_logger(__name__)


    @dataclass
    class DocumentAnswer:
        file_name: str
        file_sha1: str
        file_size: int
        file_url: str = ""
        file_id: str = ""
        file_similarity: float = 0.0


    def get_question_context_from_brain(brain_id: UUID, question: str) -> str:
        """Finds the best brain to answer the question based on the question's meaning.

        Args:
            brain_id (UUID): Id of the brain to search in
            question (str): Question to search for in the vector store

        Returns:
            str: _descripton_
        """
        # TODO: Move to AnswerGenerator service
        supabase_client = get_supabase_client()
        embeddings = get_embeddings()

        vector_store = CustomSupabaseVectorStore(
            supabase_client,
            embeddings,
            table_name="vectors",
            brain_id=str(brain_id),
            number_docs=20,
        )
        documents = vector_store.similarity_search(question, k=20, threshold=0.8)

        answers = []
        file_sha1s = []
        for document in documents:
            if document.metadata["file_sha1"] not in file_sha1s:
                file_sha1s.append(document.metadata["file_sha1"])
                file_path_in_storage = f"{brain_id}/{document.metadata['file_name']}"
                answers.append(
                    DocumentAnswer(
                        file_name=document.metadata["file_name"],
                        file_sha1=document.metadata["file_sha1"],
                        file_size=document.metadata["file_size"],
                        file_id=document.metadata["id"],
                        file_similarity=document.metadata["similarity"],
                        file_url=generate_file_signed_url(file_path_in_storage).get(
                            "signedURL", ""
                        ),
                    ),
                )

        return answers
Ensure that the get_question_context_from_brain function is correctly retrieving documents from the vector store and that the CustomSupabaseVectorStore is functioning as expected. Verify that the documents retrieved are relevant to the question and that the PDF files are being correctly processed and included in the search results [1][2][3][4].
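To check the deduplication step in isolation, without Supabase or embeddings, the loop can be reproduced on plain dictionaries. This is only a minimal sketch: dedupe_by_sha1, the trimmed-down DocumentAnswer, and the sample metadata are hypothetical stand-ins, not part of the Quivr code base.

```python
from dataclasses import dataclass


@dataclass
class DocumentAnswer:
    # Trimmed-down stand-in for the dataclass in the snippet above
    file_name: str
    file_sha1: str
    file_similarity: float = 0.0


def dedupe_by_sha1(chunks_metadata: list[dict]) -> list[DocumentAnswer]:
    """Keep the first chunk seen for each distinct file_sha1."""
    seen: set[str] = set()
    answers: list[DocumentAnswer] = []
    for metadata in chunks_metadata:
        sha1 = metadata["file_sha1"]
        if sha1 in seen:
            continue  # another chunk of a file we already recorded
        seen.add(sha1)
        answers.append(
            DocumentAnswer(
                file_name=metadata["file_name"],
                file_sha1=sha1,
                file_similarity=metadata["similarity"],
            )
        )
    return answers


# Hypothetical search results: two chunks of a.pdf, one chunk of b.pdf
chunks = [
    {"file_name": "a.pdf", "file_sha1": "111", "similarity": 0.91},
    {"file_name": "a.pdf", "file_sha1": "111", "similarity": 0.88},
    {"file_name": "b.pdf", "file_sha1": "222", "similarity": 0.83},
]
result = dedupe_by_sha1(chunks)
```

If the real search returns an empty list, the threshold passed to similarity_search is one place to look, since chunks scoring below it never reach this loop.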
I added print statements at the entry of each function in CompositeBrainQA for debugging. However, even though the front end displayed a reply, none of the new logs were printed, so it looks as if this class was skipped entirely. Is there another code path that handles the request? (The brain type in the current conversation is doc.)
Yes, there is another code path in CompositeBrainQA that can cause this class to be skipped when the brain type is doc. Specifically, in the get_answer_generator_from_brain_type method, if brain.brain_type is BrainType.DOC, the method returns the generate_answer method from the KnowledgeBrainQA class instead of the one from the CompositeBrainQA class.
Here is the relevant code snippet:
def get_answer_generator_from_brain_type(self, brain: BrainEntity):
    if brain.brain_type == BrainType.COMPOSITE:
        return self.generate_answer
    elif brain.brain_type == BrainType.API:
        return APIBrainQA(
            brain_id=str(brain.id),
            chat_id=self.chat_id,
            model=self.model,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
            streaming=self.streaming,
            prompt_id=self.prompt_id,
            user_id=str(self.user_id),
            raw=brain.raw,
            jq_instructions=brain.jq_instructions,
        ).generate_answer
    elif brain.brain_type == BrainType.DOC:
        return KnowledgeBrainQA(
            brain_id=str(brain.id),
            chat_id=self.chat_id,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
            streaming=self.streaming,
            prompt_id=self.prompt_id,
        ).generate_answer
In this code, when brain.brain_type is BrainType.DOC, the generate_answer method from KnowledgeBrainQA is used, effectively skipping the CompositeBrainQA class [1].
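The effect of this dispatch can be reproduced with a small self-contained sketch. The classes below are simplified, hypothetical stand-ins that borrow the names from the snippet above; they only illustrate why a print added to CompositeBrainQA.generate_answer never fires for a doc brain.

```python
from enum import Enum


class BrainType(str, Enum):
    DOC = "doc"
    COMPOSITE = "composite"


class KnowledgeBrainQA:
    def generate_answer(self, question: str) -> str:
        return f"KnowledgeBrainQA answered: {question}"


class CompositeBrainQA:
    def generate_answer(self, question: str) -> str:
        # A debug print placed here is only reached for COMPOSITE brains
        print("CompositeBrainQA.generate_answer called")
        return f"CompositeBrainQA answered: {question}"

    def get_answer_generator_from_brain_type(self, brain_type: BrainType):
        if brain_type == BrainType.COMPOSITE:
            return self.generate_answer
        if brain_type == BrainType.DOC:
            # The returned bound method belongs to a different class
            return KnowledgeBrainQA().generate_answer
        raise ValueError(f"Unsupported brain type: {brain_type}")


generator = CompositeBrainQA().get_answer_generator_from_brain_type(BrainType.DOC)
doc_answer = generator("What is Artemis?")
```

In this sketch the returned callable is bound to a KnowledgeBrainQA instance, so debug output has to be added there to see the doc code path.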
I manually changed the type to doc in the database because the front end offers no way to choose a type when creating a brain; it always defaults to integration. What are the differences between the four brain types (doc, API, composite, integration)? Could this be related to the issue above, where the reply did not originate from the document?
The differences between the four types of brain (doc, API, composite, integration) are as follows:
DOC: This type of brain uses the KnowledgeBrainQA class to generate answers. It is designed to work with documents and knowledge bases to provide responses based on the content of those documents.
elif brain.brain_type == BrainType.DOC:
return KnowledgeBrainQA(
brain_id=str(brain.id),
chat_id=self.chat_id,
max_tokens=self.max_tokens,
temperature=self.temperature,
streaming=self.streaming,
prompt_id=self.prompt_id,
).generate_answer
API: This type of brain uses the APIBrainQA class to generate answers. It is designed to interact with external APIs to fetch and provide responses.
elif brain.brain_type == BrainType.API:
return APIBrainQA(
brain_id=str(brain.id),
chat_id=self.chat_id,
model=self.model,
max_tokens=self.max_tokens,
temperature=self.temperature,
streaming=self.streaming,
prompt_id=self.prompt_id,
user_id=str(self.user_id),
raw=brain.raw,
jq_instructions=brain.jq_instructions,
).generate_answer
COMPOSITE: This type of brain uses the CompositeBrainQA class itself to generate answers. It can combine multiple sources or methods to provide a comprehensive response.
if brain.brain_type == BrainType.COMPOSITE:
return self.generate_answer
INTEGRATION: This type of brain uses the IntegrationBrainQA class to generate answers. It involves integrating various services or data sources to generate responses.
elif brain.brain_type == BrainType.INTEGRATION:
return IntegrationBrainQA(
brain_id=str(brain.id),
chat_id=self.chat_id,
max_tokens=self.max_tokens,
temperature=self.temperature,
streaming=self.streaming,
prompt_id=self.prompt_id,
integration=brain.integration,
integration_description=brain.integration_description,
).generate_answer
Regarding the issue where the reply did not originate from the document, it could be related to the type of brain being used. If a COMPOSITE brain is used, it might combine multiple sources, which could lead to a response that does not strictly originate from the document. Similarly, an API brain might fetch information from an external API, leading to a response that is not document-based. An INTEGRATION brain could also pull data from various integrated services, resulting in a reply that may not be solely document-based.
Additionally, the Brain type definition includes a documents field, which indicates that a brain can have associated documents. If the brain type is not DOC, the response might not be strictly based on these documents, leading to the observed issue [1][2][3][4].
1) To investigate further why the reply did not come from the PDF file in my brain, I added log prints inside the for loop of the generate_source method in knowledge_brain_qa.py. The debugging log is as follows:
==========KnowledgeBrainQA==generator source==Processing source document Plan Artemis Sept 2020.pdf
==========KnowledgeBrainQA===generator source==Skipping source document Plan Artemis Sept
My analysis suggests that this may be related to model_compatible_with_function_calling in QuivrRAG. Do I need to add the models "llama2" and "qwen:1.8b" to this method?
def model_compatible_with_function_calling(self):
    if self.model in [
        "gpt-4o",
        "gpt-4-turbo",
        "gpt-4-turbo-2024-04-09",
        "gpt-4-turbo-preview",
        "gpt-4-0125-preview",
        "gpt-4-1106-preview",
        "gpt-4",
        "gpt-4-0613",
        "gpt-3.5-turbo",
        "gpt-3.5-turbo-0125",
        "gpt-3.5-turbo-1106",
        "gpt-3.5-turbo-0613",
    ]:
        return True
    return False
2) Execution passes through the generate_stream method of knowledge_brain_qa.py (see the log line ...KnowledgeBrainQA======generate_stream 44...). The printed log shows that the answer output here does not originate from the model configured in the brain. How does this asynchronous callback return the answer, and where does the answer come from?
async def generate_stream(
    self, chat_id: UUID, question: ChatQuestion, save_answer: bool = True
) -> AsyncIterable:
    print(f"...")
    conversational_qa_chain = self.knowledge_qa.get_chain()
    transformed_history, streamed_chat_history = (
        self.initialize_streamed_chat_history(chat_id, question)
    )
    response_tokens = ""
    sources = []
    citations = []
    first = True
    config = {"metadata": {"conversation_id": str(chat_id)}}
    async for chunk in conversational_qa_chain.astream(
        {
            "question": question.question,
            "chat_history": transformed_history,
            "custom_personality": (
                self.prompt_to_use.content if self.prompt_to_use else None
            ),
        },
        config=config,
    ):
        if not streamed_chat_history.metadata:
            streamed_chat_history.metadata = {}
        print()
        if self.model_compatible_with_function_calling(model=self.model):
            print(f"...")
            if chunk.get("answer"):
                if first:
                    gathered = chunk["answer"]
                    first = False
                    print(f"...")
                else:
                    gathered = gathered + chunk["answer"]
                if (
                    gathered.tool_calls
                    and gathered.tool_calls[-1].get("args")
                    and "answer" in gathered.tool_calls[-1]["args"]
                ):
                    # Only send the difference between answer and response_tokens, which was the previous answer
                    answer = gathered.tool_calls[-1]["args"]["answer"]
                    difference = answer[len(response_tokens):]
                    streamed_chat_history.assistant = difference
                    response_tokens = answer
                    print(f"...")
                    yield f"data: {json.dumps(streamed_chat_history.dict())}"
                if (
                    gathered.tool_calls
                    and gathered.tool_calls[-1].get("args")
                    and "citations" in gathered.tool_calls[-1]["args"]
                ):
                    citations = gathered.tool_calls[-1]["args"]["citations"]
                    print(f"...")
        else:
            if chunk.get("answer"):
                response_tokens += chunk["answer"].content
                streamed_chat_history.assistant = chunk["answer"].content
                yield f"data: {json.dumps(streamed_chat_history.dict())}"
                print(f"...")
        if chunk.get("docs"):
            sources = chunk["docs"]
            print(f"...")
    sources_list = generate_source(sources, self.brain_id, citations)
    # Serialize the sources list
    serialized_sources_list = [source.dict() for source in sources_list]
    streamed_chat_history.metadata["sources"] = serialized_sources_list
    yield f"data: {json.dumps(streamed_chat_history.dict())}"
    self.save_answer(question, response_tokens, streamed_chat_history, save_answer)
Adding Models to model_compatible_with_function_calling:
Yes, you should add "llama2" and "qwen:1.8b" to the model_compatible_with_function_calling method to ensure that these models are recognized as compatible with function calling. Here is the updated method:
def model_compatible_with_function_calling(self):
    if self.model in [
        "gpt-4o",
        "gpt-4-turbo",
        "gpt-4-turbo-2024-04-09",
        "gpt-4-turbo-preview",
        "gpt-4-0125-preview",
        "gpt-4-1106-preview",
        "gpt-4",
        "gpt-4-0613",
        "gpt-3.5-turbo",
        "gpt-3.5-turbo-0125",
        "gpt-3.5-turbo-1106",
        "gpt-3.5-turbo-0613",
        "llama2",
        "qwen:1.8b",
    ]:
        return True
    return False
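One detail worth checking before editing the list: the membership test is an exact string comparison, so an entry must match the configured model name character for character, including any "ollama/" prefix and without stray spaces. The sketch below uses a hypothetical, shortened list purely to illustrate that behavior; it says nothing about whether a given model actually supports function calling.

```python
# Hypothetical, shortened list used only to illustrate exact matching
FUNCTION_CALLING_MODELS = frozenset(
    {
        "gpt-4o",
        "gpt-3.5-turbo-0125",
        "ollama/qwen:1.8b",
    }
)


def model_compatible_with_function_calling(model: str) -> bool:
    """Return True only when the name matches a listed entry exactly."""
    return model in FUNCTION_CALLING_MODELS


checks = {
    name: model_compatible_with_function_calling(name)
    for name in ("gpt-4o", "ollama/qwen:1.8b", "qwen: 1.8b", "Gpt-4o")
}
```

Here "qwen: 1.8b" (with a space) and "Gpt-4o" (capitalized) are both rejected, which is why the name in the list should be copied from the model value that appears in the logs.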
Understanding the generate_stream Method:
The generate_stream method in knowledge_brain_qa.py is responsible for generating a stream of responses asynchronously. The method uses the conversational_qa_chain to process the question and chat history, and it yields data as chunks of responses are generated.
Here is a breakdown of how the asynchronous callback returns the answer:
The method obtains conversational_qa_chain from self.knowledge_qa.get_chain() and iterates over the chunks it streams, yielding each new piece of the answer as it arrives.
After the stream ends, it attaches the sources to the metadata and persists the full answer with the save_answer method.
Here is the relevant code snippet:
async def generate_stream(
    self, chat_id: UUID, question: ChatQuestion, save_answer: bool = True
) -> AsyncIterable:
    conversational_qa_chain = self.knowledge_qa.get_chain()
    transformed_history, streamed_chat_history = (
        self.initialize_streamed_chat_history(chat_id, question)
    )
    response_tokens = ""
    sources = []
    citations = []
    first = True
    config = {"metadata": {"conversation_id": str(chat_id)}}
    async for chunk in conversational_qa_chain.astream(
        {
            "question": question.question,
            "chat_history": transformed_history,
            "custom_personality": (
                self.prompt_to_use.content if self.prompt_to_use else None
            ),
        },
        config=config,
    ):
        if not streamed_chat_history.metadata:
            streamed_chat_history.metadata = {}
        if self.model_compatible_with_function_calling(model=self.model):
            if chunk.get("answer"):
                if first:
                    gathered = chunk["answer"]
                    first = False
                else:
                    gathered = gathered + chunk["answer"]
                if (
                    gathered.tool_calls
                    and gathered.tool_calls[-1].get("args")
                    and "answer" in gathered.tool_calls[-1]["args"]
                ):
                    answer = gathered.tool_calls[-1]["args"]["answer"]
                    difference = answer[len(response_tokens):]
                    streamed_chat_history.assistant = difference
                    response_tokens = answer
                    yield f"data: {json.dumps(streamed_chat_history.dict())}"
                if (
                    gathered.tool_calls
                    and gathered.tool_calls[-1].get("args")
                    and "citations" in gathered.tool_calls[-1]["args"]
                ):
                    citations = gathered.tool_calls[-1]["args"]["citations"]
                if (
                    gathered.tool_calls
                    and gathered.tool_calls[-1].get("args")
                    and "followup_questions" in gathered.tool_calls[-1]["args"]
                ):
                    followup_questions = gathered.tool_calls[-1]["args"][
                        "followup_questions"
                    ]
                    streamed_chat_history.metadata["followup_questions"] = (
                        followup_questions
                    )
                if (
                    gathered.tool_calls
                    and gathered.tool_calls[-1].get("args")
                    and "thoughts" in gathered.tool_calls[-1]["args"]
                ):
                    thoughts = gathered.tool_calls[-1]["args"]["thoughts"]
                    streamed_chat_history.metadata["thoughts"] = thoughts
        else:
            if chunk.get("answer"):
                response_tokens += chunk["answer"].content
                streamed_chat_history.assistant = chunk["answer"].content
                yield f"data: {json.dumps(streamed_chat_history.dict())}"
        if chunk.get("docs"):
            sources = chunk["docs"]
    sources_list = generate_source(sources, self.brain_id, citations)
    serialized_sources_list = [source.dict() for source in sources_list]
    streamed_chat_history.metadata["sources"] = serialized_sources_list
    yield f"data: {json.dumps(streamed_chat_history.dict())}"
    self.save_answer(question, response_tokens, streamed_chat_history, save_answer)
The answer comes from the conversational_qa_chain, which processes the question and chat history to generate a response. The generate_source method is used to manage the sources and citations within this method [1][2].
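The "send only the difference" step in the function-calling branch can be illustrated on its own. The sketch below is an assumption-laden simplification: stream_deltas is a hypothetical helper, the cumulative answers are passed in as plain strings rather than accumulated tool-call arguments, and, unlike the method above, it skips empty differences.

```python
import json
from typing import Iterable, Iterator


def stream_deltas(cumulative_answers: Iterable[str]) -> Iterator[str]:
    """Yield one 'data: ...' event per newly arrived piece of the answer."""
    response_tokens = ""  # the part of the answer already sent
    for answer in cumulative_answers:
        difference = answer[len(response_tokens):]
        response_tokens = answer
        if difference:  # simplification: do not emit empty events
            yield f"data: {json.dumps({'assistant': difference})}"


# Each element is the full answer gathered so far, as in the loop above
events = list(stream_deltas(["He", "Hello", "Hello wor", "Hello world"]))
deltas = [json.loads(event[len("data: "):])["assistant"] for event in events]
```

Concatenating the deltas on the client side reconstructs the complete answer, which is why each event carries only the new suffix.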
After I added "ollama/llama2" and "ollama/qwen:1.8b" to the model_compatible_with_function_calling method so that these models are recognized as compatible with function calling, the debug logs show that OpenAI is being called (a connection error occurs because the server cannot reach OpenAI), but I want to use the local Ollama instead. This OpenAI exception did not occur before I modified model_compatible_with_function_calling. So why, after adding the ollama/llama2 and ollama/qwen:1.8b models to that method, is OpenAI called instead of the local Ollama?
Here is the code of model_compatible_with_function_calling:
def model_compatible_with_function_calling(self):
    if self.model in [
        "gpt-4o",
        "gpt-4-turbo",
        "gpt-4-turbo-2024-04-09",
        "gpt-4-turbo-preview",
        "gpt-4-0125-preview",
        "gpt-4-1106-preview",
        "gpt-4",
        "gpt-4-0613",
        "gpt-3.5-turbo",
        "gpt-3.5-turbo-0125",
        "gpt-3.5-turbo-1106",
        "gpt-3.5-turbo-0613",
        "ollama/llama2",
        "ollama/qwen:1.8b",
    ]:
        return True
    return False
Here is the log:
| File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1615, in _retry_request
| return await self._request(
| ^^^^^^^^^^^^^^^^^^^^
| File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1556, in _request
| raise APIConnectionError(request=request) from err
| openai.APIConnectionError: Connection error.
Log of ChatLiteLLM:
'''
ChatLiteLLM(client=<module 'litellm' from '/usr/local/lib/python3.11/site-packages/litellm/__init__.py'>, model='ollama/qwen:1.8b', openai_api_key='tk-aabbccddAABBCCDDEeFfGgHhIiJKLmnopjklMNOPqQqQqQqQ', azure_api_key='', anthropic_api_key='null', replicate_api_key='', cohere_api_key='', openrouter_api_key='', api_base='http://172.17.0.1:11434', temperature=0.0, huggingface_api_key='', together_ai_api_key='')
'''
The issue of OpenAI being called instead of the local Ollama after adding "ollama/llama2" and "ollama/qwen:1.8b" to the model_compatible_with_function_calling method is likely due to the OLLAMA_API_BASE_URL not being properly configured or activated.
To ensure that the local Ollama is called, follow these steps:
Uncomment and Set the OLLAMA_API_BASE_URL:
Make sure that the OLLAMA_API_BASE_URL is uncommented and correctly set in your .env file. This URL should point to the local Ollama API.
# LOCAL
OLLAMA_API_BASE_URL=http://host.docker.internal:11434 # Uncomment to activate ollama. This is the local url for the ollama api
Verify the _create_llm Method:
Ensure that the _create_llm method in the HeadlessQA class is correctly configured to use the local Ollama API when the model name starts with "ollama".
def _create_llm(
    self,
    model,
    temperature=0,
    streaming=False,
    callbacks=None,
) -> BaseChatModel:
    api_base = None
    if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
        api_base = self.brain_settings.ollama_api_base_url
        logger.debug("Creating Ollama LLM with model: %s and API base: %s", model, api_base)
    else:
        logger.debug("Creating OpenAI LLM with model: %s", model)
    return ChatLiteLLM(
        temperature=temperature,
        model=model,
        streaming=streaming,
        verbose=True,
        callbacks=callbacks,
        max_tokens=self.max_tokens,
        api_base=api_base,
    )
Check the Fallback Mechanism:
Ensure that the fallback mechanism is not defaulting to OpenAI due to a misconfiguration. The embeddings property in the QuivrRAG class should prioritize Ollama if the OLLAMA_API_BASE_URL is set.
@property
def embeddings(self):
    if self.brain_settings.ollama_api_base_url:
        return OllamaEmbeddings(
            base_url=self.brain_settings.ollama_api_base_url
        )  # pyright: ignore reportPrivateUsage=none
    else:
        return OpenAIEmbeddings()
Enable Debug Logging:
Use the get_logger function to add debug statements in the code to trace why OpenAI is being called. This will help you understand the decision-making process.
from logger import get_logger

logger = get_logger(__name__)


def _create_llm(
    self,
    model,
    temperature=0,
    streaming=False,
    callbacks=None,
) -> BaseChatModel:
    api_base = None
    if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
        api_base = self.brain_settings.ollama_api_base_url
        logger.debug("Creating Ollama LLM with model: %s and API base: %s", model, api_base)
    else:
        logger.debug("Creating OpenAI LLM with model: %s", model)
    return ChatLiteLLM(
        temperature=temperature,
        model=model,
        streaming=streaming,
        verbose=True,
        callbacks=callbacks,
        max_tokens=self.max_tokens,
        api_base=api_base,
    )
By ensuring that the OLLAMA_API_BASE_URL is correctly set and the _create_llm method is properly configured, you should be able to use the local Ollama instead of defaulting to OpenAI [1][2][3].
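The api_base selection used in _create_llm can be tested separately from the rest of the backend. The following is a minimal sketch of that rule only; resolve_api_base is a hypothetical helper name, and reading OLLAMA_API_BASE_URL straight from the environment is an assumption made to keep the example self-contained.

```python
import os
from typing import Optional


def resolve_api_base(model: str, ollama_api_base_url: Optional[str]) -> Optional[str]:
    """Return the Ollama URL only for 'ollama...' models when a URL is configured."""
    if ollama_api_base_url and model.startswith("ollama"):
        return ollama_api_base_url
    return None  # no override, so the client library uses its default endpoint


# Assumption: the setting is available as an environment variable
configured_url = os.environ.get("OLLAMA_API_BASE_URL")
selected = resolve_api_base("ollama/llama2", configured_url)
```

Running this inside the backend container shows quickly whether the variable is visible to the process: if selected is None for an "ollama/" model, the URL is not set in that environment.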
The logs for _create_llm were not printed, but the logs for get_chain and embeddings were printed and show the local URL and model name:
===========embeddings====base_url= http://172.17.0.1:11434.
...
backend-core | ===========get_chain===api_base= http://172.17.0.1:11434.
backend-core | ===========get_chain====model= ollama/qwen:1.8b.
Ollama is also running:
curl http://172.17.0.1:11434
Ollama is running
openai exception log:
backend-core | | Traceback (most recent call last):
backend-core | | File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 261, in wrap
backend-core | | await func()
backend-core | | File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 250, in stream_response
backend-core | | async for chunk in self.body_iterator:
backend-core | | File "/code/modules/brain/knowledge_brain_qa.py", line 352, in generate_stream
backend-core | | async for chunk in conversational_qa_chain.astream(
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 2900, in astream
backend-core | | async for chunk in self.atransform(input_aiter(), config, kwargs):
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 2883, in atransform
backend-core | | async for chunk in self._atransform_stream_with_config(
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 1980, in _atransform_stream_with_config
backend-core | | chunk: Output = await asyncio.create_task( # type: ignore[call-arg]
backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 2853, in _atransform
backend-core | | async for output in final_pipeline:
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 3315, in atransform
backend-core | | async for chunk in self._atransform_stream_with_config(
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 1980, in _atransform_stream_with_config
backend-core | | chunk: Output = await asyncio.create_task( # type: ignore[call-arg]
backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 3302, in _atransform
backend-core | | chunk = AddableDict({step_name: task.result()})
backend-core | | ^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 3285, in get_next_chunk
backend-core | | return await py_anext(generator)
backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 2883, in atransform
backend-core | | async for chunk in self._atransform_stream_with_config(
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 1980, in _atransform_stream_with_config
backend-core | | chunk: Output = await asyncio.create_task( # type: ignore[call-arg]
backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 2853, in _atransform
backend-core | | async for output in final_pipeline:
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 4748, in atransform
backend-core | | async for item in self.bound.atransform(
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 1334, in atransform
backend-core | | async for output in self.astream(final, config, kwargs):
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 319, in astream
backend-core | | raise e
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 297, in astream
backend-core | | async for chunk in self._astream(
backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_openai/chat_models/base.py", line 579, in _astream
backend-core | | response = await self.async_client.create(messages=message_dicts, **params)
backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/resources/chat/completions.py", line 1181, in create
backend-core | | return await self._post(
backend-core | | ^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1790, in post
backend-core | | return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1493, in request
backend-core | | return await self._request(
backend-core | | ^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1546, in _request
backend-core | | return await self._retry_request(
backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1615, in _retry_request
backend-core | | return await self._request(
backend-core | | ^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1546, in _request
backend-core | | return await self._retry_request(
backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1615, in _retry_request
backend-core | | return await self._request(
backend-core | | ^^^^^^^^^^^^^^^^^^^^
backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1556, in _request
backend-core | | raise APIConnectionError(request=request) from err
backend-core | | openai.APIConnectionError: Connection error.
backend-core | +------------------------------------
The log line (embeddings====base_url= http://172.17.0.1:11434) shows that the base_url in the embeddings method points to the local URL.
The log lines (get_chain====model= ollama/llama2 and api_base= http://172.17.0.1:11434) show that the api_base in the get_chain method points to the local URL and that the model is the local model ollama/llama2.
'''
INFO: 127.0.0.1:57506 - "GET /healthz HTTP/1.1" 200 OK
INFO: 113.118.173.118:16728 - "OPTIONS /chat/8483c3d6-f941-4716-8cb5-84d4ea3b93b8/question/stream?brain_id=1737bf90-7745-4c6f-9efe-6b5a685965a2 HTTP/1.1" 200 OK
INFO: 113.118.173.118:16731 - "GET /onboarding HTTP/1.1" 200 OK
INFO: 113.118.173.118:16729 - "GET /user HTTP/1.1" 200 OK
==========KnowledgeBrainQA======init
==========KnowledgeBrainQA======increase_usage_user
==========KnowledgeBrainQA======calculate_pricing
===========embeddings====base_url= http://172.17.0.1:11434.
INFO: 113.118.173.118:16728 - "POST /chat/8483c3d6-f941-4716-8cb5-84d4ea3b93b8/question/stream?brain_id=1737bf90-7745-4c6f-9efe-6b5a685965a2 HTTP/1.1" 200 OK
==========KnowledgeBrainQA======generate_stream
===========get_chain===api_base= http://172.17.0.1:11434.
===========get_chain====model= ollama/llama2.
========xxxxxxxxxxxx=====get_chain, loaded_memory:mapper={
chat_history: RunnableLambda(lambda x: self.filter_history(x['chat_history'])),
question: RunnableLambda(lambda x: x['question'])
} , standalone_question:{'standalone_question': {
question: RunnableLambda(...),
chat_history: RunnableLambda(itemgetter('chat_history'))
}
| PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language. Keep as much details as
possible from previous messages. Keep entity names and all. \n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:')
| ChatLiteLLM(client=<module 'litellm' from '/usr/local/lib/python3.11/site-packages/litellm/init.py'>, model='ollama/llama2', openai_api_key='tk-aabbccddAABBCCDDEeFfGgHhIiJKLmnopjklMNOPqQqQqQqQ', azure_api_key='', anthropic_api_key='null'
, replicate_api_key='', cohere_api_key='', openrouter_api_key='', api_base='http://172.17.0.1:11434', temperature=0.0, huggingface_api_key='', together_ai_api_key='')
| StrOutputParser()}, retrieved_documents:{'docs': RunnableLambda(itemgetter('standalone_question'))
| ContextualCompressionRetriever(base_compressor=FlashrankRerank(client=<flashrank.Ranker.Ranker object at 0x7f9d9e4ff690>, top_n=10, model='ms-marco-TinyBERT-L-2-v2'), base_retriever=VectorStoreRetriever(tags=['CustomSupabaseVectorStore', 'Ol
lamaEmbeddings'], vectorstore=<vectorstore.supabase.CustomSupabaseVectorStore object at 0x7f9d9eed4dd0>)), 'question': <function QuivrRAG.get_chain.
This LiteLLM log shows that the local Ollama generate API was called:
'''
POST Request Sent from LiteLLM:
curl -X POST \
http://172.17.0.1:11434/api/generate \
-d '{'model': 'llama2', 'prompt': "Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language. Keep as much details as possible from previous messages. Keep entity names and all. \n\nChat History:\n[HumanMessage(content='this is a question'),
DEBUG:httpx:load_ssl_context verify=True cert=None trust_env=True http2=False
DEBUG:httpx:load_verify_locations cafile='/usr/local/lib/python3.11/site-packages/certifi/cacert.pem'
DEBUG:httpcore.connection:connect_tcp.started host='172.17.0.1' port=11434 local_address=None timeout=6000 socket_options=None
DEBUG:httpcore.connection:connect_tcp.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x7f9d9eede410>
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Content-Type', b'application/x-ndjson'), (b'Date', b'Fri, 07 Jun 2024 01:07:32 GMT'), (b'Transfer-Encoding', b'chunked')])
INFO:httpx:HTTP Request: POST http://172.17.0.1:11434/api/generate "HTTP/1.1 200 OK"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'POST']>
'''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1522, in _request
response = await self._client.send(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1661, in send
response = await self._send_handling_auth(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1689, in _send_handling_auth
response = await self._send_handling_redirects(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1726, in _send_handling_redirects
response = await self._send_single_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1763, in _send_single_request
response = await transport.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 372, in handle_async_request
with map_httpcore_exceptions():
File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ConnectError: All connection attempts failed
DEBUG:openai._base_client:1 retry left
INFO:openai._base_client:Retrying request to /chat/completions in 0.791249 seconds
'''
In summary, I found no issues with the local configuration, and the local model was indeed called. However, the OpenAI exception log above shows that OpenAI was called as well. Judging by the sequence of events, this happened after the asynchronous chunk processing had completed. What I still do not understand is which part of the code triggered the call to OpenAI's /chat/completions endpoint.
The issue you're encountering, where OpenAI's API is being called despite the local Ollama model being configured, is likely due to a fallback mechanism or a misconfiguration in the code that handles the language model selection. Here are the steps to ensure that only the local Ollama model is used and to prevent any calls to OpenAI:
Ensure OLLAMA_API_BASE_URL is Set Correctly:
You've already confirmed that the OLLAMA_API_BASE_URL is set correctly and that the local Ollama model is being used in some parts of the code. This is good.
Check the model_compatible_with_function_calling Method:
Ensure that the models "ollama/llama2" and "ollama/qwen:1.8b" are added to the model_compatible_with_function_calling method, which you have already done.
Verify the _create_llm Method:
Ensure that the _create_llm method in the HeadlessQA class is correctly configured to use the local Ollama API when the model name starts with "ollama". This method should set the api_base to self.brain_settings.ollama_api_base_url.
def _create_llm(
    self,
    model,
    temperature=0,
    streaming=False,
    callbacks=None,
) -> BaseChatModel:
    api_base = None
    if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
        api_base = self.brain_settings.ollama_api_base_url
        logger.debug("Creating Ollama LLM with model: %s and API base: %s", model, api_base)
    else:
        logger.debug("Creating OpenAI LLM with model: %s", model)
    return ChatLiteLLM(
        temperature=temperature,
        model=model,
        streaming=streaming,
        verbose=True,
        callbacks=callbacks,
        max_tokens=self.max_tokens,
        api_base=api_base,
    )
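The selection rule inside _create_llm can be checked on its own, without starting the backend. The following is a minimal sketch, assuming the rule is exactly the one in the snippet above; resolve_api_base is a hypothetical helper name, and http://localhost:11434 is only a placeholder for your own OLLAMA_API_BASE_URL value.

```python
from typing import Optional


def resolve_api_base(model: str, ollama_api_base_url: Optional[str]) -> Optional[str]:
    """Return the Ollama base URL for "ollama..." models, otherwise None.

    Hypothetical helper that mirrors the condition in _create_llm.
    """
    if ollama_api_base_url and model.startswith("ollama"):
        return ollama_api_base_url
    # None means no override is passed, so the client library picks its own default.
    return None


# Placeholder URL, replace it with the value from your environment.
local_base = resolve_api_base("ollama/llama2", "http://localhost:11434")
other_base = resolve_api_base("gpt-3.5-turbo", "http://localhost:11434")
unset_base = resolve_api_base("ollama/llama2", None)
print(local_base, other_base, unset_base)
```

Note that any model name that does not start with "ollama", or an unset base URL, produces None, so it is worth logging the model name that actually reaches this method.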
Check the generate_stream Method:
The generate_stream method in knowledge_based_brain_qa.py might be falling back to OpenAI's API under certain conditions. Ensure that this method is correctly configured to use the local Ollama model.
async def generate_stream(
    self, chat_id: UUID, question: ChatQuestion, save_answer: bool = True
) -> AsyncIterable:
    conversational_qa_chain = self.knowledge_qa.get_chain()
    transformed_history, streamed_chat_history = (
        self.initialize_streamed_chat_history(chat_id, question)
    )
    response_tokens = ""
    sources = []
    citations = []
    first = True
    config = {"metadata": {"conversation_id": str(chat_id)}}
    async for chunk in conversational_qa_chain.astream(
        {
            "question": question.question,
            "chat_history": transformed_history,
            "custom_personality": (
                self.prompt_to_use.content if self.prompt_to_use else None
            ),
        },
        config=config,
    ):
        if not streamed_chat_history.metadata:
            streamed_chat_history.metadata = {}
        if self.model_compatible_with_function_calling(model=self.model):
            if chunk.get("answer"):
                if first:
                    gathered = chunk["answer"]
                    first = False
                else:
                    gathered = gathered + chunk["answer"]
                if (
                    gathered.tool_calls
                    and gathered.tool_calls[-1].get("args")
                    and "answer" in gathered.tool_calls[-1]["args"]
                ):
                    answer = gathered.tool_calls[-1]["args"]["answer"]
                    difference = answer[len(response_tokens):]
                    streamed_chat_history.assistant = difference
                    response_tokens = answer
                    yield f"data: {json.dumps(streamed_chat_history.dict())}"
                if (
                    gathered.tool_calls
                    and gathered.tool_calls[-1].get("args")
                    and "citations" in gathered.tool_calls[-1]["args"]
                ):
                    citations = gathered.tool_calls[-1]["args"]["citations"]
                if (
                    gathered.tool_calls
                    and gathered.tool_calls[-1].get("args")
                    and "followup_questions" in gathered.tool_calls[-1]["args"]
                ):
                    followup_questions = gathered.tool_calls[-1]["args"][
                        "followup_questions"
                    ]
                    streamed_chat_history.metadata["followup_questions"] = (
                        followup_questions
                    )
                if (
                    gathered.tool_calls
                    and gathered.tool_calls[-1].get("args")
                    and "thoughts" in gathered.tool_calls[-1]["args"]
                ):
                    thoughts = gathered.tool_calls[-1]["args"]["thoughts"]
                    streamed_chat_history.metadata["thoughts"] = thoughts
        else:
            if chunk.get("answer"):
                response_tokens += chunk["answer"].content
                streamed_chat_history.assistant = chunk["answer"].content
                yield f"data: {json.dumps(streamed_chat_history.dict())}"
        if chunk.get("docs"):
            sources = chunk["docs"]
    sources_list = generate_source(sources, self.brain_id, citations)
    serialized_sources_list = [source.dict() for source in sources_list]
    streamed_chat_history.metadata["sources"] = serialized_sources_list
    yield f"data: {json.dumps(streamed_chat_history.dict())}"
    self.save_answer(question, response_tokens, streamed_chat_history, save_answer)
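In the function-calling branch above, each gathered chunk carries the whole answer so far, and only the part after the previously sent text is streamed to the client. Here is a minimal sketch of that slicing step in isolation; stream_deltas is a hypothetical name, the input list stands in for successive values of the "answer" argument, and unlike the original loop this version skips empty deltas.

```python
from typing import Iterable, Iterator


def stream_deltas(cumulative_answers: Iterable[str]) -> Iterator[str]:
    """Yield only the text added since the previous cumulative answer."""
    sent = ""
    for answer in cumulative_answers:
        # Same slicing as `answer[len(response_tokens):]` in generate_stream.
        delta = answer[len(sent):]
        sent = answer
        if delta:  # assumption: nothing is emitted when no new text arrived
            yield delta


deltas = list(stream_deltas(["He", "Hello", "Hello", "Hello world"]))
print(deltas)
```

The slicing assumes each cumulative answer extends the previous one; if a model rewrites earlier text, the delta would no longer line up.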
Check for Other Methods Making API Calls:
Ensure that other methods such as make_completion, make_completion_without_streaming, and get_api_call_response_as_text in api_brain_qa.py are not making calls to OpenAI's API.
async def make_completion(
    self,
    messages,
    functions,
    brain_id: UUID,
    recursive_count=0,
    should_log_steps=True,
) -> str | None:
    response = completion(
        model=self.model,
        temperature=self.temperature,
        max_tokens=self.max_tokens,
        messages=messages,
        functions=functions,
        stream=True,
        function_call="auto",
    )
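Note that the make_completion excerpt above passes no api_base to completion, unlike _create_llm. Assuming completion here is LiteLLM's completion function (the excerpt does not show the import), it accepts an api_base keyword, so one possible approach is to build the arguments in one place. This is only a sketch; build_completion_kwargs is a hypothetical helper and the URL is a placeholder.

```python
from typing import Any, Dict, List, Optional


def build_completion_kwargs(
    model: str,
    messages: List[Dict[str, str]],
    ollama_api_base_url: Optional[str] = None,
    **extra: Any,
) -> Dict[str, Any]:
    """Collect completion arguments, adding api_base only for Ollama models."""
    kwargs: Dict[str, Any] = {"model": model, "messages": messages, **extra}
    if ollama_api_base_url and model.startswith("ollama"):
        # Same condition as in _create_llm.
        kwargs["api_base"] = ollama_api_base_url
    return kwargs


messages = [{"role": "user", "content": "hello"}]
ollama_kwargs = build_completion_kwargs(
    "ollama/llama2", messages, "http://localhost:11434", temperature=0
)
default_kwargs = build_completion_kwargs("gpt-3.5-turbo", messages, "http://localhost:11434")
print(ollama_kwargs)
```

The resulting dictionary would then be unpacked into the call, for example completion(**ollama_kwargs), which is left out here so the sketch runs without any network access.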
Disable OpenAI API Key:
Set the OPENAI_API_KEY to a fake key so that any request that still reaches OpenAI's API is rejected instead of succeeding.
OPENAI_API_KEY=tk-aabbccddAABBCCDDEeFfGgHhIiJKLmnopjklMNOPqQqQqQqQ
By following these steps, you should be able to ensure that only the local Ollama model is used and prevent any calls to OpenAI [1][2][3].
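If it is still unclear which code path triggers the /chat/completions request, one generic way to find out is to wrap the suspected function and record the call stack each time it runs. The sketch below uses only the standard library; record_callers, fake_completion, and generate_answer are hypothetical names used for illustration and are not part of Quivr.

```python
import functools
import traceback
from typing import Any, Callable, List


def record_callers(func: Callable[..., Any], sink: List[List[str]]) -> Callable[..., Any]:
    """Wrap func so each call appends the names of the calling frames to sink."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # extract_stack() lists frames oldest first; drop the wrapper's own frame.
        frames = traceback.extract_stack()[:-1]
        sink.append([frame.name for frame in frames])
        return func(*args, **kwargs)

    return wrapper


def fake_completion(**kwargs: Any) -> str:
    """Stand-in for the real client call, so the sketch needs no network."""
    return "ok"


calls: List[List[str]] = []
traced_completion = record_callers(fake_completion, calls)


def generate_answer() -> str:
    return traced_completion(model="ollama/llama2")


result = generate_answer()
print(calls[0][-1])
```

In a real run you would wrap the actual client call instead of fake_completion and write the recorded frame names to the log next to the model name.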
Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.
What happened?
I ran the Ollama model on my local server and set my brain to use it. However, I ran into a problem while chatting with my brain. Because my server cannot connect directly to huggingface.co, an exception was thrown while chatting:
If the ms-marco-TinyBERT-L-2-v2.zip file is manually downloaded and copied to the server, can the program automatically skip the download step and avoid the exception mentioned above? If so, which directory on the server does it need to be copied to? If this approach cannot solve the problem, what should be done instead?
Relevant log output