QuivrHQ / quivr

Open-source RAG Framework for building GenAI Second Brains 🧠 Build productivity assistant (RAG) ⚡️🤖 Chat with your docs (PDF, CSV, ...) & apps using Langchain, GPT 3.5 / 4 turbo, Private, Anthropic, VertexAI, Ollama, LLMs, Groq that you can share with users ! Efficient retrieval augmented generation framework
https://quivr.com

[Bug]: Use Ollama model #2631

Closed · andyzhangwp closed this issue 3 weeks ago

andyzhangwp commented 4 months ago

What happened?

I ran the Ollama model on the local server and set my brain to the Ollama model. However, I encountered a problem while chatting with my brain. Due to the inability to directly connect to huggingface.co on my server, an exception was thrown while chatting:

Can the program automatically skip the download step and avoid the exception mentioned above if the ms-marco-TinyBERT-L-2-v2.zip file is manually downloaded and copied to the server? If possible, which directory on the server does it need to be copied to? If the methods mentioned above cannot solve the problem, what should be done?

Relevant log output

log begin ==========================================
......
backend-core  |     | Traceback (most recent call last):
backend-core  |     |   File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
backend-core  |     |     resp = conn.urlopen(
backend-core  |     |            ^^^^^^^^^^^^^
backend-core  |     |   File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 847, in urlopen
backend-core  |     |     retries = retries.increment(
backend-core  |     |               ^^^^^^^^^^^^^^^^^^
backend-core  |     |   File "/usr/local/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
backend-core  |     |     raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
backend-core  |     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-core  |     | urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /prithivida/flashrank/resolve/main/ms-marco-TinyBERT-L-2-v2.zip (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fe4ea7e73d0>, '**Connection to huggingface.co timed out.** (connect timeout=None)'))
backend-core  |     | 
......
log end ============================================

Twitter / LinkedIn details

No response

dosubot[bot] commented 4 months ago

Hey @andyzhangwp! 👋 I'm here to help you with any bugs, questions, or contributions you have while we wait for a human maintainer. Let's squash those bugs together!

Yes, the program can automatically skip the download step and avoid the exception if the ms-marco-TinyBERT-L-2-v2.zip file is manually downloaded and copied to the server. The file needs to be copied to the directory where the program expects to find it, which is the current working directory of the script.

The relevant code snippet indicates that the program uses the FlashrankRerank model with ms-marco-TinyBERT-L-2-v2:

compressor = FlashrankRerank(model="ms-marco-TinyBERT-L-2-v2", top_n=20)

Ensure that the ms-marco-TinyBERT-L-2-v2.zip file is placed in the current working directory of the script to avoid the download step and the associated exception [1].
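
For reference, the download in the traceback comes from flashrank, which resolves the reranker model through its Ranker cache directory (which defaults to /tmp). A minimal sketch, assuming your installed langchain version accepts a pre-built Ranker as the client (import path may differ across versions):

from flashrank import Ranker
# Import path for FlashrankRerank can vary between langchain versions:
from langchain_community.document_compressors import FlashrankRerank

# flashrank looks up the unzipped model under cache_dir (default /tmp), so a manually
# downloaded and extracted ms-marco-TinyBERT-L-2-v2 placed there is picked up without
# contacting huggingface.co.
ranker = Ranker(model_name="ms-marco-TinyBERT-L-2-v2", cache_dir="/tmp")
compressor = FlashrankRerank(client=ranker, model="ms-marco-TinyBERT-L-2-v2", top_n=20)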

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 4 months ago

I tried the following steps, but the problem still persists. Step 1: Copy ms-marco-TinyBERT-L-2-v2.zip to these directories (/usr/local/Quivr/backend/models, /usr/local/Quivr/backend/modules/brain/rags). Step 2: Restart the application with docker-compose up --build. Which directory should I copy it to?

dosubot[bot] commented 4 months ago

To ensure the application recognizes the manually downloaded ms-marco-TinyBERT-L-2-v2.zip file and bypasses the download step, you should place the file in the directory where the application expects to find it. Based on the provided information, the current working directory of the script is likely the most appropriate location.

Here are the steps you should follow:

  1. Copy the ms-marco-TinyBERT-L-2-v2.zip file to the current working directory of the script. This is typically the root directory of the backend service in your Docker container.

  2. Modify the _create_llm method in the QuivrRAG class to ensure it uses the manually placed model file. You can adjust the api_base or the model parameter to point to the local path of the manually placed model file.

Here is the relevant part of the code that needs to be modified:

def _create_llm(
    self,
    callbacks,
    model,
    streaming=False,
    temperature=0,
) -> BaseLLM:
    """
    Create a LLM with the given parameters
    """
    if streaming and callbacks is None:
        raise ValueError(
            "Callbacks must be provided when using streaming language models"
        )

    api_base = None
    if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
        api_base = (
            self.brain_settings.ollama_api_base_url  # pyright: ignore reportPrivateUsage=none
        )
    return ChatLiteLLM(
        temperature=temperature,
        max_tokens=self.max_tokens,
        model=model,
        streaming=streaming,
        verbose=False,
        callbacks=callbacks,
        api_base=api_base,
    )  # pyright: ignore reportPrivateUsage=none
  3. Restart the application using Docker Compose:
docker-compose up --build

By following these steps, you should be able to bypass the download step and avoid the exception related to the connection timeout [1].

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 4 months ago

I have modified the following code, but it still has the same issue,

def _create_llm(
    self,
    callbacks,
    model="/code/ms-marco-TinyBERT-L-2-v2.zip",
    streaming=False,
    temperature=0,
) -> BaseLLM:
    """
    Create a LLM with the given parameters
    """
    if streaming and callbacks is None:
        raise ValueError(
            "Callbacks must be provided when using streaming language models"
        )

    api_base = None
    if self.brain_settings.ollama_api_base_url: # and model.startswith("ollama"):
        api_base = self.brain_settings.ollama_api_base_url

dosubot[bot] commented 4 months ago

The issue might be due to the way the model path is being handled. The model parameter in the _create_llm method is typically expected to be a model name rather than a file path. To ensure the application uses the local model file, you need to ensure that the model loading mechanism is correctly configured to recognize and use the local file.

Here are the steps to correctly configure the local model file:

  1. Place the Model File in the Correct Directory: Ensure that the ms-marco-TinyBERT-L-2-v2.zip file is placed in a directory that the application can access. A common practice is to place it in a directory like /code/models/.

  2. Modify the Model Loading Mechanism: Instead of modifying the _create_llm method directly, you should ensure that the model loading mechanism within the application is configured to use the local file. This might involve modifying the part of the code where the model is loaded.

  3. Update the Environment Configuration: Ensure that the environment variables and configuration files are set up to point to the local model file.

Here is an example of how you might modify the code to load the model from a local file:

def _create_llm(
    self,
    callbacks,
    model="ms-marco-TinyBERT-L-2-v2",
    streaming=False,
    temperature=0,
) -> BaseLLM:
    """
    Create a LLM with the given parameters
    """
    if streaming and callbacks is None:
        raise ValueError(
            "Callbacks must be provided when using streaming language models"
        )

    api_base = None
    if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
        api_base = self.brain_settings.ollama_api_base_url

    # Check if the model file exists locally
    local_model_path = "/code/models/ms-marco-TinyBERT-L-2-v2.zip"
    if os.path.exists(local_model_path):
        model = local_model_path

    return ChatLiteLLM(
        temperature=temperature,
        max_tokens=self.max_tokens,
        model=model,
        streaming=streaming,
        verbose=False,
        callbacks=callbacks,
        api_base=api_base,
    )
  4. Restart the Application: After making these changes, restart the application using Docker Compose:
docker-compose up --build

By ensuring that the model file is correctly placed and the application is configured to use the local file, you should be able to bypass the download step and avoid the connection timeout issue.

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 4 months ago

We don't need to modify the code; we just need to copy the downloaded model file to an accessible location, specifically /tmp in the Docker application container, and decompress it there. After that, chatting with the brain works. However, it is very resource-intensive: CPU, IOPS, and BPS are all high on a 2-core 8 GB server, and a single chat takes more than 10 minutes. Is there any way to optimize it?

dosubot[bot] commented 4 months ago

To optimize the performance of the Ollama model running on a local server, you can follow these strategies:

  1. Enable Streaming: Streaming can help reduce response times by sending partial results as they are generated. Ensure that the streaming parameter is set to True.

    def _determine_streaming(self, streaming: bool) -> bool:
        """If the model name allows for streaming and streaming is declared, set streaming to True."""
        return streaming
    
    def _determine_callback_array(
        self, streaming
    ) -> List[AsyncIteratorCallbackHandler]:
        """If streaming is set, set the AsyncIteratorCallbackHandler as the only callback."""
        if streaming:
            return [AsyncIteratorCallbackHandler()]
        else:
            return []
  2. Adjust Model Parameters: Tuning parameters such as temperature and max_tokens can help optimize performance. Lowering the max_tokens can reduce the computational load.

    class HeadlessQA(BaseModel, QAInterface):
        brain_settings = BrainSettings()
        model: str
        temperature: float = 0.0
        max_tokens: int = 2000
        streaming: bool = False
        chat_id: str
        callbacks: Optional[List[AsyncIteratorCallbackHandler]] = None
        prompt_id: Optional[UUID] = None
  3. Use Efficient Models: Ensure that the model being used is optimized for performance. The ChatLiteLLM is designed to be lightweight.

    def _create_llm(
        self,
        model,
        temperature=0,
        streaming=False,
        callbacks=None,
    ) -> BaseChatModel:
        """
        Determine the language model to be used.
        :param model: Language model name to be used.
        :param streaming: Whether to enable streaming of the model
        :param callbacks: Callbacks to be used for streaming
        :return: Language model instance
        """
        api_base = None
        if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
            api_base = self.brain_settings.ollama_api_base_url
        return ChatLiteLLM(
            temperature=temperature,
            model=model,
            streaming=streaming,
            verbose=True,
            callbacks=callbacks,
            max_tokens=self.max_tokens,
            api_base=api_base,
        )
  4. Optimize Chat History Handling: Efficiently format and handle chat history to reduce overhead.

    def generate_answer(
        self, chat_id: UUID, question: ChatQuestion, save_answer: bool = True
    ) -> GetChatHistoryOutput:
        transformed_history = format_chat_history(
            chat_service.get_chat_history(self.chat_id)
        )
        prompt_content = (
            self.prompt_to_use.content if self.prompt_to_use else SYSTEM_MESSAGE
        )
    
        messages = format_history_to_openai_mesages(
            transformed_history, prompt_content, question.question
        )
        answering_llm = self._create_llm(
            model=self.model,
            streaming=False,
            callbacks=self.callbacks,
        )
        model_prediction = answering_llm.predict_messages(messages)
        answer = model_prediction.content
        if save_answer:
            new_chat = chat_service.update_chat_history(
                CreateChatHistory(
                    **{
                        "chat_id": chat_id,
                        "user_message": question.question,
                        "assistant": answer,
                        "brain_id": None,
                        "prompt_id": self.prompt_to_use_id,
                    }
                )
            )
        return GetChatHistoryOutput(
            **{
                "chat_id": chat_id,
                "user_message": question.question,
                "assistant": answer,
                "message_time": new_chat.message_time,
                "prompt_title": (
                    self.prompt_to_use.title if self.prompt_to_use else None
                ),
                "brain_name": None,
                "message_id": new_chat.message_id,
            }
        )
  5. Docker Resource Limits: Ensure that your Docker container has appropriate resource limits set. You can specify CPU and memory limits in your Docker Compose file.

    services:
      quivr-backend:
        deploy:
          resources:
            limits:
              cpus: '1.0'
              memory: 2700M
  6. Parallel Processing: Utilize parallel processing and multi-threading to optimize resource usage and response times. The quivr repository uses Celery and Uvicorn for this purpose.

    services:
      quivr-backend:
        command: uvicorn main:app --reload --host 0.0.0.0 --port 5050 --workers 6 --log-level info
      quivr-worker:
        command: celery -A celery_worker worker -l info
      quivr-beat:
        command: celery -A celery_worker beat -l info

By implementing these strategies, you can optimize the performance of the Ollama model running on a local server [1][2].

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 3 months ago

Hi, after applying the optimizations above there hasn't been much improvement in performance; perhaps a more powerful server is needed to run smoothly. I think switching from the llama 7B model to a smaller one, such as qwen, may perform better, so I ran the qwen:1.8b model locally with ollama run qwen:1.8b and changed the brain's model from ollama/llama2 to ollama/qwen:1.8b (trying ollama/qwen reports that the model cannot be found). After that, I @-mentioned the brain and had a conversation, but then noticed that the answer content did not come from the PDF document uploaded to the brain, which does not match the expected behavior. How should we solve this problem?

dosubot[bot] commented 3 months ago

To resolve the issue of the answers not being derived from the uploaded PDF document after switching to the qwen: 1.8b model, you should ensure that the similarity_search function in the CustomSupabaseVectorStore is correctly configured to work with the new model. The issue might be due to the embeddings or the vector store configuration not being compatible with the new model.

Here is the relevant part of the code that performs the similarity search:

vector_store = CustomSupabaseVectorStore(
    supabase_client,
    embeddings,
    table_name="vectors",
    brain_id=str(brain_id),
    number_docs=20,
)
documents = vector_store.similarity_search(question, k=20, threshold=0.8)

Make sure that:

  1. The embeddings are generated using the qwen: 1.8b model.
  2. The CustomSupabaseVectorStore is properly initialized with these embeddings.
  3. The similarity_search function is compatible with the new model's embeddings.
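
As a minimal sketch of the first point (the model name and base URL are taken from this thread, not from the project's settings), the embeddings can be pointed at the local Ollama server like this; note that the embedding dimension must match the one the vectors table was built with:

from langchain_community.embeddings import OllamaEmbeddings

# Generate embeddings against the local Ollama server instead of a remote provider.
embeddings = OllamaEmbeddings(base_url="http://172.17.0.1:11434", model="qwen:1.8b")

query_vector = embeddings.embed_query("test question")
print(len(query_vector))  # dimension must match the vectors already stored in the table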

Additionally, ensure that the PDF processing is correctly handled by the UnstructuredPDFLoader as shown in the process_pdf function:

from langchain_community.document_loaders import UnstructuredPDFLoader
from models import File

from .common import process_file

def process_pdf(
    file: File, brain_id, original_file_name, integration=None, integration_link=None
):
    return process_file(
        file=file,
        loader_class=UnstructuredPDFLoader,
        brain_id=brain_id,
        original_file_name=original_file_name,
        integration=integration,
        integration_link=integration_link,
    )

Ensure that the PDF documents are being correctly processed and loaded into the vector store using the UnstructuredPDFLoader. This will help in making sure that the documents are available for the similarity search and the answers are derived correctly.
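
As a quick sanity check outside the upload pipeline (the file path is illustrative), you can confirm that UnstructuredPDFLoader extracts any text from the PDF at all:

from langchain_community.document_loaders import UnstructuredPDFLoader

docs = UnstructuredPDFLoader("/path/to/your-brain-document.pdf").load()
print(len(docs))                                   # 0 means nothing was extracted
print(docs[0].page_content[:200] if docs else "")  # preview of the extracted text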

Additionally, you should verify that the retrieval and answer generation process is correctly set up in the QuivrRAG class. Here is an example of how the documents are retrieved and the final answer is generated:

compressor = None
if os.getenv("COHERE_API_KEY"):
    compressor = CohereRerank(top_n=20)
else:
    compressor = FlashrankRerank(model="ms-marco-TinyBERT-L-2-v2", top_n=20)

retriever_doc = self.get_retriever()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever_doc
)

loaded_memory = RunnablePassthrough.assign(
    chat_history=RunnableLambda(
        lambda x: self.filter_history(x["chat_history"]),
    ),
    question=lambda x: x["question"],
)

api_base = None
if self.brain_settings.ollama_api_base_url and self.model.startswith("ollama"):
    api_base = self.brain_settings.ollama_api_base_url

standalone_question = {
    "standalone_question": {
        "question": lambda x: x["question"],
        "chat_history": itemgetter("chat_history"),
    }
    | CONDENSE_QUESTION_PROMPT
    | ChatLiteLLM(temperature=0, model=self.model, api_base=api_base)
    | StrOutputParser(),
}

prompt_custom_user = self.prompt_to_use()
prompt_to_use = "None"
if prompt_custom_user:
    prompt_to_use = prompt_custom_user.content

# Now we retrieve the documents
retrieved_documents = {
    "docs": itemgetter("standalone_question") | compression_retriever,
    "question": lambda x: x["standalone_question"],
    "custom_instructions": lambda x: prompt_to_use,
}

final_inputs = {
    "context": lambda x: self._combine_documents(x["docs"]),
    "question": itemgetter("question"),
    "custom_instructions": itemgetter("custom_instructions"),
    "files": lambda x: list_files,
}
llm = ChatLiteLLM(
    max_tokens=self.max_tokens,
    model=self.model,
    temperature=self.temperature,
    api_base=api_base,
)  # pyright: ignore reportPrivateUsage=none
if self.model_compatible_with_function_calling():

    # And finally, we do the part that returns the answers
    llm_function = ChatOpenAI(
        max_tokens=self.max_tokens,
        model=self.model,
        temperature=self.temperature,
    )
    llm = llm_function.bind_tools(
        [cited_answer],
        tool_choice="cited_answer",
    )

answer = {
    "answer": final_inputs | ANSWER_PROMPT | llm,
    "docs": itemgetter("docs"),
}

return loaded_memory | standalone_question | retrieved_documents | answer

Ensure that the retrieval and answer generation process is correctly set up and that the documents are being correctly retrieved and combined to form the context for the final answer. This will help in making sure that the answers are derived correctly from the uploaded PDF documents [1][2][3].

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 3 months ago

Does this mean that the uploaded PDF file was not processed correctly even before switching models, since neither the vectors nor the brains_vectors table has any records? How should this situation be resolved?

dosubot[bot] commented 3 months ago

Yes, if the vector and brain_vector tables are not recorded in the data, it likely means that the uploaded PDF file has not been processed correctly. To resolve this situation, you should ensure that the PDF file is processed correctly and the vectors are recorded in the respective tables.

Here are the steps to ensure proper processing:

  1. Use the process_pdf Function: The process_pdf function in backend/packages/files/parsers/pdf.py should be used to process the PDF file. This function utilizes the UnstructuredPDFLoader to load the PDF and then processes it using the process_file function.

    from langchain_community.document_loaders import UnstructuredPDFLoader
    from models import File
    
    from .common import process_file
    
    def process_pdf(
       file: File, brain_id, original_file_name, integration=None, integration_link=None
    ):
       return process_file(
           file=file,
           loader_class=UnstructuredPDFLoader,
           brain_id=brain_id,
           original_file_name=original_file_name,
           integration=integration,
           integration_link=integration_link,
       )
  2. Ensure Proper Implementation of process_file: The process_file function in backend/packages/files/parsers/common.py handles the detailed processing of the file, including splitting the document into chunks, encoding the text, and creating vectors.

    import asyncio
    import os
    import tempfile
    import time
    
    import nest_asyncio
    import tiktoken
    import uvloop
    from langchain.schema import Document
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from llama_parse import LlamaParse
    from logger import get_logger
    from models import File
    from modules.brain.service.brain_vector_service import BrainVectorService
    from modules.upload.service.upload_file import DocumentSerializable
    from packages.embeddings.vectors import Neurons
    
    if not isinstance(asyncio.get_event_loop(), uvloop.Loop):
       nest_asyncio.apply()
    
    logger = get_logger(__name__)
    
    async def process_file(
       file: File,
       loader_class,
       brain_id,
       original_file_name,
       integration=None,
       integration_link=None,
    ):
       dateshort = time.strftime("%Y%m%d")
       neurons = Neurons()
    
       if os.getenv("LLAMA_CLOUD_API_KEY"):
           doc = file.file
           document_ext = os.path.splitext(doc.filename)[1]
           if document_ext in [".pdf", ".docx", ".doc"]:
               document_tmp = tempfile.NamedTemporaryFile(
                   suffix=document_ext, delete=False
               )
               doc.file.seek(0)
               document_tmp.write(doc.file.read())
    
               parser = LlamaParse(
                   result_type="markdown",
                   parsing_instruction="Extract the tables and transform checkboxes into text. Transform tables to key = value. You can duplicates Keys if needed. For example: Productions Fonts = 300 productions Fonts Company Desktop License = Yes for Maximum of 60 Licensed Desktop users For example checkboxes should be: Premium Activated = Yes License Premier = No If a checkbox is present for a table with multiple options.  Say Yes for the one activated and no for the one not activated. Format using headers.",
                   gpt4o_mode=True,
                   gpt4o_api_key=os.getenv("OPENAI_API_KEY"),
               )
               document_llama_parsed = parser.load_data(document_tmp.name)
               document_tmp.close()
               document_to_langchain = document_llama_parsed[0].to_langchain_format()
               text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
                   chunk_size=file.chunk_size, chunk_overlap=file.chunk_overlap
               )
               document_to_langchain = Document(
                   page_content=document_to_langchain.page_content
               )
               file.documents = text_splitter.split_documents([document_to_langchain])
       else:
           file.compute_documents(loader_class)
    
       metadata = {
           "file_sha1": file.file_sha1,
           "file_size": file.file_size,
           "file_name": file.file_name,
           "chunk_size": file.chunk_size,
           "chunk_overlap": file.chunk_overlap,
           "date": dateshort,
           "original_file_name": original_file_name or file.file_name,
           "integration": integration or "",
           "integration_link": integration_link or "",
       }
       docs = []
    
       enc = tiktoken.get_encoding("cl100k_base")
    
       if file.documents is not None:
           for index, doc in enumerate(file.documents, start=1):
               new_metadata = metadata.copy()
               doc.page_content = f"Filename: {new_metadata['original_file_name']} Content: {doc.page_content}"
               doc.page_content = doc.page_content.replace("\u0000", "")
               len_chunk = len(enc.encode(doc.page_content))
               doc.page_content = doc.page_content.encode("utf-8", "replace").decode("utf-8")
               new_metadata["chunk_size"] = len_chunk
               new_metadata["index"] = index
               doc_with_metadata = DocumentSerializable(
                   page_content=doc.page_content, metadata=new_metadata
               )
               docs.append(doc_with_metadata)
    
       created_vector = neurons.create_vector(docs)
    
       brain_vector_service = BrainVectorService(brain_id)
       for created_vector_id in created_vector:
           result = brain_vector_service.create_brain_vector(
               created_vector_id, metadata["file_sha1"]
           )
           logger.debug(f"Brain vector created: {result}")
    
       if created_vector:
           return len(created_vector)
       else:
           return 0
  3. Check the Neurons Class: Ensure that the Neurons class in backend/packages/embeddings/vectors.py is correctly creating vectors. The create_vector method should add documents to the vector store and return the resulting IDs.

    class Neurons(BaseModel):
       def create_vector(self, docs):
           documents_vector_store = get_documents_vector_store()
    
           try:
               sids = documents_vector_store.add_documents(docs)
               if sids and len(sids) > 0:
                   return sids
    
           except Exception as e:
               logger.error(f"Error creating vector for document {e}")
  4. Verify Database Interactions: Ensure that the BrainsVectors class in backend/modules/brain/repository/brains_vectors.py is correctly interacting with the vector and brain_vector tables. The create_brain_vector method should correctly insert records into the brains_vectors table.

    class BrainsVectors(BrainsVectorsInterface):
       def __init__(self):
           supabase_client = get_supabase_client()
           self.db = supabase_client
    
       def create_brain_vector(self, brain_id, vector_id, file_sha1):
           response = (
               self.db.table("brains_vectors")
               .insert(
                   {
                       "brain_id": str(brain_id),
                       "vector_id": str(vector_id),
                       "file_sha1": file_sha1,
                   }
               )
               .execute()
           )
           return response.data

By following these steps, you can ensure that the PDF file is processed correctly and the vectors are recorded in the vector and brain_vector tables [1][2][3][4].

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 3 months ago

The process_file function in backend/packages/files/parsers/common.py:

  1. UnstructuredPDFLoader is used to load PDF files.
  2. The program did not enter the condition if file.documents is not None, so docs stays empty and the documents are not vectorized. Is this related to not setting the environment variable LLAMA_CLOUD_API_KEY?

     if file.documents is not None:
         for index, doc in enumerate(file.documents, start=1):
             new_metadata = metadata.copy()
             doc.page_content = f"Filename: {new_metadata['original_file_name']} Content: {doc.page_content}"
             doc.page_content = doc.page_content.replace("\u0000", "")
             len_chunk = len(enc.encode(doc.page_content))
             doc.page_content = doc.page_content.encode("utf-8", "replace").decode("utf-8")
             new_metadata["chunk_size"] = len_chunk
             new_metadata["index"] = index
             doc_with_metadata = DocumentSerializable(
                 page_content=doc.page_content, metadata=new_metadata
             )
             docs.append(doc_with_metadata)

dosubot[bot] commented 3 months ago

The issue with the process_file function not entering the condition if file.documents is not None is indeed related to the environment variable LLAMA_CLOUD_API_KEY. The function checks for the presence of this environment variable to decide whether to process the file using the LlamaParse parser or to use the compute_documents method.

To ensure the documents are processed and vectorized correctly, you need to set the LLAMA_CLOUD_API_KEY environment variable. You can set this environment variable in your operating system or within your application. Here is an example of how to set it in a Unix-based system:

export LLAMA_CLOUD_API_KEY=your_api_key_here

Or, if you are using a Python script, you can set it at the beginning of your script:

import os

os.environ['LLAMA_CLOUD_API_KEY'] = 'your_api_key_here'

If the issue persists even after setting the environment variable, you should ensure that the file.documents attribute is being correctly assigned within the process_file function. This involves verifying that the text_splitter.split_documents method is functioning as expected and that the file.documents attribute is being populated correctly [1].
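
For instance, a small self-contained check (chunk sizes are illustrative) can confirm that the splitter itself produces chunks, independent of the rest of the pipeline:

from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=50
)
chunks = splitter.split_documents(
    [Document(page_content="text extracted from the PDF " * 200)]
)
print(len(chunks))  # should be > 0 if splitting works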

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 3 months ago

After configuring LLAMA_CLOUD_API_KEY, the vectors and brains_vectors tables now have data, but the answers in the conversation still don't come from the brain's PDF file. I added log prints around the QuivrRAG answer chain, and from the log I can see that the switched model (ollama/qwen:1.8b) is already being used; I haven't found any other configuration issues. I'm not very clear on the specific processing flow of the answer. How can I investigate further and find the real reason why the answer is not coming from the brain's PDF file?

code:

    ......
    print(f"========xxxxxxxxxxxx=====get_chain, loaded_memory:{loaded_memory}, standalone_question: {standalone_question}, retrieved_documents:{retrieved_documents}, answer:{answer}")
    return loaded_memory | standalone_question | retrieved_documents | answer

The log is as follows:

    ========QuivrRAG.get_chain, loaded_memory: mapper={
    backend-core  |   chat_history: RunnableLambda(lambda x: self.filter_history(x['chat_history'])),
    backend-core  |   question: RunnableLambda(lambda x: x['question'])
    backend-core  | }, standalone_question: {'standalone_question': {
    backend-core  |     question: RunnableLambda(...),
    backend-core  |     chat_history: RunnableLambda(itemgetter('chat_history'))
    backend-core  |   }
    backend-core  |   | PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language. Keep as much details as possible from previous messages. Keep entity names and all.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:')
    backend-core  |   | ChatLiteLLM(client=<module 'litellm' from '/usr/local/lib/python3.11/site-packages/litellm/__init__.py'>, model='ollama/qwen:1.8b', openai_api_key='tk-aabbccddAABBCCDDEeFfGgHhIiJKLmnopjklMNOPqQqQqQqQ', azure_api_key='', anthropic_api_key='null', replicate_api_key='', cohere_api_key='', openrouter_api_key='', api_base='http://172.17.0.1:11434', temperature=0.0, huggingface_api_key='', together_ai_api_key='')
    backend-core  |   | StrOutputParser()}, retrieved_documents: {'docs': RunnableLambda(itemgetter('standalone_question'))
    backend-core  |   | ContextualCompressionRetriever(base_compressor=FlashrankRerank(client=<flashrank.Ranker.Ranker object at 0x7f23b931af90>, top_n=10, model='ms-marco-TinyBERT-L-2-v2'), base_retriever=VectorStoreRetriever(tags=['CustomSupabaseVectorStore', 'OllamaEmbeddings'], vectorstore=<vectorstore.supabase.CustomSupabaseVectorStore object at 0x7f23d86a1310>)), 'question': <function QuivrRAG.get_chain.<locals>.<lambda> at 0x7f23b94045e0>, 'custom_instructions': <function QuivrRAG.get_chain.<locals>.<lambda> at 0x7f23b9313920>}, answer: {'answer': {
    backend-core  |     context: RunnableLambda(...),
    backend-core  |     question: RunnableLambda(itemgetter('question')),
    backend-core  |     custom_instructions: RunnableLambda(itemgetter('custom_instructions'))
    backend-core  |   }
    backend-core  |   | ChatPromptTemplate(input_variables=['context', 'custom_instructions', 'question'], messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['custom_instructions'], template="\nWhen answering use markdown to make it concise and neat.\nUse the following pieces of context from files provided by the user that are stored in a brain to answer the user question in the same language as the user question. Your name is Quivr. You're a helpful assistant.\nIf you don't know the answer with the context provided from the files, just say that you don't know, don't try to make up an answer.\nUser instruction to follow, if provided, to answer: {custom_instructions}\n
    backend-core  |   | ChatLiteLLM(client=<module 'litellm' from '/usr/local/lib/python3.11/site-packages/litellm/__init__.py'>, model='ollama/qwen:1.8b', openai_api_key='tk-aabbccddAABBCCDDEeFfGgHhIiJKLmnopjklMNOPqQqQqQqQ', azure_api_key='', anthropic_api_key='null', replicate_api_key='', cohere_api_key='', openrouter_api_key='', api_base='http://172.17.0.1:11434', temperature=0.1, max_tokens=1000, huggingface_api_key='', together_ai_api_key=''), 'docs': operator.itemgetter('docs')}

dosubot[bot] commented 3 months ago

To further investigate the problem and find out the real reason why the answer is not coming from the brain's PDF file, you should start by checking the process_file function in the common module. The process_pdf function relies on process_file to handle the PDF processing using the UnstructuredPDFLoader.

Here is the relevant code snippet from backend/packages/files/parsers/pdf.py:

from langchain_community.document_loaders import UnstructuredPDFLoader
from models import File

from .common import process_file

def process_pdf(
    file: File,
    brain_id,
    original_file_name,
    integration=None,
    integration_link=None,
):
    return process_file(
        file=file,
        loader_class=UnstructuredPDFLoader,
        brain_id=brain_id,
        original_file_name=original_file_name,
        integration=integration,
        integration_link=integration_link,
    )

You should verify the following:

  1. Check the process_file function: Ensure that it correctly processes the file and handles the UnstructuredPDFLoader.
  2. Verify the UnstructuredPDFLoader: Make sure that this loader is correctly implemented and capable of extracting the necessary information from the PDF.
  3. Debugging and Logging: Add logging or debugging statements in the process_file function to trace the flow of data and identify where it might be failing.
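
For step 3, a single hypothetical debug line placed in process_file just before neurons.create_vector(docs) would show whether the loader produced any chunks at all; it relies on the logger, file, and docs objects that are already in scope in that function:

# Hypothetical debug line for process_file (uses objects already in scope there):
logger.debug(
    f"process_file: file={file.file_name}, "
    f"loaded_documents={0 if file.documents is None else len(file.documents)}, "
    f"docs_to_vectorize={len(docs)}"
)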

Additionally, you should inspect the CompositeBrainQA class in backend/modules/brain/composite_brain_qa.py to ensure that the document retrieval and question answering processes are functioning correctly. Here is a relevant snippet:

class CompositeBrainQA(
    KnowledgeBrainQA,
):
    user_id: UUID

    def __init__(
        self,
        model: str,
        brain_id: str,
        chat_id: str,
        streaming: bool = False,
        prompt_id: Optional[UUID] = None,
        **kwargs,
    ):
        user_id = kwargs.get("user_id")
        if not user_id:
            raise HTTPException(status_code=400, detail="Cannot find user id")

        super().__init__(
            model=model,
            brain_id=brain_id,
            chat_id=chat_id,
            streaming=streaming,
            prompt_id=prompt_id,
            **kwargs,
        )
        self.user_id = user_id

    def get_answer_generator_from_brain_type(self, brain: BrainEntity):
        if brain.brain_type == BrainType.COMPOSITE:
            return self.generate_answer
        elif brain.brain_type == BrainType.API:
            return APIBrainQA(
                brain_id=str(brain.id),
                chat_id=self.chat_id,
                model=self.model,
                max_tokens=self.max_tokens,
                temperature=self.temperature,
                streaming=self.streaming,
                prompt_id=self.prompt_id,
                user_id=str(self.user_id),
                raw=brain.raw,
                jq_instructions=brain.jq_instructions,
            ).generate_answer
        elif brain.brain_type == BrainType.DOC:
            return KnowledgeBrainQA(
                brain_id=str(brain.id),
                chat_id=self.chat_id,
                max_tokens=self.max_tokens,
                temperature=self.temperature,
                streaming=self.streaming,
                prompt_id=self.prompt_id,
            ).generate_answer

    def generate_answer(
        self, chat_id: UUID, question: ChatQuestion, save_answer: bool
    ) -> str:
        brain = brain_service.get_brain_by_id(question.brain_id)

        connected_brains = brain_service.get_connected_brains(self.brain_id)
        if not connected_brains:
            response = HeadlessQA(
                chat_id=chat_id,
                model=self.model,
                max_tokens=self.max_tokens,
                temperature=self.temperature,
                streaming=self.streaming,
                prompt_id=self.prompt_id,
            ).generate_answer(chat_id, question, save_answer=False)
            if save_answer:
                new_chat = chat_service.update_chat_history(
                    CreateChatHistory(
                        **{
                            "chat_id": chat_id,
                            "user_message": question.question,
                            "assistant": response.assistant,
                            "brain_id": question.brain_id,
                            "prompt_id": self.prompt_to_use_id,
                        }
                    )
                )
                return GetChatHistoryOutput(
                    **{
                        "chat_id": chat_id,
                        "user_message": question.question,
                        "assistant": response.assistant,
                        "message_time": new_chat.message_time,
                        "prompt_title": (
                            self.prompt_to_use.title if self.prompt_to_use else None
                        ),
                        "brain_name": brain.name,
                        "message_id": new_chat.message_id,
                        "brain_id": str(brain.id),
                    }
                )
            return GetChatHistoryOutput(
                **{
                    "chat_id": chat_id,
                    "user_message": question.question,
                    "assistant": response.assistant,
                    "message_time": None,
                    "prompt_title": (
                        self.prompt_to_use.title if self.prompt_to_use else None
                    ),
                    "brain_name": brain.name,
                    "message_id": None,
                    "brain_id": str(brain.id),
                }
            )
        tools = []
        available_functions = {}

        connected_brains_details = {}
        for connected_brain_id in connected_brains:
            connected_brain = brain_service.get_brain_by_id(connected_brain_id)
            if connected_brain is None:
                continue

            tools.append(format_brain_to_tool(connected_brain))

            available_functions[connected_brain_id] = (
                self.get_answer_generator_from_brain_type(connected_brain)
            )

            connected_brains_details[str(connected_brain.id)] = connected_brain

        CHOOSE_BRAIN_FROM_TOOLS_PROMPT = (
            "Based on the provided user content, find the most appropriate tools to answer"
            + "If you can't find any tool to answer and only then, and if you can answer without using any tool. In that case, let the user know that you are not using any particular brain (i.e tool) "
        )

        messages = [{"role": "system", "content": CHOOSE_BRAIN_FROM_TOOLS_PROMPT}]

        history = chat_service.get_chat_history(self.chat_id)

        for message in history:
            formatted_message = [
                {"role": "user", "content": message.user_message},
                {"role": "assistant", "content": message.assistant},
            ]
            messages.extend(formatted_message)

        messages.append({"role": "user", "content": question.question})

        response = completion(
            model="gpt-3.5-turbo-0125",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )

        brain_completion_output = self.make_recursive_tool_calls(
            messages,
            question,
            chat_id,
            tools,
            available_functions,
            recursive_count=0,
            last_completion_response=response.choices[0],
        )
        if brain_completion_output:
            answer = brain_completion_output.response.message.content
            new_chat = None
            if save_answer:
                new_chat = chat_service.update_chat_history(
                    CreateChatHistory(
                        **{
                            "chat_id": chat_id,
                            "user_message": question.question,
                            "assistant": answer,
                            "brain_id": question.brain_id,
                            "prompt_id": self.prompt_to_use_id,
                        }
                    )
                )
            return GetChatHistoryOutput(
                **{
                    "chat_id": chat_id,
                    "user_message": question.question,
                    "assistant": brain_completion_output.response.message.content,
                    "message_time": new_chat.message_time if new_chat else None,
                    "prompt_title": (
                        self.prompt_to_use.title if self.prompt_to_use else None
                    ),
                    "brain_name": brain.name if brain else None,
                    "message_id": new_chat.message_id if new_chat else None,
                    "brain_id": str(brain.id) if brain else None,
                }
            )

Ensure that the generate_answer method correctly retrieves and processes the documents from the brain's knowledge base. Verify that the brain_service is correctly fetching the PDF files and that the make_recursive_tool_calls method is functioning as expected.

Additionally, you should inspect the get_question_context_from_brain function in backend/modules/brain/service/get_question_context_from_brain.py to ensure that the vector store is correctly searching for and retrieving relevant documents based on the question. Here is the relevant snippet:

from uuid import UUID

from attr import dataclass
from logger import get_logger
from models.settings import get_embeddings, get_supabase_client
from modules.upload.service.generate_file_signed_url import generate_file_signed_url
from vectorstore.supabase import CustomSupabaseVectorStore

logger = get_logger(__name__)

@dataclass
class DocumentAnswer:
    file_name: str
    file_sha1: str
    file_size: int
    file_url: str = ""
    file_id: str = ""
    file_similarity: float = 0.0

def get_question_context_from_brain(brain_id: UUID, question: str) -> str:
    """Finds the best brain to answer the question based on the question's meaning.

    Args:
        brain_id (UUID): Id of the brain to search in
        question (str): Question to search for in the vector store

    Returns:
        str: _descripton_
    """
    # TODO: Move to AnswerGenerator service
    supabase_client = get_supabase_client()
    embeddings = get_embeddings()

    vector_store = CustomSupabaseVectorStore(
        supabase_client,
        embeddings,
        table_name="vectors",
        brain_id=str(brain_id),
        number_docs=20,
    )
    documents = vector_store.similarity_search(question, k=20, threshold=0.8)

    answers = []
    file_sha1s = []
    for document in documents:
        if document.metadata["file_sha1"] not in file_sha1s:
            file_sha1s.append(document.metadata["file_sha1"])
            file_path_in_storage = f"{brain_id}/{document.metadata['file_name']}"
            answers.append(
                DocumentAnswer(
                    file_name=document.metadata["file_name"],
                    file_sha1=document.metadata["file_sha1"],
                    file_size=document.metadata["file_size"],
                    file_id=document.metadata["id"],
                    file_similarity=document.metadata["similarity"],
                    file_url=generate_file_signed_url(file_path_in_storage).get(
                        "signedURL", ""
                    ),
                ),
            )

    return answers

Ensure that the get_question_context_from_brain function is correctly retrieving documents from the vector store and that the CustomSupabaseVectorStore is functioning as expected. Verify that the documents retrieved are relevant to the question and that the PDF files are being correctly processed and included in the search results [1][2][3][4].
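
As a hypothetical smoke test (the brain UUID and question are placeholders), you can call the helper above directly from a Python shell inside the backend container and inspect what the vector store returns:

from modules.brain.service.get_question_context_from_brain import (
    get_question_context_from_brain,
)

results = get_question_context_from_brain(
    brain_id="00000000-0000-0000-0000-000000000000",  # replace with your brain's UUID
    question="A question whose answer is in the uploaded PDF",
)
for r in results:
    print(r.file_name, r.file_similarity)  # empty output means retrieval found nothing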

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 3 months ago

I added print logs at the entry points of the functions in CompositeBrainQA for debugging. However, even though the front end already displayed a reply, the newly added logs were never printed, so it seems this class is being skipped entirely. Is there another branch that handles the processing? (The current conversation's brain type is doc.)

dosubot[bot] commented 3 months ago

Yes, there is another branch processing available in CompositeBrainQA that could cause the processing to skip this class when the brain type is doc. Specifically, in the get_answer_generator_from_brain_type method, if the brain.brain_type is BrainType.DOC, the method returns the generate_answer method from the KnowledgeBrainQA class instead of the CompositeBrainQA class.

Here is the relevant code snippet:

def get_answer_generator_from_brain_type(self, brain: BrainEntity):
    if brain.brain_type == BrainType.COMPOSITE:
        return self.generate_answer
    elif brain.brain_type == BrainType.API:
        return APIBrainQA(
            brain_id=str(brain.id),
            chat_id=self.chat_id,
            model=self.model,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
            streaming=self.streaming,
            prompt_id=self.prompt_id,
            user_id=str(self.user_id),
            raw=brain.raw,
            jq_instructions=brain.jq_instructions,
        ).generate_answer
    elif brain.brain_type == BrainType.DOC:
        return KnowledgeBrainQA(
            brain_id=str(brain.id),
            chat_id=self.chat_id,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
            streaming=self.streaming,
            prompt_id=self.prompt_id,
        ).generate_answer

In this code, when brain.brain_type is BrainType.DOC, the generate_answer method from KnowledgeBrainQA is used, effectively skipping the CompositeBrainQA class [1].

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 3 months ago

I manually changed the type to doc in the database, because I found there was no way to set other types when creating a brain in the front end; it only defaults to integration. What are the differences between the four brain types (doc, API, composite, integration)? Is this related to the issue above where the reply did not originate from the document?

dosubot[bot] commented 3 months ago

The differences between the four types of brain (doc, API, composite, integration) are as follows:

  1. DOC: This type of brain uses the KnowledgeBrainQA class to generate answers. It is designed to work with documents and knowledge bases to provide responses based on the content of those documents.

    elif brain.brain_type == BrainType.DOC:
        return KnowledgeBrainQA(
            brain_id=str(brain.id),
            chat_id=self.chat_id,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
            streaming=self.streaming,
            prompt_id=self.prompt_id,
        ).generate_answer
  2. API: This type of brain uses the APIBrainQA class to generate answers. It is designed to interact with external APIs to fetch and provide responses.

    elif brain.brain_type == BrainType.API:
        return APIBrainQA(
            brain_id=str(brain.id),
            chat_id=self.chat_id,
            model=self.model,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
            streaming=self.streaming,
            prompt_id=self.prompt_id,
            user_id=str(self.user_id),
            raw=brain.raw,
            jq_instructions=brain.jq_instructions,
        ).generate_answer
  3. COMPOSITE: This type of brain uses the CompositeBrainQA class itself to generate answers. It can combine multiple sources or methods to provide a comprehensive response.

    if brain.brain_type == BrainType.COMPOSITE:
        return self.generate_answer
  4. INTEGRATION: This type of brain uses the IntegrationBrainQA class to generate answers. It involves integrating various services or data sources to generate responses.

    elif brain.brain_type == BrainType.INTEGRATION:
        return IntegrationBrainQA(
            brain_id=str(brain.id),
            chat_id=self.chat_id,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
            streaming=self.streaming,
            prompt_id=self.prompt_id,
            integration=brain.integration,
            integration_description=brain.integration_description,
        ).generate_answer

Regarding the issue where the reply did not originate from the document, it could be related to the type of brain being used. If a COMPOSITE brain is used, it might combine multiple sources, which could lead to a response that does not strictly originate from the document. Similarly, an API brain might fetch information from an external API, leading to a response that is not document-based. An INTEGRATION brain could also pull data from various integrated services, resulting in a reply that may not be solely document-based.

Additionally, the Brain type definition includes a documents field, which indicates that a brain can have associated documents. If the brain type is not DOC, the response might not be strictly based on these documents, leading to the observed issue [1][2][3][4].

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 3 months ago

1) To further investigate why the reply did not come from a PDF file in my brain, I added a log print in the for loop of the generate_source method in knowledge_brain_qa.py. The debug log is as follows:

    ==========KnowledgeBrainQA==generate_source==Processing source document Plan Artemis Sept 2020.pdf
    ==========KnowledgeBrainQA==generate_source==Skipping source document Plan Artemis Sept ...

My analysis suggests this may be related to model_compatible_with_function_calling in QuivrRAG. Do I need to add the models "llama2" and "qwen:1.8b" to this method?

def model_compatible_with_function_calling(self):
    if self.model in [
        "gpt-4o",
        "gpt-4-turbo",
        "gpt-4-turbo-2024-04-09",
        "gpt-4-turbo-preview",
        "gpt-4-0125-preview",
        "gpt-4-1106-preview",
        "gpt-4",
        "gpt-4-0613",
        "gpt-3.5-turbo",
        "gpt-3.5-turbo-0125",
        "gpt-3.5-turbo-1106",
        "gpt-3.5-turbo-0613",
    ]:
        return True
    return False

2) The generate_stream method of knowledge_brain_qa.py passes through this line (...KnowledgeBrainQA======generate_stream 44...). The printed log shows that the answer output here does not come from the documents in the brain. How does this asynchronous callback return the answer? Where does it come from?

async def generate_stream(
    self, chat_id: UUID, question: ChatQuestion, save_answer: bool = True
) -> AsyncIterable:
    print(f"...")  # debug prints added for this investigation; their contents are omitted here
    conversational_qa_chain = self.knowledge_qa.get_chain()
    transformed_history, streamed_chat_history = (
        self.initialize_streamed_chat_history(chat_id, question)
    )
    response_tokens = ""
    sources = []
    citations = []
    first = True
    config = {"metadata": {"conversation_id": str(chat_id)}}
    async for chunk in conversational_qa_chain.astream(
        {
            "question": question.question,
            "chat_history": transformed_history,
            "custom_personality": (
                self.prompt_to_use.content if self.prompt_to_use else None
            ),
        },
        config=config,
    ):
        if not streamed_chat_history.metadata:
            streamed_chat_history.metadata = {}
        print(f"...")
        if self.model_compatible_with_function_calling(model=self.model):
            print(f"...")
            if chunk.get("answer"):
                if first:
                    gathered = chunk["answer"]
                    first = False
                    print(f"...")
                else:
                    gathered = gathered + chunk["answer"]
                    if (
                        gathered.tool_calls
                        and gathered.tool_calls[-1].get("args")
                        and "answer" in gathered.tool_calls[-1]["args"]
                    ):
                        # Only send the difference between answer and response_tokens which was the previous answer
                        answer = gathered.tool_calls[-1]["args"]["answer"]
                        difference = answer[len(response_tokens):]
                        streamed_chat_history.assistant = difference
                        response_tokens = answer
                        print(f"...")
                        yield f"data: {json.dumps(streamed_chat_history.dict())}"
                    if (
                        gathered.tool_calls
                        and gathered.tool_calls[-1].get("args")
                        and "citations" in gathered.tool_calls[-1]["args"]
                    ):
                        citations = gathered.tool_calls[-1]["args"]["citations"]
                        print(f"...")
        else:
            if chunk.get("answer"):
                response_tokens += chunk["answer"].content
                streamed_chat_history.assistant = chunk["answer"].content
                yield f"data: {json.dumps(streamed_chat_history.dict())}"
                print(f"...")

        if chunk.get("docs"):
            sources = chunk["docs"]
            print(f"...")

    sources_list = generate_source(sources, self.brain_id, citations)
    # Serialize the sources list
    serialized_sources_list = [source.dict() for source in sources_list]
    streamed_chat_history.metadata["sources"] = serialized_sources_list
    yield f"data: {json.dumps(streamed_chat_history.dict())}"
    self.save_answer(question, response_tokens, streamed_chat_history, save_answer)

dosubot[bot] commented 3 months ago

  1. Adding Models to model_compatible_with_function_calling: Yes, you should add "llama2" and "qwen: 1.8b" to the model_compatible_with_function_calling method to ensure that these models are recognized as compatible with function calling. Here is the updated method:

    def model_compatible_with_function_calling(self):
       if self.model in [
           "gpt-4o",
           "gpt-4-turbo",
           "gpt-4-turbo-2024-04-09",
           "gpt-4-turbo-preview",
           "gpt-4-0125-preview",
           "gpt-4-1106-preview",
           "gpt-4",
           "gpt-4-0613",
           "gpt-3.5-turbo",
           "gpt-3.5-turbo-0125",
           "gpt-3.5-turbo-1106",
           "gpt-3.5-turbo-0613",
           "llama2",
           "qwen: 1.8b",
       ]:
           return True
       return False
  2. Understanding the generate_stream Method: The generate_stream method in knowledge_based_brain_qa.py is responsible for generating a stream of responses asynchronously. The method uses the conversational_qa_chain to process the question and chat history, and it yields data as chunks of responses are generated.

    Here is a breakdown of how the asynchronous callback returns the answer:

    • Initialization: The method initializes the conversational QA chain and transforms the chat history.
    • Streaming Responses: It iterates over the chunks of responses generated by the conversational_qa_chain.
    • Handling Chunks: For each chunk, it checks if the model is compatible with function calling. If so, it processes the chunk to extract the answer, citations, follow-up questions, and thoughts.
    • Yielding Data: It yields the updated chat history as a JSON string.
    • Saving the Answer: Finally, it saves the answer using the save_answer method.

    Here is the relevant code snippet with added comments for clarity:

    async def generate_stream(
       self, chat_id: UUID, question: ChatQuestion, save_answer: bool = True
    ) -> AsyncIterable:
       conversational_qa_chain = self.knowledge_qa.get_chain()
       transformed_history, streamed_chat_history = (
           self.initialize_streamed_chat_history(chat_id, question)
       )
       response_tokens = ""
       sources = []
       citations = []
       first = True
       config = {"metadata": {"conversation_id": str(chat_id)}}
    
       async for chunk in conversational_qa_chain.astream(
           {
               "question": question.question,
               "chat_history": transformed_history,
               "custom_personality": (
                   self.prompt_to_use.content if self.prompt_to_use else None
               ),
           },
           config=config,
       ):
           if not streamed_chat_history.metadata:
               streamed_chat_history.metadata = {}
    
           if self.model_compatible_with_function_calling(model=self.model):
               if chunk.get("answer"):
                   if first:
                       gathered = chunk["answer"]
                       first = False
                   else:
                       gathered = gathered + chunk["answer"]
                       if (
                           gathered.tool_calls
                           and gathered.tool_calls[-1].get("args")
                           and "answer" in gathered.tool_calls[-1]["args"]
                       ):
                           answer = gathered.tool_calls[-1]["args"]["answer"]
                           difference = answer[len(response_tokens):]
                           streamed_chat_history.assistant = difference
                           response_tokens = answer
                           yield f"data: {json.dumps(streamed_chat_history.dict())}"
                           if (
                               gathered.tool_calls
                               and gathered.tool_calls[-1].get("args")
                               and "citations" in gathered.tool_calls[-1]["args"]
                           ):
                               citations = gathered.tool_calls[-1]["args"]["citations"]
                           if (
                               gathered.tool_calls
                               and gathered.tool_calls[-1].get("args")
                               and "followup_questions" in gathered.tool_calls[-1]["args"]
                           ):
                               followup_questions = gathered.tool_calls[-1]["args"][
                                   "followup_questions"
                               ]
                               streamed_chat_history.metadata["followup_questions"] = (
                                   followup_questions
                               )
                           if (
                               gathered.tool_calls
                               and gathered.tool_calls[-1].get("args")
                               and "thoughts" in gathered.tool_calls[-1]["args"]
                           ):
                               thoughts = gathered.tool_calls[-1]["args"]["thoughts"]
                               streamed_chat_history.metadata["thoughts"] = thoughts
           else:
               if chunk.get("answer"):
                   response_tokens += chunk["answer"].content
                   streamed_chat_history.assistant = chunk["answer"].content
                   yield f"data: {json.dumps(streamed_chat_history.dict())}"
    
           if chunk.get("docs"):
               sources = chunk["docs"]
    
       sources_list = generate_source(sources, self.brain_id, citations)
       serialized_sources_list = [source.dict() for source in sources_list]
       streamed_chat_history.metadata["sources"] = serialized_sources_list
       yield f"data: {json.dumps(streamed_chat_history.dict())}"
       self.save_answer(question, response_tokens, streamed_chat_history, save_answer)

    The answer comes from the conversational_qa_chain, which processes the question and chat history to generate a response. The generate_source method is used to manage the sources and citations within this method [1][2].
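
A note on point 1 above: the method is defined there as model_compatible_with_function_calling(self), while generate_stream calls it as self.model_compatible_with_function_calling(model=self.model). If your copy of the code uses the keyword-argument call, a minimal sketch of a call-compatible variant (an illustration only, not the upstream implementation) is:

    def model_compatible_with_function_calling(self, model: str | None = None) -> bool:
        # Accept an explicit model name and fall back to the instance's configured model,
        # so both call styles that appear in this thread keep working.
        model_to_check = model or self.model
        return model_to_check in {
            "gpt-4o",
            "gpt-4-turbo",
            "gpt-4-turbo-2024-04-09",
            "gpt-4-turbo-preview",
            "gpt-4-0125-preview",
            "gpt-4-1106-preview",
            "gpt-4",
            "gpt-4-0613",
            "gpt-3.5-turbo",
            "gpt-3.5-turbo-0125",
            "gpt-3.5-turbo-1106",
            "gpt-3.5-turbo-0613",
            "llama2",
            "qwen:1.8b",
        }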

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 3 months ago

When add "ollama/llama2" and "ollama/qwen:1.8b" to the model_compatible_with_function_calling method to ensure that these models are recognized as compatible with function calling. From the debug logs, it can be seen that OpenAI has been called (a connection error occurred due to the server being unable to connect to OpenAI), but I would prefer to use the local Ollama instead. There was no exception for this openai before modifying model_compatible_with_function_calling,So after the modification (with the addition of olama/lama2 and olama/qwen: 1.8b models in model_compatible_with'function_calling), why not call the local olama and instead call openai?

here is the code of model_compatible_with_function_calling:

def model_compatible_with_function_calling(self):
    if self.model in [
        "gpt-4o",
        "gpt-4-turbo",
        "gpt-4-turbo-2024-04-09",
        "gpt-4-turbo-preview",
        "gpt-4-0125-preview",
        "gpt-4-1106-preview",
        "gpt-4",
        "gpt-4-0613",
        "gpt-3.5-turbo",
        "gpt-3.5-turbo-0125",
        "gpt-3.5-turbo-1106",
        "gpt-3.5-turbo-0613",
            "ollama/llama2",
            "ollama/qwen:1.8b"
    ]:
        return True
    return False

here is the log:

|File "/usr/local/lib/python3.11/site packages/openai/_base_client. py", line 1615, in _retry_request
|Return await self_request(
|^^^^^^^^^^^^^^^^^^^^
|File "/usr/local/lib/python3.11/site packages/openai/_base_client. py", line 1556, in _request
|Raise APIConnectionError (request=request) from err
|Openai APIConnectionError: Connection error.

Log of ChatLiteLLM:

    ChatLiteLLM(client=<module 'litellm' from '/usr/local/lib/python3.11/site-packages/litellm/__init__.py'>, model='ollama/qwen:1.8b', openai_api_key='tk-aabbccddAABBCCDDEeFfGgHhIiJKLmnopjklMNOPqQqQqQqQ', azure_api_key='', anthropic_api_key='null', replicate_api_key='', cohere_api_key='', openrouter_api_key='', api_base='http://172.17.0.1:11434', temperature=0.0, huggingface_api_key='', together_ai_api_key='')

dosubot[bot] commented 3 months ago

The issue of OpenAI being called instead of the local Ollama after adding "ollama/llama2" and "ollama/qwen:1.8b" to the model_compatible_with_function_calling method is likely due to the OLLAMA_API_BASE_URL not being properly configured or activated.

To ensure that the local Ollama is called, follow these steps:

  1. Uncomment and Set the OLLAMA_API_BASE_URL: Make sure that the OLLAMA_API_BASE_URL is uncommented and correctly set in your .env file. This URL should point to the local Ollama API (a quick sanity check for this configuration is shown after these steps).

    # LOCAL
    OLLAMA_API_BASE_URL=http://host.docker.internal:11434 # Uncomment to activate ollama. This is the local url for the ollama api
  2. Verify the _create_llm Method: Ensure that the _create_llm method in the HeadlessQA class is correctly configured to use the local Ollama API when the model name starts with "ollama".

    def _create_llm(
       self,
       model,
       temperature=0,
       streaming=False,
       callbacks=None,
    ) -> BaseChatModel:
       api_base = None
       if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
           api_base = self.brain_settings.ollama_api_base_url
           logger.debug("Creating Ollama LLM with model: %s and API base: %s", model, api_base)
       else:
           logger.debug("Creating OpenAI LLM with model: %s", model)
    
       return ChatLiteLLM(
           temperature=temperature,
           model=model,
           streaming=streaming,
           verbose=True,
           callbacks=callbacks,
           max_tokens=self.max_tokens,
           api_base=api_base,
       )
  3. Check the Fallback Mechanism: Ensure that the fallback mechanism is not defaulting to OpenAI due to a misconfiguration. The embeddings property in the QuivrRAG class should prioritize Ollama if the OLLAMA_API_BASE_URL is set.

    @property
    def embeddings(self):
       if self.brain_settings.ollama_api_base_url:
           return OllamaEmbeddings(
               base_url=self.brain_settings.ollama_api_base_url
           )  # pyright: ignore reportPrivateUsage=none
       else:
           return OpenAIEmbeddings()
  4. Enable Debug Logging: Use the get_logger function to add debug statements in the code to trace why OpenAI is being called. This will help you understand the decision-making process.

    from logger import get_logger
    
    logger = get_logger(__name__)
    
    def _create_llm(
       self,
       model,
       temperature=0,
       streaming=False,
       callbacks=None,
    ) -> BaseChatModel:
       api_base = None
       if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
           api_base = self.brain_settings.ollama_api_base_url
           logger.debug("Creating Ollama LLM with model: %s and API base: %s", model, api_base)
       else:
           logger.debug("Creating OpenAI LLM with model: %s", model)
    
       return ChatLiteLLM(
           temperature=temperature,
           model=model,
           streaming=streaming,
           verbose=True,
           callbacks=callbacks,
           max_tokens=self.max_tokens,
           api_base=api_base,
       )

By ensuring that the OLLAMA_API_BASE_URL is correctly set and the _create_llm method is properly configured, you should be able to use the local Ollama instead of defaulting to OpenAI [1][2][3].
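
As a quick sanity check of steps 1 and 2, you can call LiteLLM directly from inside the backend-core container and confirm that an "ollama/..." model is routed to the local API base instead of to OpenAI. This is a standalone sketch, not part of Quivr; it assumes the litellm package is installed in the container and that OLLAMA_API_BASE_URL is exported there:

    import os

    from litellm import completion

    # Point the request explicitly at the local Ollama server. If this succeeds while the
    # application still tries to reach api.openai.com, the problem is in how the app builds
    # its chain, not in the Ollama setup itself.
    api_base = os.environ.get("OLLAMA_API_BASE_URL", "http://host.docker.internal:11434")

    response = completion(
        model="ollama/llama2",
        api_base=api_base,
        messages=[{"role": "user", "content": "Reply with the single word: pong"}],
    )
    print(response.choices[0].message.content)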

To continue talking to Dosu, mention @dosu.

andyzhangwp commented 3 months ago

The logs for _create_llm were not printed, but the logs for get_chain and embeddings were printed and show the local URL and model name:

    ===========embeddings====base_url= http://172.17.0.1:11434.
    ...
    backend-core | ===========get_chain===api_base= http://172.17.0.1:11434.
    backend-core | ===========get_chain====model= ollama/qwen:1.8b.

and Ollama is running:

curl http://172.17.0.1:11434
Ollama is running[root@iZ2ze4s4xazmovh4mc9g9pZ brain]# 

openai exception log: backend-core | | Traceback (most recent call last): backend-core | | File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 261, in wrap backend-core | | await func() backend-core | | File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 250, in stream_response backend-core | | async for chunk in self.body_iterator: backend-core | | File "/code/modules/brain/knowledge_brain_qa.py", line 352, in generate_stream backend-core | | async for chunk in conversational_qa_chain.astream( backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 2900, in astream backend-core | | async for chunk in self.atransform(input_aiter(), config, kwargs): backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 2883, in atransform backend-core | | async for chunk in self._atransform_stream_with_config( backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 1980, in _atransform_stream_with_config backend-core | | chunk: Output = await asyncio.create_task( # type: ignore[call-arg] backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 2853, in _atransform backend-core | | async for output in final_pipeline: backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 3315, in atransform backend-core | | async for chunk in self._atransform_stream_with_config( backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 1980, in _atransform_stream_with_config backend-core | | chunk: Output = await asyncio.create_task( # type: ignore[call-arg] backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 3302, in _atransform backend-core | | chunk = AddableDict({step_name: task.result()}) backend-core | | ^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 3285, in get_next_chunk backend-core | | return await py_anext(generator) backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 2883, in atransform backend-core | | async for chunk in self._atransform_stream_with_config( backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 1980, in _atransform_stream_with_config backend-core | | chunk: Output = await asyncio.create_task( # type: ignore[call-arg] backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 2853, in _atransform backend-core | | async for output in final_pipeline: backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 4748, in atransform backend-core | | async for item in self.bound.atransform( backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/base.py", line 1334, in atransform backend-core | | async for output in self.astream(final, config, kwargs): backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 319, in astream backend-core | | 
raise e backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 297, in astream backend-core | | async for chunk in self._astream( backend-core | | File "/usr/local/lib/python3.11/site-packages/langchain_openai/chat_models/base.py", line 579, in _astream backend-core | | response = await self.async_client.create(messages=message_dicts, **params) backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/resources/chat/completions.py", line 1181, in create backend-core | | return await self._post( backend-core | | ^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1790, in post backend-core | | return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1493, in request backend-core | | return await self._request( backend-core | | ^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1546, in _request backend-core | | return await self._retry_request( backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1615, in _retry_request backend-core | | return await self._request( backend-core | | ^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1546, in _request backend-core | | return await self._retry_request( backend-core | | ^^^^^^^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1615, in _retry_request backend-core | | return await self._request( backend-core | | ^^^^^^^^^^^^^^^^^^^^ backend-core | | File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1556, in _request backend-core | | raise APIConnectionError(request=request) from err backend-core | | openai.APIConnectionError: Connection error. backend-core | +------------------------------------

andyzhangwp commented 3 months ago
  1. The log (===========embeddings====base_url= http://172.17.0.1:11434) shows that the base_url in the embeddings method points to the local URL.

  2. Logs (get_chain====model=olama/lama2 and api_base=' http://172.17.0.1:11434 It can be seen that the api_base in the get_chain method points to the local URL, and the model is the local model olama/lama2 ''' INFO: 127.0.0.1:57506 - "GET /healthz HTTP/1.1" 200 OK INFO: 113.118.173.118:16728 - "OPTIONS /chat/8483c3d6-f941-4716-8cb5-84d4ea3b93b8/question/stream?brain_id=1737bf90-7745-4c6f-9efe-6b5a685965a2 HTTP/1.1" 200 OK INFO: 113.118.173.118:16731 - "GET /onboarding HTTP/1.1" 200 OK INFO: 113.118.173.118:16729 - "GET /user HTTP/1.1" 200 OK ==========KnowledgeBrainQA======init ==========KnowledgeBrainQA======increase_usage_user ==========KnowledgeBrainQA======calculate_pricing ===========embeddings====base_url= http://172.17.0.1:11434. INFO: 113.118.173.118:16728 - "POST /chat/8483c3d6-f941-4716-8cb5-84d4ea3b93b8/question/stream?brain_id=1737bf90-7745-4c6f-9efe-6b5a685965a2 HTTP/1.1" 200 OK ==========KnowledgeBrainQA======generate_stream ===========get_chain===api_base= http://172.17.0.1:11434. ===========get_chain====model= ollama/llama2. ========xxxxxxxxxxxx=====get_chain, loaded_memory:mapper={ chat_history: RunnableLambda(lambda x: self.filter_history(x['chat_history'])), question: RunnableLambda(lambda x: x['question']) } , standalone_question:{'standalone_question': { question: RunnableLambda(...), chat_history: RunnableLambda(itemgetter('chat_history')) } | PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language. Keep as much details as possible from previous messages. Keep entity names and all. \n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:') | ChatLiteLLM(client=<module 'litellm' from '/usr/local/lib/python3.11/site-packages/litellm/init.py'>, model='ollama/llama2', openai_api_key='tk-aabbccddAABBCCDDEeFfGgHhIiJKLmnopjklMNOPqQqQqQqQ', azure_api_key='', anthropic_api_key='null' , replicate_api_key='', cohere_api_key='', openrouter_api_key='', api_base='http://172.17.0.1:11434', temperature=0.0, huggingface_api_key='', together_ai_api_key='') | StrOutputParser()}, retrieved_documents:{'docs': RunnableLambda(itemgetter('standalone_question')) | ContextualCompressionRetriever(base_compressor=FlashrankRerank(client=<flashrank.Ranker.Ranker object at 0x7f9d9e4ff690>, top_n=10, model='ms-marco-TinyBERT-L-2-v2'), base_retriever=VectorStoreRetriever(tags=['CustomSupabaseVectorStore', 'Ol lamaEmbeddings'], vectorstore=<vectorstore.supabase.CustomSupabaseVectorStore object at 0x7f9d9eed4dd0>)), 'question': <function QuivrRAG.get_chain.. at 0x7f9d9df21760>, 'custom_instructions': <function QuivrRAG.get_chain.<loca ls>. at 0x7f9d9df219e0>}, answer:{'answer': { context: RunnableLambda(...), question: RunnableLambda(itemgetter('question')), custom_instructions: RunnableLambda(itemgetter('custom_instructions')) '''

  3. This LiteLLM log shows the local Ollama generate API being called:

        POST Request Sent from LiteLLM:
        curl -X POST \
        http://172.17.0.1:11434/api/generate \
        -d '{'model': 'llama2', 'prompt': "Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language. Keep as much details as possible from previous messages. Keep entity names and all. \n\nChat History:\n[HumanMessage(content='this is a question'),

DEBUG:httpx:load_ssl_context verify=True cert=None trust_env=True http2=False DEBUG:httpx:load_verify_locations cafile='/usr/local/lib/python3.11/site-packages/certifi/cacert.pem' DEBUG:httpcore.connection:connect_tcp.started host='172.17.0.1' port=11434 local_address=None timeout=6000 socket_options=None DEBUG:httpcore.connection:connect_tcp.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x7f9d9eede410> DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']> DEBUG:httpcore.http11:send_request_headers.complete DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']> DEBUG:httpcore.http11:send_request_body.complete DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']> DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Content-Type', b'application/x-ndjson'), (b'Date', b'Fri, 07 Jun 2024 01:07:32 GMT'), (b'Transfer-Encoding', b'chunked')]) INFO:httpx:HTTP Request: POST http://172.17.0.1:11434/api/generate "HTTP/1.1 200 OK" DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'POST']> '''

  1. This LiteLLM log indicates asynchronous processing of chunks and returns content. ''' 01:12:18 - LiteLLM:DEBUG: utils.py:1054 - final returned processed chunk: ModelResponse(id='chatcmpl-349d86d7-d584-46cf-b3fb-5347a6a4a772', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='?', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1717722738, model='llama2', object='chat.completion.chunk', system_fingerprint=None) DEBUG:LiteLLM:final returned processed chunk: ModelResponse(id='chatcmpl-349d86d7-d584-46cf-b3fb-5347a6a4a772', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='?', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1717722738, model='llama2', object='chat.completion.chunk', system_fingerprint=None) 01:12:18 - LiteLLM:DEBUG: utils.py:1054 - Logging Details LiteLLM-Async Success Call: None DEBUG:LiteLLM:Logging Details LiteLLM-Async Success Call: None 01:12:18 - LiteLLM:DEBUG: utils.py:1413 - Model=llama2; DEBUG:LiteLLM:Model=llama2; 01:12:44 - LiteLLM:DEBUG: utils.py:1054 - value of async chunk: {"model":"llama2","created_at":"2024-06-07T01:12:44.631141821Z","response":"","done":true,"done_reason":"stop","context":[518,25580,29962,3532,14816,29903,29958,5299,829,14816,29903,6778,13,13,29954,5428,278,1494,14983,322,263,1101,701,1139,29892,337,24588,559,278,1101,701,1139,304,367,263,2317,18785,1139,29892,297,967,2441,4086,29889,19152,408,1568,4902,408,1950,515,3517,7191,29889,19152,7855,2983,322,599,29889,29871,13,13,1451,271,5298,29901,13,29961,29950,7889,3728,29898,3051,2433,31530,30923,30210,31415,235,183,169,30659,236,165,160,30392,30923,31022,29973,5477,319,29902,3728,29898,3051,2433,1495,29962,13,29943,2952,5020,10567,29901,29871,31530,30923,30210,31415,235,183,169,30659,236,165,160,30392,30923,31022,13,11042,18785,1139,29901,518,29914,25580,29962,13,26036,894,29901,1128,1568,338,3219,29915,29879,6782,27684,29973],"total_duration":416650857735,"load_duration":49959557076,"prompt_eval_count":131,"prompt_eval_duration":54868685000,"eval_count":13,"eval_duration":311526899000} '''
  2. The following log shows an exception in OpenAI ’‘’ 01:12:44 - LiteLLM:DEBUG: utils.py:1054 - Looking up model=ollama/llama2 in model_cost_map DEBUG:LiteLLM:Looking up model=ollama/llama2 in model_cost_map INFO: 127.0.0.1:60036 - "GET /healthz HTTP/1.1" 200 OK 01:12:44 - LiteLLM:DEBUG: utils.py:1054 - Success: model=ollama/llama2 in model_cost_map DEBUG:LiteLLM:Success: model=ollama/llama2 in model_cost_map 01:12:44 - LiteLLM:DEBUG: utils.py:1054 - prompt_tokens=96; completion_tokens=11 DEBUG:LiteLLM:prompt_tokens=96; completion_tokens=11 01:12:44 - LiteLLM:DEBUG: utils.py:1054 - Returned custom cost for model=ollama/llama2 - prompt_tokens_cost_usd_dollar: 0.0, completion_tokens_cost_usd_dollar: 0.0 DEBUG:LiteLLM:Returned custom cost for model=ollama/llama2 - prompt_tokens_cost_usd_dollar: 0.0, completion_tokens_cost_usd_dollar: 0.0 01:12:44 - LiteLLM:DEBUG: utils.py:1054 - final cost: 0.0; prompt_tokens_cost_usd_dollar: 0.0; completion_tokens_cost_usd_dollar: 0.0 --More--DEBUG:LiteLLM:final cost: 0.0; prompt_tokens_cost_usd_dollar: 0.0; completion_tokens_cost_usd_dollar: 0.0 01:12:44 - LiteLLM:DEBUG: utils.py:2136 - Model=llama2; cost=0.0 DEBUG:LiteLLM:Model=llama2; cost=0.0 DEBUG:httpcore.http11:receive_response_body.complete DEBUG:httpcore.http11:response_closed.started DEBUG:httpcore.http11:response_closed.complete DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.17.0.1:11434 DEBUG:urllib3.connectionpool:http://172.17.0.1:11434 "POST /api/embeddings HTTP/1.1" 200 None DEBUG:httpcore.connection:close.started DEBUG:httpcore.connection:close.complete DEBUG:httpcore.connection:close.started DEBUG:httpcore.connection:close.complete DEBUG:httpcore.connection:connect_tcp.started host='59.110.16.36' port=54321 local_address=None timeout=120 socket_options=None DEBUG:httpcore.connection:connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7f9d9e581b90> DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']> DEBUG:httpcore.http11:send_request_headers.complete DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']> DEBUG:httpcore.http11:send_request_body.complete DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']> DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Content-Type', b'application/json; charset=utf-8'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'Date', b'Fri, 07 Jun 2024 01:13:10 GMT'), (b'Server', b'postgrest/12.0.1'), (b'Content-Range', b'0-5/'), (b'Content-Profile', b'public'), (b'Access-Control-Allow-Origin', b''), (b'X-Kong-Upstream-Latency', b'189'), (b'X-Kong-Proxy-Latency', b'6'), (b'Via', b'kong/2.8.1')]) INFO:httpx:HTTP Request: POST http://59.110.16.36:54321/rest/v1/rpc/match_vectors "HTTP/1.1 200 OK" DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'POST']> DEBUG:httpcore.http11:receive_response_body.complete DEBUG:httpcore.http11:response_closed.started DEBUG:httpcore.http11:response_closed.complete DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'content': "\nWhen answering use markdown to make it concise and neat.\nUse the following pieces of context from files provided by the user that are store in a brain to answer the users question in the same language as the user question. Your name is Quivr. You're a helpful assistant. 
\nIf you don't know the answer with the context provided from the files, just say that you don't know, don't try to make up an answer.\nUser instruction to follow if provided to answer: None\n", 'role': 'system'}, {'content': "\nContext:\nSource: 0 \n Filename: Plan Artemis Sept 2020.pdf Content: # Initial Human Mars Mission\n\n|Distance to Mars|140 million miles from Earth|\n|---|---|\n|Challenges|More dangerous levels and types of radiation in deep space|\n\n# NASA's Plan for Sustained Lunar Exploration and Development\n\nCommercial Partnerships\nWill result in opportunities for the first international and commercial astronauts on the lunar surface\n---\n#\n\n# NASA's Plan for Sustained Lunar Exploration and Development\n\n# NASA's Plan for Sustained Lunar Exploration and Developme 。。。。。。 leading the development of clear policy principles to support civil space exploration with an initial emphasis on the Artemis program.\n---\n#\n\n# Document\n\n# NASA's Plan for Sustained Lunar Exploration and Development\n\n|Establishment of Predictable and Safe Process for Space Resource Extraction|Yes|\n|---|---|\n|Plan for Sustained Lunar Presence|Yes|\n|First Human Mission to Mars|Yes|\n|Premium Activated|Yes|\n|License Premier|No|\n---\n#\n\n# Artemis Generation\n\n# Artemis Generation\n\nFigure 1: Concept image of an early Artemis expedition on the Lunar surface.\n\nUser Question: Original Question: How much is Ma's transfer fee?\nAnswer:\n", 'role': 'user'}], 'model': 'ollama/llama2', 'max_tokens': 1000, 'n': 1, 'stream': True, 'temperature': 0.1, 'tool_choice': {'type': 'function', 'function': {'name': 'cited_answer'}}, 'tools': [{'type': 'function', 'function': {'name': 'cited_answer', 'description': 'Answer the user question based only on the given sources, and cite the sources used.', 'parameters': {'type': 'object', 'properties': {'answer': {'description': 'The answer to the user question, which is based only on the given sources.', 'type': 'string'}, 'citations': {'description': 'The integer IDs of the SPECIFIC sources which justify the answer.', 'type': 'array', 'items': {'type': 'integer'}}}, 'required': ['answer', 'citations']}}}]}} DEBUG:httpcore.connection:connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=None socket_options=None --More--DEBUG:httpcore.connection:connect_tcp.failed exception=ConnectError(OSError('All connection attempts failed')) DEBUG:openai._base_client:Encountered Exception Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 69, in map_httpcore_exceptions yield File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 373, in handle_async_request resp = await self._pool.handle_async_request(req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 216, in handle_async_request raise exc from None File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 196, in handle_async_request response = await connection.handle_async_request( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection.py", line 99, in handle_async_request raise exc File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection.py", line 76, in handle_async_request stream = await self._connect(request) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection.py", line 
122, in _connect stream = await self._network_backend.connect_tcp(**kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/httpcore/_backends/auto.py", line 30, in connect_tcp return await self._backend.connect_tcp( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 114, in connect_tcp with map_exceptions(exc_map): File "/usr/local/lib/python3.11/contextlib.py", line 155, in exit self.gen.throw(typ, value, traceback) File "/usr/local/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions raise to_exc(exc) from exc httpcore.ConnectError: All connection attempts failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1522, in _request response = await self._client.send( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1661, in send response = await self._send_handling_auth( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1689, in _send_handling_auth response = await self._send_handling_redirects( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1726, in _send_handling_redirects response = await self._send_single_request(request) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1763, in _send_single_request response = await transport.handle_async_request(request) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 372, in handle_async_request with map_httpcore_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in exit self.gen.throw(typ, value, traceback) File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions raise mapped_exc(message) from exc httpx.ConnectError: All connection attempts failed DEBUG:openai._base_client:1 retry left INFO:openai._base_client:Retrying request to /chat/completions in 0.791249 seconds ‘’‘

In summary, no issues were found with the local configuration, and the local model really was called. However, the OpenAI exception log in the last item above shows that OpenAI was also called, and from the sequence of events it happened after the asynchronous chunk processing had completed. It is not clear to me which part of the code triggered the call to OpenAI's /chat/completions endpoint.

dosubot[bot] commented 3 months ago

The issue you're encountering, where OpenAI's API is being called despite the local Ollama model being configured, is likely due to a fallback mechanism or a misconfiguration in the code that handles the language model selection. Here are the steps to ensure that only the local Ollama model is used and to prevent any calls to OpenAI:

  1. Ensure OLLAMA_API_BASE_URL is Set Correctly: You've already confirmed that the OLLAMA_API_BASE_URL is set correctly and that the local Ollama model is being used in some parts of the code. This is good.

  2. Check the model_compatible_with_function_calling Method: Ensure that the models "ollama/llama2" and "ollama/qwen:1.8b" are added to the model_compatible_with_function_calling method, which you have already done.

  3. Verify the _create_llm Method: Ensure that the _create_llm method in the HeadlessQA class is correctly configured to use the local Ollama API when the model name starts with "ollama". This method should set the api_base to self.brain_settings.ollama_api_base_url.

    def _create_llm(
       self,
       model,
       temperature=0,
       streaming=False,
       callbacks=None,
    ) -> BaseChatModel:
       api_base = None
       if self.brain_settings.ollama_api_base_url and model.startswith("ollama"):
           api_base = self.brain_settings.ollama_api_base_url
           logger.debug("Creating Ollama LLM with model: %s and API base: %s", model, api_base)
       else:
           logger.debug("Creating OpenAI LLM with model: %s", model)
    
       return ChatLiteLLM(
           temperature=temperature,
           model=model,
           streaming=streaming,
           verbose=True,
           callbacks=callbacks,
           max_tokens=self.max_tokens,
           api_base=api_base,
       )
  4. Check the generate_stream Method: The generate_stream method in knowledge_brain_qa.py might be falling back to OpenAI's API under certain conditions. Ensure that this method is correctly configured to use the local Ollama model.

    async def generate_stream(
       self, chat_id: UUID, question: ChatQuestion, save_answer: bool = True
    ) -> AsyncIterable:
       conversational_qa_chain = self.knowledge_qa.get_chain()
       transformed_history, streamed_chat_history = (
           self.initialize_streamed_chat_history(chat_id, question)
       )
       response_tokens = ""
       sources = []
       citations = []
       first = True
       config = {"metadata": {"conversation_id": str(chat_id)}}
    
       async for chunk in conversational_qa_chain.astream(
           {
               "question": question.question,
               "chat_history": transformed_history,
               "custom_personality": (
                   self.prompt_to_use.content if self.prompt_to_use else None
               ),
           },
           config=config,
       ):
           if not streamed_chat_history.metadata:
               streamed_chat_history.metadata = {}
    
           if self.model_compatible_with_function_calling(model=self.model):
               if chunk.get("answer"):
                   if first:
                       gathered = chunk["answer"]
                       first = False
                   else:
                       gathered = gathered + chunk["answer"]
                       if (
                           gathered.tool_calls
                           and gathered.tool_calls[-1].get("args")
                           and "answer" in gathered.tool_calls[-1]["args"]
                       ):
                           answer = gathered.tool_calls[-1]["args"]["answer"]
                           difference = answer[len(response_tokens):]
                           streamed_chat_history.assistant = difference
                           response_tokens = answer
                           yield f"data: {json.dumps(streamed_chat_history.dict())}"
                           if (
                               gathered.tool_calls
                               and gathered.tool_calls[-1].get("args")
                               and "citations" in gathered.tool_calls[-1]["args"]
                           ):
                               citations = gathered.tool_calls[-1]["args"]["citations"]
                           if (
                               gathered.tool_calls
                               and gathered.tool_calls[-1].get("args")
                               and "followup_questions" in gathered.tool_calls[-1]["args"]
                           ):
                               followup_questions = gathered.tool_calls[-1]["args"][
                                   "followup_questions"
                               ]
                               streamed_chat_history.metadata["followup_questions"] = (
                                   followup_questions
                               )
                           if (
                               gathered.tool_calls
                               and gathered.tool_calls[-1].get("args")
                               and "thoughts" in gathered.tool_calls[-1]["args"]
                           ):
                               thoughts = gathered.tool_calls[-1]["args"]["thoughts"]
                               streamed_chat_history.metadata["thoughts"] = thoughts
           else:
               if chunk.get("answer"):
                   response_tokens += chunk["answer"].content
                   streamed_chat_history.assistant = chunk["answer"].content
                   yield f"data: {json.dumps(streamed_chat_history.dict())}"
    
           if chunk.get("docs"):
               sources = chunk["docs"]
    
       sources_list = generate_source(sources, self.brain_id, citations)
       serialized_sources_list = [source.dict() for source in sources_list]
       streamed_chat_history.metadata["sources"] = serialized_sources_list
       yield f"data: {json.dumps(streamed_chat_history.dict())}"
       self.save_answer(question, response_tokens, streamed_chat_history, save_answer)
  5. Check for Other Methods Making API Calls: Ensure that other methods such as make_completion, make_completion_without_streaming, and get_api_call_response_as_text in api_brain_qa.py are not making calls to OpenAI's API.

    async def make_completion(
       self,
       messages,
       functions,
       brain_id: UUID,
       recursive_count=0,
       should_log_steps=True,
    ) -> str | None:
       response = completion(
           model=self.model,
           temperature=self.temperature,
           max_tokens=self.max_tokens,
           messages=messages,
           functions=functions,
           stream=True,
           function_call="auto",
       )
  6. Disable the OpenAI API Key: Set OPENAI_API_KEY to a placeholder value so that any unintended call to OpenAI's API fails immediately and is easy to spot in the logs (a small illustrative startup check follows after these steps).

    OPENAI_API_KEY=tk-aabbccddAABBCCDDEeFfGgHhIiJKLmnopjklMNOPqQqQqQqQ

By following these steps, you should be able to ensure that only the local Ollama model is used and prevent any calls to OpenAI [1][2][3].
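
As a follow-up to steps 1 and 6, a small startup check can surface a misconfiguration before any chat request is made. This is a hypothetical helper, not part of Quivr (the function name is made up); it only assumes that a reachable Ollama server answers its root URL with "Ollama is running", as shown earlier in this thread:

    import os
    import urllib.request


    def check_local_llm_config(timeout: float = 5.0) -> None:
        """Fail fast if the local Ollama endpoint is not configured or not reachable."""
        base_url = os.environ.get("OLLAMA_API_BASE_URL")
        if not base_url:
            raise RuntimeError("OLLAMA_API_BASE_URL is not set; requests may fall back to OpenAI.")
        try:
            with urllib.request.urlopen(base_url, timeout=timeout) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except OSError as exc:
            raise RuntimeError(f"Cannot reach Ollama at {base_url}: {exc}") from exc
        # A running Ollama server answers its root endpoint with "Ollama is running".
        if "Ollama is running" not in body:
            raise RuntimeError(f"Unexpected response from {base_url}: {body[:80]!r}")


    if __name__ == "__main__":
        check_local_llm_config()
        print("Local Ollama endpoint looks reachable.")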

To continue talking to Dosu, mention @dosu.

github-actions[bot] commented 3 weeks ago

Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.