Farzad-R / LLM-Zero-to-Hundred

This repository contains different LLM chatbot projects (RAG, LLM agents, etc.) and well-known techniques for training and fine tuning LLMs.
193 stars 105 forks

RAG-GPT: Number of vectors in vectordb: 0 #14

Closed taraazin closed 2 months ago

taraazin commented 2 months ago

Hello, running "upload_data_manually.py" does not create vectors in vectordb. I have successfully connected the model to Azure, and the chatbot works fine based on the pretrained model; however, I cannot upload my own data.

Farzad-R commented 2 months ago

Hello, Would you please provide more information? Which project are you referring to? And what is the error in the terminal?

taraazin commented 2 months ago

Thanks for your response.

I'm trying to run the RAG-GPT model. I have deployed both "gpt-35-turbo" and "text-embedding-ada-002" on Azure. Right now I do not get any errors and the chatbot works fine, but I cannot process the PDFs in the "docs" folder. It looks like "upload_data_manually.py" does not fully create the vectors in the "vectordb" folder. Upon running the script, chroma.sqlite3 is created and I get "Number of vectors in vectordb: 0".

Does it have something to do with my Azure credentials or deployments?

Farzad-R commented 2 months ago

Sure! I hope the problem gets fixed quickly. No, if your chatbot works fine, the deployments should be OK. To verify, run the GPT model and the embedding model in a separate notebook and make sure that you get the desired response from them. If the problem were related to them, though, you would have seen the corresponding errors in the terminal. The only reason that crosses my mind is that the LangChain text extractor cannot extract any text from your documents. In that case you would not see any error, but you would also not see any content in the vector database. For example, scanned PDF files are treated as images rather than text documents, which leads to an empty output. To check this, load the documents in a separate notebook, pass them to the LangChain loader, and inspect its output.

from langchain.document_loaders import PyPDFLoader
from pyprojroot import here
import os

data_directory = here(data/docs)
document_list = os.listdir(self.data_directory)
docs = []
for doc_name in document_list:
      docs.extend(PyPDFLoader(os.path.join(
           data_directory, doc_name)).load())
      doc_counter += 1

then check the contents of docs:

print("Number of loaded documents:", doc_counter)
print("Number of pages:", len(docs), "\n\n")
print(docs)

and verify its contents. My guess is that you won't see any text in there. Also, test it with the documents that I included in the project and let me know the results.
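One way to make this check less manual is to flag the pages whose extracted text is empty, which is what a scanned, image-only PDF would typically produce. The following is a minimal sketch, assuming each item in docs exposes page_content and metadata the way LangChain documents do. The helper name find_empty_pages and the sample pages are hypothetical and only there so the sketch runs without LangChain or any PDF files.

```python
from types import SimpleNamespace


def find_empty_pages(docs):
    """Return the metadata of pages whose extracted text is empty or whitespace only."""
    empty = []
    for doc in docs:
        text = getattr(doc, "page_content", "") or ""
        if not text.strip():
            empty.append(getattr(doc, "metadata", {}))
    return empty


# Hypothetical stand-in pages; in practice pass the `docs` list built by the loader.
sample_docs = [
    SimpleNamespace(page_content="Some extracted text",
                    metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(page_content="   \n",
                    metadata={"source": "scan.pdf", "page": 0}),
]

empty_pages = find_empty_pages(sample_docs)
print("Pages with no extractable text:", empty_pages)
```

If every page of a document shows up in this list, the loader is most likely seeing images rather than text.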

taraazin commented 2 months ago

Thanks for your time. I'm testing the chatbot with the same documents that you have included. So, upon running your test script, I get the following error:

Traceback (most recent call last):
  File "c:\Users\Administrator\Chatbots\LLM-Zero-to-Hundred\RAG-GPT\src\test.py", line 5, in <module>
    data_directory = here(data/docs)
                          ^^^^
NameError: name 'data' is not defined

I changed the script to

from langchain.document_loaders import PyPDFLoader
from pyprojroot import here
import os

data_directory = here("data/docs")
document_list = os.listdir(data_directory)
docs = []
doc_counter = 0  

for doc_name in document_list:
    if doc_name.endswith('.pdf'):
        docs.extend(PyPDFLoader(os.path.join(data_directory, doc_name)).load())
        doc_counter += 1

print("Number of loaded documents:", doc_counter)
print("Number of pages:", len(docs), "\n\n")
print(docs)

and it shows me the content. Do you think it might have something to do with my Azure API version (i.e. 2023-06-01-preview)? I am using the same GPT and embedding models as in the tutorial (gpt-35-turbo, text-embedding-ada-002).

p.s. I also checked if the content of the "docs" folder is called successfully by the config.yml file. Everything seems to work just fine!
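A quick way to double-check what the configured folder actually contains is to resolve it and list the files in it, separating PDFs from everything else. This is a minimal sketch using only the standard library. The function name summarize_data_directory and the temporary folder are assumptions for illustration and are not part of the project.

```python
import tempfile
from pathlib import Path


def summarize_data_directory(directory):
    """Report whether a folder exists and which of its files are PDFs."""
    directory = Path(directory).resolve()
    if not directory.is_dir():
        return {"path": str(directory), "exists": False, "pdfs": [], "others": []}
    names = sorted(p.name for p in directory.iterdir() if p.is_file())
    pdfs = [n for n in names if n.lower().endswith(".pdf")]
    others = [n for n in names if not n.lower().endswith(".pdf")]
    return {"path": str(directory), "exists": True, "pdfs": pdfs, "others": others}


# Hypothetical folder contents, created in a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.pdf").write_bytes(b"%PDF-1.4")
    (Path(tmp) / "notes.txt").write_text("not a pdf")
    summary = summarize_data_directory(tmp)
    print(summary)
```

Pointing the same function at the path read from the configuration would show whether the upload script and the test script are looking at the same files.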

Farzad-R commented 2 months ago

No, it is not related to OpenAI. The code has some difficulties with finding the location of the documents. I am not sure why it is happening on your end.

This error:

Traceback (most recent call last):
  File "c:\Users\Administrator\Chatbots\LLM-Zero-to-Hundred\RAG-GPT\src\test.py", line 5, in <module>
    data_directory = here(data/docs)
                          ^^^^
NameError: name 'data' is not defined

comes from the snippet I posted: the path was missing its quotes, so Python tried to evaluate data as a variable name and failed before here() was ever called. By itself it says nothing about whether the data folder can be found, and here("data/docs") in your version is the correct form. In your code I also noticed the added check if doc_name.endswith('.pdf'), which restricts loading to the PDF files only.
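For context, a NameError like the one in this traceback is raised by the Python interpreter while it evaluates the argument, before the function is called and before any path lookup happens. The sketch below illustrates this, assuming a hypothetical stand-in for here() that joins onto the current directory (pyprojroot's real helper resolves against the project root) so that it runs without extra packages.

```python
from pathlib import Path


def here(relative_path):
    """Hypothetical stand-in for pyprojroot's here(); joins onto the current directory."""
    return Path.cwd() / relative_path


# Unquoted: Python evaluates `data / docs` as a division of two variables
# before calling here(), so the undefined name raises NameError.
try:
    here(data / docs)
except NameError as exc:
    unquoted_error = str(exc)

# Quoted: the argument is a plain string and the call succeeds,
# whether or not the folder exists on disk.
quoted_path = here("data/docs")

print(unquoted_error)
print(quoted_path.name)
```

The quoted call returns a path even for a folder that does not exist, so a missing folder would surface later, for example as a FileNotFoundError from os.listdir.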

My best guess is that whatever is happening comes from the code not properly finding the data folder and the files in it. You mentioned that the code is working now, so I am glad to hear it. If you face any more issues with that, the fix should be quick and easy: just make sure that the code is able to find the documents. Since the problem appears to be solved, please let me know whether you need the issue to remain open or whether I can close it now.

taraazin commented 2 months ago

Thanks for your help!