Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License

Embedding seems not working #279

Closed. fdchiu closed this issue 1 year ago.

fdchiu commented 1 year ago

Steps:

1) Add a PDF file to the hotdir with python watch.py running.
2) Check the terminal messages from watch.py to make sure the file is processed.
3) Go back to the anything-llm tab in the browser, where a workspace is already set up and running.
4) Go to the workspace settings and click/enable embedding of the newly added document.
5) Ask GPT a question about information that is only available in the document.
6) GPT is not able to answer the question.

Please let me know if anything is missing in my steps.

Thanks!
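
A quick way to double-check step 2 is to look at what watch.py left behind. A minimal sketch, assuming default checkout paths; the processed-output directory here is a guess, not confirmed in this thread:

```python
from pathlib import Path

# Assumed locations relative to the repo root -- adjust to your checkout.
# watch.py picks up files dropped into the hotdir and emits JSON documents
# that the server can later embed into a workspace.
HOTDIR = Path("collector/hotdir")        # where the PDF was dropped
DOCS = Path("server/storage/documents")  # assumed processed-output dir

if HOTDIR.is_dir():
    print("Still waiting in hotdir:", sorted(p.name for p in HOTDIR.glob("*.pdf")))
if DOCS.is_dir():
    print("Processed JSON documents:", sorted(p.name for p in DOCS.rglob("*.json")))
```

If the PDF is still sitting in the hotdir, or no JSON appeared, the failure is in the collector step rather than in embedding.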

timothycarambat commented 1 year ago

Do you happen to see the vector count on the sidebar increase post-embedding 🤔 ?

fdchiu commented 1 year ago

I did not see any indication like what you described. When I select a local document to embed from the workspace, I see the popup to confirm it. Once I confirm, close the popup, and come back to workspace settings -> Documents, the previously selected doc still has a green icon in front of it. So the doc is somehow NOT embedded?

After further testing I was able to get two docs embedded. Here is what I observed when the embedding fails:

1) OpenAI 401 error (authentication problem). I need to enter the OpenAI key every time I start the frontend, so I need to double-check that the key is set up correctly. I also have the OpenAI key set in the server's env file.

2) Database issues:

Inserting vectorized chunks into LanceDB collection.
[Error: LanceDBError: Append with different schema: original=Field(id=0, name=vector, type=fixed_size_list:float:1536)
Field(id=1, name=id, type=string)
Field(id=2, name=url, type=string)
Field(id=3, name=title, type=string)
Field(id=4, name=docAuthor, type=string)
Field(id=5, name=description, type=string)
Field(id=6, name=docSource, type=string)
Field(id=7, name=chunkSource, type=string)
Field(id=8, name=published, type=string)
Field(id=9, name=wordCount, type=double)
Field(id=10, name=token_count_estimate, type=double)
Field(id=11, name=text, type=string)
 new=Field(id=0, name=vector, type=fixed_size_list:float:1536)
Field(id=1, name=id, type=string)
Field(id=2, name=url, type=string)
Field(id=3, name=title, type=string)
Field(id=4, name=description, type=string)
Field(id=5, name=published, type=string)
Field(id=6, name=wordCount, type=double)
Field(id=7, name=token_count_estimate, type=double)
Field(id=8, name=text, type=string)
]
addDocumentToNamespace LanceDBError: Append with different schema: original=Field(id=0, name=vector, type=fixed_size_list:float:1536)
Field(id=1, name=id, type=string)
Field(id=2, name=url, type=string)
Field(id=3, name=title, type=string)
Field(id=4, name=docAuthor, type=string)
Field(id=5, name=description, type=string)
Field(id=6, name=docSource, type=string)
Field(id=7, name=chunkSource, type=string)
Field(id=8, name=published, type=string)
Field(id=9, name=wordCount, type=double)
Field(id=10, name=token_count_estimate, type=double)
Field(id=11, name=text, type=string)
 new=Field(id=0, name=vector, type=fixed_size_list:float:1536)
Field(id=1, name=id, type=string)
Field(id=2, name=url, type=string)
Field(id=3, name=title, type=string)
Field(id=4, name=description, type=string)
Field(id=5, name=published, type=string)
Field(id=6, name=wordCount, type=double)
Field(id=7, name=token_count_estimate, type=double)
Field(id=8, name=text, type=string)

Failed to vectorize website-github.com/article-_AUTOMATIC1111_stable-diffusion-webui_wiki_API.json

I needed to check the server log to find the error.
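
For context on the log above: LanceDB rejects appends whose rows do not match the schema the table was created with, and the diff in the error shows exactly that (the existing table has docAuthor, docSource, and chunkSource columns; the new rows do not). Here is a minimal sketch of the same class of failure and one blunt recovery, using the LanceDB Python client for illustration (the server itself uses the JS client) and made-up data:

```python
import lancedb

db = lancedb.connect("/tmp/lancedb-demo")

# Create a table whose schema is inferred from the first batch of rows.
db.create_table("docs", data=[
    {"vector": [0.1, 0.2], "id": "a", "title": "first", "docAuthor": "me"},
])

table = db.open_table("docs")
try:
    # Appending rows that are missing a column (docAuthor) produces the same
    # "Append with different schema" style of error seen in the log above.
    table.add([{"vector": [0.3, 0.4], "id": "b", "title": "second"}])
except Exception as err:
    print("append failed:", err)
    # Blunt fix: drop the table and let the next embed recreate it with
    # whatever schema the current code produces.
    db.drop_table("docs")
```

Note that dropping the table loses the previously embedded vectors, so the affected documents would need to be re-embedded afterward.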

timothycarambat commented 1 year ago

Are you running this in Docker or in development mode? If you are running this in Docker, the correct env file placement is docker/.env. Otherwise it's server/.env, and if in development it's server/.env.development.
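
To keep the placements straight, here is the mapping described above written out as data (illustrative only; the server's actual dotenv loading may differ in detail):

```python
# Which .env file applies in each mode, per the comment above.
ENV_FILE_BY_MODE = {
    "docker": "docker/.env",
    "development": "server/.env.development",
    "default": "server/.env",  # anything else
}

for mode, path in ENV_FILE_BY_MODE.items():
    print(f"{mode:>12} -> {path}")
```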

Confusing, I know. Will be resolved by #281

fdchiu commented 1 year ago

I was running in development mode.

I assume you were talking about the LanceDB issue?

The issue with saving the API keys (OpenAI, Pinecone, etc.): having entered them in the frontend, and also in the server .env, they still don't stick, which seems bizarre.

Next time I restart the frontend, I have to re-enter the keys and config. If there is anything I can help you debug, let me know. I was running on localhost.
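
On the 401s mentioned earlier: one quick way to rule the key itself in or out, independent of the app, is to hit an authenticated OpenAI endpoint directly. A sketch using the requests package; OPEN_AI_KEY matches the env var name used elsewhere in this thread:

```python
import os
import requests

# GET /v1/models is a cheap authenticated call: a 200 means the key is
# valid, a 401 reproduces the authentication error seen during embedding.
resp = requests.get(
    "https://api.openai.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['OPEN_AI_KEY']}"},
    timeout=30,
)
print(resp.status_code)
```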

timothycarambat commented 1 year ago

Yeah, so if you're running in development mode and you make an edit to the code, the backend will hot-reload, which also means that your process.env is unset! That is why the keys keep clearing for you.

You need to set up a server/.env.development and put the proper keys in it; that way, on server reloads/restarts you won't have to re-input your credentials.

# Example /server/.env.development

SERVER_PORT=3001
CACHE_VECTORS="true"
JWT_SECRET="some-random-JWT-string" # Please generate random string at least 12 chars long.

###########################################
######## LLM API SELECTION ################
###########################################
LLM_PROVIDER='openai'
OPEN_AI_KEY=sk-ABC123OPENAI
OPEN_MODEL_PREF='gpt-3.5-turbo'

###########################################
######## Vector Database Selection ########
###########################################

# Enable all below if you are using vector database: Pinecone.
VECTOR_DB="pinecone"
PINECONE_ENVIRONMENT=us-xxxx-gcp
PINECONE_API_KEY='123-456'
PINECONE_INDEX=your-index-name
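
For the JWT_SECRET line above, any sufficiently long random string will do; a quick way to generate one with Python's standard library:

```python
import secrets

# 24 random bytes -> a 48-character hex string, well past the 12-char
# minimum suggested in the example above.
print(secrets.token_hex(24))
```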

timothycarambat commented 1 year ago

As for the Lance issue, I think the code might be assuming a LanceDB environment exists, and because the vector DB selection keeps changing, that is what is going wrong. I'll hold off on commenting on that issue for now. The primary issue for you is the hot-reloading of the backend wiping out your process.env.

fdchiu commented 1 year ago

@timothycarambat I do have the keys and configuration set up in the server's .env. But when the frontend is started, the settings for the keys are empty and I have to manually enter them again.

But I'll retest.