fynnfluegge / codeqai

Local first semantic code search and chat powered by vector embeddings and LLMs
Apache License 2.0

Indexing Error with codeqai on Conda Environment: Continuous Indexing Without Completion #38

Open TeomanEgeSelcuk opened 5 months ago

TeomanEgeSelcuk commented 5 months ago

While using codeqai inside a conda environment, I ran into a problem where indexing never completes. It happened when I ran codeqai search in my project directory: the process raised IndexError: list index out of range during document vector indexing, and the indexing progress indicator kept displaying afterwards. Below are the detailed steps to reproduce, along with the specific environment setup.

Steps to Reproduce:

  1. Installed codeqai using pip within a conda environment.
  2. Ran codeqai configure and configured the tool with the following settings:
    • Selected "y" for using local embedding models.
    • Chose "Instructor-Large" for the local embedding model.
    • Selected "N" for using local chat models and chose "OpenAI" with "gpt-4" as the remote LLM.
  3. Navigated to my project directory (2-006), which contains .m, .mat, and .txt files, and ran codeqai search in the terminal.
  4. Received a message indicating no vector store was found for 2-006 and that initial indexing may take a few minutes. Shortly after, the indexing process started but then failed with an IndexError: list index out of range.

Expected Behavior:

The indexing process should be completed, allowing for subsequent searches within the codebase using codeqai.

Actual Behavior:

The application failed to complete the indexing process due to an IndexError in the vector indexing step, specifically indicating a problem with handling the document vectors.

Environment:

Full Terminal Output and Error

{GenericDirectory>}conda activate condaqai-env

(condaqai-env) {GenericDirectory>}codeqai search
Not a git repository. Exiting.

(condaqai-env) {GenericDirectory>}ls
'ls' is not recognized as an internal or external command,
operable program or batch file.

(condaqai-env) {GenericDirectory>}cd 2-006

(condaqai-env) {GenericDirectory}\2-006>codeqai search
No vector store found for 2-006. Initial indexing may take a few minutes.
⠋ 💾 Indexing vector store...Traceback (most recent call last):
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\Scripts\codeqai.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\codeqai\__main__.py", line 5, in main
    app.run()
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\codeqai\app.py", line 146, in run
    vector_store.index_documents(documents)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\codeqai\vector_store.py", line 34, in index_documents
    self.db = FAISS.from_documents(documents, self.embeddings)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\langchain_core\vectorstores.py", line 508, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\langchain_community\vectorstores\faiss.py", line 960, in from_texts
    return cls.__from(
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\langchain_community\vectorstores\faiss.py", line 919, in __from
    index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: list index out of range
⠴ 💾 Indexing vector store...

Additional Context:

The traceback ends inside the langchain-community FAISS wrapper at faiss.IndexFlatL2(len(embeddings[0])), which means the embeddings list was empty when the index was built. This is possibly due to an empty or malformed document set being passed for vectorization. Given the configuration steps and the use of a conda environment, specific dependencies or configurations might also contribute to this problem.
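As a rough illustration of that failure mode, here is a minimal pure-Python sketch that does not use faiss or langchain. The names embed_documents, vector_dimension, and reproduce_with_empty_input are hypothetical stand-ins, not part of codeqai or langchain-community. It only assumes that an embedding model returns one vector per input text, so an empty input yields an empty list.

```python
def embed_documents(texts):
    """Hypothetical stand-in for an embedding model: one fixed-size vector per text."""
    return [[0.0, 0.0, 0.0] for _ in texts]


def vector_dimension(texts):
    """Mirror the failing expression from the traceback: len(embeddings[0])."""
    embeddings = embed_documents(texts)
    # With no input texts, embeddings is [] and indexing element 0 fails.
    return len(embeddings[0])


def reproduce_with_empty_input():
    """Return the error message raised for an empty document set, or None."""
    try:
        vector_dimension([])
    except IndexError as exc:
        return str(exc)
    return None
```

Calling vector_dimension with at least one text returns the vector length, while an empty list raises the same IndexError message seen in the terminal output.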

fynnfluegge commented 5 months ago

Thanks for the detailed report! I think the cause is probably an empty split set for a document, as you already mentioned.
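One possible way to guard against this, sketched under assumptions rather than taken from the codeqai source: skip empty splits and avoid building the index when nothing is left. Documents are modeled here as plain dicts with a page_content key, and filter_indexable, safe_index, and index_fn are hypothetical names introduced only for illustration.

```python
def filter_indexable(documents):
    """Drop documents whose text is missing, empty, or whitespace-only."""
    return [doc for doc in documents if doc.get("page_content", "").strip()]


def safe_index(documents, index_fn):
    """Call index_fn only when at least one non-empty document remains.

    index_fn is a hypothetical callable standing in for the real indexing
    step (for example, a wrapper around a vector store constructor).
    Returns None when there is nothing to index.
    """
    indexable = filter_indexable(documents)
    if not indexable:
        # Nothing to embed: skip indexing instead of failing on embeddings[0].
        return None
    return index_fn(indexable)
```

With this shape, the caller can check for None and print a message such as "no indexable files found" instead of surfacing a traceback.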