Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.
https://useanything.com
MIT License

[BUG]: RAG use with large context windows/many files #1187

Open · RahSwe opened this issue 2 months ago

RahSwe commented 2 months ago

How are you running AnythingLLM?

Not listed

What happened?

Using Railway instance, with Gemini Pro 1.5 (1M context window).

I am trying to take full advantage of the context window by pinning many large files (20+). However, it seems that ALLM does not include all of the pinned files in the request, and only some of the pinned files appear in the citations. From the answers, it seems clear that some of the pinned files are not being provided to the model.

Are there known steps to reproduce?

I would guess: use the Gemini Pro 1.5 API, pin a lot of files (while also having some files unpinned), and verify whether ALLM correctly submits all the pinned files along with snippets from the unpinned files, and correctly reflects the sources in the citations.

timothycarambat commented 2 months ago

Are you sure files are being omitted and it is not just the model ignoring the context? In the code there is no limit to how many files can be pinned, but there is still a context window, and if the prompt is over 1M tokens it will still be pruned from the chat.

Since you are on Railway, you should see in the logs when a chat is sent whether the context is too large for Google. It should say something like cannonballing context xxxx to xxxx or something along those lines - which indicates you still sent too much text and we had to truncate it.
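For reference, a minimal sketch of what that kind of pruning looks like. All names here are illustrative, not AnythingLLM's actual implementation (the real logic lives in server/utils/chats/index.js, linked below):

```js
// Illustrative sketch of context-window pruning: if the assembled prompt
// exceeds the model's token budget, trim it and log the truncation.
function pruneToContextWindow(promptText, maxTokens, countTokens) {
  const tokenCount = countTokens(promptText);
  if (tokenCount <= maxTokens) return promptText;

  // Naive proportional truncation by characters; a real implementation
  // would trim on token boundaries using the model's tokenizer.
  const keepChars = Math.floor(promptText.length * (maxTokens / tokenCount));
  console.log(`cannonballing context ${tokenCount} to ${maxTokens}`);
  return promptText.slice(0, keepChars);
}

// Rough stand-in tokenizer for the sketch: ~4 characters per token.
const approxTokens = (text) => Math.ceil(text.length / 4);
pruneToContextWindow("some very long prompt...", 1_048_576, approxTokens);
```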

RahSwe commented 2 months ago

Hi Tim!

Yes. I am not sure whether it is connected to a redeployment, but I suddenly have major issues in all workspaces and am getting reports from all over the organization. I have temporarily shut down the instance and have been testing for an hour or two: deleting all files and switching between different AI models (Claude 3 Opus or Gemini 1.5), different embedders (local or OpenAI Large), and vector databases (local or Pinecone). My server went from awesome to getting all kinds of things wrong, no longer knowing the information in embeddings and pinned documents.

I have made sure that I am within the 1M range by testing with only a few smaller files, but it really seems to be hit and miss.

By the way, are there any downsides to using the internal vector database and internal embedding model as opposed to the ones mentioned above?


RahSwe commented 2 months ago

Hi, adding info that may be relevant: when I look at the citations, some of the pinned files are sometimes shown but not other pinned files (files that should also have very high relevance).


RahSwe commented 2 months ago

This is a picture from continued testing on our server. It is very strange. Instead of getting the answer from Gemini Pro, it now seems to put a weird reference at the beginning of the answer. The answer is also wrong: it does not take into account the most relevant (pinned) document, which is also not listed in the citations, while another pinned document (AB 04), which contains no information on the current query, is listed.

[screenshot attached: image.png]


RahSwe commented 2 months ago

Hi,

Another question. In my troubleshooting of this issue I am removing vectors, and as I have understood it, it is also necessary to delete all files when switching embedding model, then re-upload and re-embed? That was the information in the box that appears when switching models, but it seems unnecessary to have to remove and re-upload all the files.

If it is necessary, it would be great to have an option to delete all files in a folder without deleting the folder itself (I have about 30 folders).

BR Richard


timothycarambat commented 2 months ago

🤔 Hm, if a document is pinned then it for sure will be used in the messages sent to the model.

If you change the embedding model, all previously embedded documents and workspaces break, because the differences between embedder models' outputs make vector search impossible. So they would need to be deleted and re-embedded.
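As an illustration of why mixed embedders break search (hypothetical dimensions, not AnythingLLM code): vectors produced by two different models live in unrelated spaces, often of different sizes, so similarity between them is either undefined or pure noise:

```js
// Hypothetical example: an old embedder produced 384-dimensional vectors,
// while a new one produces 1536-dimensional vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Dimension mismatch: vectors from different embedders are not comparable");
  }
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] ** 2;
    normB += b[i] ** 2;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const storedVector = new Array(384).fill(0.1);  // embedded with the old model
const queryVector = new Array(1536).fill(0.1);  // embedded with the new model

try {
  cosineSimilarity(storedVector, queryVector);
} catch (e) {
  console.log(e.message); // search fails -> documents must be re-embedded
}
// Even if the dimensions happened to match, the two models' vector spaces
// are not aligned, so the similarity scores would be meaningless noise.
```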

This is where it occurs: https://github.com/Mintplex-Labs/anything-llm/blob/dfaaf1680ff4de647de727fcd404ec044c5dd8e2/server/utils/chats/index.js#L93

Gemini Pro has a context window of https://github.com/Mintplex-Labs/anything-llm/blob/dfaaf1680ff4de647de727fcd404ec044c5dd8e2/server/utils/AiProviders/gemini/index.js#L59C16-L59C25, and because we manage the context window ourselves and snippets are appended to the system prompt, the real limit is https://github.com/Mintplex-Labs/anything-llm/blob/dfaaf1680ff4de647de727fcd404ec044c5dd8e2/server/utils/AiProviders/gemini/index.js#L26, or ~157,286 tokens - roughly 15% of the full window.
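For clarity, the arithmetic behind that figure (1,048,576 tokens is Gemini 1.5 Pro's context length; the 15% is the system-prompt share described above):

```js
// Worked numbers for the limit described above.
const geminiContextWindow = 1_048_576; // Gemini 1.5 Pro context window, in tokens
const systemFraction = 0.15;           // share reserved for the system prompt + snippets
const effectiveLimit = Math.floor(geminiContextWindow * systemFraction);
console.log(effectiveLimit);           // 157286 -> matches the ~157,286 figure
```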

The reason we put those limits in place is that some people like to use really large inputs and some like to use really large system prompts. We have to cut it one way or another.

I'm willing to bet that because you have so many documents pinned, and that context is still more than 100K tokens, most of your documents are not making it into the chat.

I am also willing to bet that running /reset in a chat window to wipe the history will result in a better first response. I think what we need to do is make the prompt-window allocations dynamic based on which value is larger.

If the system prompt is huge, it should take the majority of the window and shrink the allocation for the user prompt. If the user prompt is large, then shrink the allocation for the system prompt.

The context window is limited, so you can't have it both ways. That is for sure what is going on here.
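A rough sketch of what that dynamic split could look like (hypothetical names and logic, not AnythingLLM's current behavior):

```js
// Hypothetical dynamic allocation: instead of a fixed split, scale the
// system side (prompt + pinned docs) and the user side (prompt + history)
// proportionally, so whichever side is larger keeps the larger share.
function allocateWindow(contextWindow, systemTokens, userTokens) {
  const total = systemTokens + userTokens;
  if (total <= contextWindow) {
    return { systemBudget: systemTokens, userBudget: userTokens }; // everything fits
  }
  const scale = contextWindow / total;
  return {
    systemBudget: Math.floor(systemTokens * scale),
    userBudget: Math.floor(userTokens * scale),
  };
}

// Example: 1.2M tokens of pinned documents vs. a 2K-token user prompt in a
// 1,048,576-token window -> the system side keeps nearly all of its text
// instead of being capped at a fixed 15% slice.
console.log(allocateWindow(1_048_576, 1_200_000, 2_000));
```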

RahSwe commented 2 months ago

Hi Tim, yes, it sounds like a dynamic mode would be necessary, or some kind of setting in the workspace area. We typically max out the context window minus ~10% with pinned files, with the user prompt and answer being a very small part of the intended context-window usage.

And please let me add another feature wish: an admin tab or similar where you can adjust chosen settings across all workspaces at once. I have 25 workspaces that I sometimes update all in the same way, which gets a bit tedious =)

Kind regards,
Richard Sahlberg


RahSwe commented 2 months ago

Another thought on this: I have experimented with pinning beyond the max tokens of the AI model, and I have received error messages from the model itself saying that the total number of tokens is too large. Just curious how that is possible if there are set limits/fractions in ALLM for the different parts of the prompt.

Kind regards,
Richard Sahlberg
