Might be useful to dig into the different kinds of response modes that llama-index has: https://gpt-index.readthedocs.io/en/stable/core_modules/query_modules/response_synthesizers/usage_pattern.html
Maybe "simple_summarize" is something to explore which truncates all text chunks to fit into a single LLM prompt. I think in this case, we can make set chunk_overlap_ratio=0
as we're going to fit it all in a single call.
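A minimal sketch of what that might look like, assuming the legacy `ServiceContext` / `PromptHelper` API (the `"data"` directory and the query string are placeholders):

```python
from llama_index import (
    PromptHelper,
    ServiceContext,
    SimpleDirectoryReader,
    VectorStoreIndex,
)

# No chunk overlap needed, since simple_summarize truncates everything
# into a single LLM call anyway
prompt_helper = PromptHelper(chunk_overlap_ratio=0.0)
service_context = ServiceContext.from_defaults(prompt_helper=prompt_helper)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# simple_summarize fits all retrieved text into one prompt, so no refine step
query_engine = index.as_query_engine(response_mode="simple_summarize")
response = query_engine.query("What does the document say about X?")
```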
Could try adding "Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly." to the system prompt.
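For example, assuming `ServiceContext.from_defaults` accepts a `system_prompt` argument (as in recent llama-index versions):

```python
from llama_index import ServiceContext

# Discourage pleasantries and meta-commentary in every completion
service_context = ServiceContext.from_defaults(
    system_prompt=(
        "Never say thank you, that you are happy to help, "
        "that you are an AI agent, etc. Just answer directly."
    ),
)
```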
This is a good idea!
From playing around with different models, it feels like better (typically larger) models tend not to do this. The default prompt from llama-index does say to return the original answer if the new context is not useful, and it seems like larger models follow that instruction without adding a thanks.
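For reference, that default refine prompt can be printed directly; a sketch, assuming the legacy `llama_index.prompts.default_prompts` module layout:

```python
# Print the default refine prompt template llama-index uses; it tells the
# model to return the original answer when the new context is not useful
from llama_index.prompts.default_prompts import DEFAULT_REFINE_PROMPT_TMPL

print(DEFAULT_REFINE_PROMPT_TMPL)
```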
By default, llama-index will try to get the LLM to refine its answer by providing more context along with its previous answer. When it does this, it sometimes thanks you for the extra context, e.g. starting its response with "Thank you for providing additional context!" or "Thank you for providing more context!".
A related issue is that sometimes the additional context provided during refinement is not useful, and the LLM will mention that it was not useful and that the original answer stands.
This could be confusing for users (as they don't know a refinement is happening), since llama-index does this on the fly. Either we figure out a way to strip these kinds of acknowledgements from the response automatically, or we make sure we never trigger a refine in the first place. The latter is probably easier to do (maybe increasing the `chunk_size_limit` argument in the `ServiceContext` will ensure this), but it could hurt answer quality if refining turns out to be genuinely beneficial compared to just using a larger chunk size limit to begin with. A sketch of both options follows.
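A rough sketch of both options (the regex pattern and the `chunk_size_limit` value here are illustrative guesses, not tested settings):

```python
import re

from llama_index import ServiceContext

# Option 1: strip acknowledgement prefixes from the response text after the fact
ACK_PREFIX = re.compile(
    r"^\s*thank you for providing (additional|more) context!?\s*",
    re.IGNORECASE,
)

def strip_acknowledgement(text: str) -> str:
    """Remove a leading 'Thank you for providing ... context!' if present."""
    return ACK_PREFIX.sub("", text, count=1)

# Option 2: raise chunk_size_limit so the retrieved context fits into a
# single LLM call and no refine step is triggered
service_context = ServiceContext.from_defaults(chunk_size_limit=4096)
```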