FlowiseAI / Flowise

Drag & drop UI to build your customized LLM flow
https://flowiseai.com
Apache License 2.0
29.92k stars 15.45k forks

[BUG] Postgres Vector Store inverted similarity calculation #2198

Open vinibs opened 5 months ago

vinibs commented 5 months ago

Describe the bug

When using Postgres as the vector store for a flow and querying it with a minimum score set, nothing is returned. Looking at the code, I noticed that the Postgres node calculates the distance between vectors and returns the results sorted in ascending order, but the VectorStore to Document node expects a similarity value, not a distance. As a result, the most relevant results for my query are discarded. I'm testing a local change that makes the Postgres node return "1 - distance" instead, which seems to be enough to fix this. If it has no other side effects, I can open a PR for this small change.
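To illustrate the "1 - distance" change described above, here is a minimal sketch. It assumes the distance is a cosine distance, and the `DistanceRow` and `SimilarityRow` shapes and the `toSimilarity` helper are hypothetical names for illustration, not Flowise's actual types.

```typescript
// Hypothetical row shapes; the real Postgres node may return something different.
interface DistanceRow {
    content: string;
    distance: number; // smaller means more similar
}

interface SimilarityRow {
    content: string;
    similarity: number; // larger means more similar
}

// Convert each distance into a similarity score using "1 - distance".
// This assumes a cosine distance, where 0 means the vectors point the same way.
function toSimilarity(rows: DistanceRow[]): SimilarityRow[] {
    return rows.map((row) => ({
        content: row.content,
        similarity: 1 - row.distance,
    }));
}
```

With this mapping, a row at distance 0 (an exact match) gets a similarity of 1, so a "the bigger, the better" comparison keeps it instead of discarding it.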

To Reproduce

Steps to reproduce the behavior:

  1. Create a flow
  2. Configure a Postgres database server
  3. Add the (Vector Store) Postgres node, with output set as "Postgres Vector Store" and set the connection up
  4. Add the VectorStore to Document node and set its input as being the Postgres node's output
  5. Pass "{{question}}" as the Query attribute of the VectorStore to Document node
  6. Add a simple custom function as an ending node and set the VectorStore to Document node's output as an input variable for the function
  7. Make the custom function only return the provided variable (for debugging purposes)
  8. Use the Upsert API to insert data into the vector store for at least two different messages (questions)
  9. Ask the chat one of the questions that were inserted into the Vector Store

Expected behavior

The output should list the stored vectors ordered from most to least similar to the question, but instead they come back ordered from least to most similar. If a minimum score is passed to the VectorStore to Document node, the query returns no results, even for the exact same question.
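The symptom can be reproduced in isolation with a small sketch. The `filterByMinScore` function below is a hypothetical stand-in for the minimum-score check, not the actual VectorStore to Document code, and the scores are made-up values chosen only to show the effect.

```typescript
interface Scored {
    content: string;
    score: number;
}

// Hypothetical stand-in for a minimum-score filter:
// keep documents at or above the threshold, best score first.
function filterByMinScore(docs: Scored[], minScore: number): Scored[] {
    return docs
        .filter((doc) => doc.score >= minScore)
        .sort((a, b) => b.score - a.score);
}

// Made-up values: the "score" here is really a distance, as in the report.
const distancesAsScores: Scored[] = [
    { content: "exact same question", score: 0 },
    { content: "unrelated question", score: 0.3 },
];

// Treating distances as scores: nothing reaches 0.8, so the result is empty.
const asReported = filterByMinScore(distancesAsScores, 0.8);

// Converting to similarity first: only the exact match passes the threshold.
const asExpected = filterByMinScore(
    distancesAsScores.map((doc) => ({ content: doc.content, score: 1 - doc.score })),
    0.8
);
```

In this sketch the exact match is the first document to be dropped when distances are read as scores, which matches the behavior described in the report.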

Screenshots

The results when querying without a minimum score (all entries are returned): Screenshot 2024-04-16 at 11 30 55

The results when querying with a minimum score of 80% (no data is returned): Screenshot 2024-04-16 at 11 31 28

The VectorStore documents' log with the calculated similarity for this case (the exact same question gets a score of 0, while unrelated questions get greater scores, which are actually greater distances): Screenshot 2024-04-16 at 11 33 30

Flow

sql-test Chatflow.json

Setup

Additional context

As mentioned above, the Postgres node seems to calculate the distance rather than the similarity when querying the vector store. I'm currently testing a change that returns "1 - distance" as the similarity score (or computes it directly in the query), but I'm not aware of the side effects this could cause, since I've been working with Flowise for just 4 days and am not very familiar with its resources. I'd like to confirm this issue before opening a pull request to fix it.

HenryHengZJ commented 5 months ago

The Postgres vector store in Flowise uses the implementation from here, and yes, it's using distance.

There's another way to do this via here, which allows you to choose cosine, innerProduct or euclidean.

We can change the implementation to use the latter one if that solves the issue.
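For reference, the three metrics named above can be sketched in plain TypeScript. The function names are hypothetical. The pgvector operators mentioned in the comments are `<=>` for cosine distance, `<->` for Euclidean distance, and `<#>` for negative inner product; all three are "smaller is better", which is why none of them can be compared directly against a minimum similarity score.

```typescript
// Dot product of two vectors of equal length.
function dot(a: number[], b: number[]): number {
    return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Cosine distance (pgvector operator <=>): 0 for identical direction, up to 2 for opposite.
function cosineDistance(a: number[], b: number[]): number {
    return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Euclidean (L2) distance (pgvector operator <->): 0 for identical vectors.
function euclideanDistance(a: number[], b: number[]): number {
    return Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));
}

// Negative inner product (pgvector operator <#>): negated so that smaller is better.
function negativeInnerProduct(a: number[], b: number[]): number {
    return -dot(a, b);
}
```

Only the cosine case maps to a similarity with a plain "1 - distance"; the other two would need their own conversion before being compared with a minimum score.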

vinibs commented 4 months ago

Hi @HenryHengZJ, thanks for your answer. I'm not sure I get your point. Do you mean the distanceStrategy attribute? If so, two questions came up about it:

The issue is actually that we pass a minimum similarity score to the block in the flow, but it is compared against the distance (where "the smaller, the better" applies instead of "the bigger, the better"), so the "minimum score" input does not behave as expected. Since I'm not very familiar with these distance strategies yet, do you think changing the strategy would solve this, or would it be better to change how the query is built so that it uses the similarity instead of the distance?
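As a sketch of the second option (building the query around similarity), the function below composes a SQL string that returns "1 - cosine distance" and filters on it. The `embedding` column name, the parameter order, and the `buildSimilarityQuery` helper are assumptions for illustration; this is not the query Flowise or the underlying implementation actually runs.

```typescript
// Hypothetical query builder. Parameters at execution time are assumed to be:
//   $1 = query embedding, $2 = minimum similarity, $3 = maximum number of rows.
// tableName is interpolated directly, so it must come from trusted configuration.
function buildSimilarityQuery(tableName: string): string {
    return [
        // <=> is pgvector's cosine distance operator, so 1 - distance is the similarity.
        `SELECT *, 1 - (embedding <=> $1) AS similarity`,
        `FROM ${tableName}`,
        `WHERE 1 - (embedding <=> $1) >= $2`,
        // Ascending distance is the same order as descending similarity.
        `ORDER BY embedding <=> $1`,
        `LIMIT $3`,
    ].join("\n");
}
```

Filtering in the query keeps the "minimum score" input in similarity terms, so the downstream node would not need to know which distance was used.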

Weilin37 commented 2 months ago

To add to this, calculating cosine similarity on the exact same vector gives a score close to 0.5, not 1.

PolygonHealth commented 4 weeks ago

Is there a follow-up on this?

Astriel commented 2 weeks ago

I would also be interested in being able to select the metric used to fetch vectors from the pgvector database.