enjalot / latent-scope

A scientific instrument for investigating latent spaces
MIT License

[bug] Nearest neighbor search errors if using non-default feature size for a text embedding model #49

Closed (hydrosquall closed 3 weeks ago)

hydrosquall commented 1 month ago

Bug reproduction steps

  1. Generate a scope in latentscope 0.3.0 using the open-ai-text-embedding-3-small model, but instead of the default feature size of 1536, change it to something smaller, such as 768.
  2. Visit the /datasets/<DATASET_NAME>/explore/<SCOPE_NAME> route and use the "nearest neighbor search" feature.
  3. Submitting terms in the search box has no effect. If you check the server logs, you'll see a message like this:

[image: screenshot of the server log message]

Bug explanation

I think the error message shows why the button has no effect. The search term was embedded with the default model size of 1536 instead of the size of the embeddings it will be matched against: https://github.com/enjalot/latent-scope/blob/b07a27bc6685f3b4b4287251bf094ee8e1767dc2/latentscope/server/search.py#L62

For nearest neighbor search to work, the query term's vector has to match the size of the embeddings for this scope.
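
To illustrate why the sizes have to agree, here is a minimal pure-Python sketch of a brute-force nearest neighbor lookup. It is not the code in latentscope's search.py; the function names and the tiny 3-dimensional vectors are made up for illustration, standing in for the 768 vs. 1536 case.

```python
import math


def cosine_similarity(query, candidate):
    """Cosine similarity that refuses vectors of different lengths."""
    if len(query) != len(candidate):
        raise ValueError(
            f"dimension mismatch: query has {len(query)}, "
            f"embedding has {len(candidate)}"
        )
    dot = sum(q * c for q, c in zip(query, candidate))
    norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(
        sum(c * c for c in candidate)
    )
    return dot / norm


def nearest_neighbors(query, embeddings, k=2):
    """Return the indices of the k embeddings most similar to the query."""
    scored = [(cosine_similarity(query, e), i) for i, e in enumerate(embeddings)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy stand-ins: the stored embeddings are 3-dimensional.
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
good_query = [1.0, 0.0, 0.0]                  # same size as the stored vectors
bad_query = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # embedded at a larger size
```

Calling `nearest_neighbors(good_query, embeddings)` works, while `nearest_neighbors(bad_query, embeddings)` raises a `ValueError`, which is the same kind of failure as embedding the query at 1536 and comparing it against 768-dimensional vectors.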

Possible fixes

  1. Have the frontend pass the scope's dimensions through to the dimensions GET parameter, or
  2. The frontend shouldn't have to worry about this: the dimensions should be set based on the size of the features in the loaded model. In that case, dimensions could be removed from the server's URL params entirely, since as far as I can see the frontend does not send dimensions as a query parameter.

https://github.com/enjalot/latent-scope/blob/b07a27bc6685f3b4b4287251bf094ee8e1767dc2/web/src/pages/Explore.jsx#L359
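
As a rough sketch of option 2, the server could derive the dimension count from the embeddings it has already loaded and only treat a client-supplied value as a sanity check. `resolve_dimensions` is a hypothetical helper, not an existing latentscope function, and it assumes the embeddings are available as a list of equal-length vectors.

```python
def resolve_dimensions(embeddings, requested=None):
    """Pick the dimension count to embed the query with.

    embeddings: vectors already loaded for the scope (all the same length).
    requested:  optional value from a `dimensions` query parameter.
    """
    stored = len(embeddings[0])
    if requested is not None and int(requested) != stored:
        # A mismatched request would produce a query vector that cannot be
        # compared against the stored embeddings, so fail early and loudly.
        raise ValueError(
            f"requested {requested} dimensions, but the scope stores {stored}"
        )
    return stored


# Example: a scope whose embeddings were generated at 768 dimensions.
scope_embeddings = [[0.0] * 768, [0.0] * 768]
dims = resolve_dimensions(scope_embeddings)
```

With this shape, the frontend can omit the parameter entirely and the query is still embedded at the size the scope was built with.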

enjalot commented 1 month ago

Thanks for the detailed bug report. This does make sense and should be accounted for.

We track the dimensions for the embedding in the scope.json (along with all the other metadata from the process), so I think it would be fine for the front end to call the nearest neighbor endpoint with the dims. The API then needs to be updated to pass the dimensions to the model. The providers that support Matryoshka take it as a parameter to embed.
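
A hedged sketch of that flow: read the dimensions from the scope metadata and hand them to the provider's embed call. The JSON layout, the key names, and `FakeMatryoshkaProvider` are all assumptions for illustration; the real scope.json schema and provider classes may differ. The fake provider mimics Matryoshka-style behavior by keeping the first `dimensions` values of a dummy vector and renormalizing.

```python
import json
import math

# Hypothetical excerpt of scope metadata; the real key names may differ.
SCOPE_JSON = '{"embedding": {"model_id": "example-model", "dimensions": 768}}'


class FakeMatryoshkaProvider:
    """Stand-in for a provider whose embed call accepts a dimensions argument."""

    full_size = 1536

    def embed(self, texts, dimensions=None):
        size = dimensions or self.full_size
        vectors = []
        for text in texts:
            # Deterministic dummy values in place of a real model call.
            raw = [((len(text) + i) % 7) + 1.0 for i in range(self.full_size)]
            head = raw[:size]  # keep only the leading dimensions
            norm = math.sqrt(sum(x * x for x in head))
            vectors.append([x / norm for x in head])  # renormalize to unit length
        return vectors


def embed_query(provider, scope, query):
    """Embed the query at the size recorded in the scope metadata."""
    dims = scope["embedding"]["dimensions"]
    return provider.embed([query], dimensions=dims)[0]


scope = json.loads(SCOPE_JSON)
vector = embed_query(FakeMatryoshkaProvider(), scope, "hello")
```

Here `len(vector)` is 768, matching the scope, whereas calling `embed` without `dimensions` falls back to the provider's full size of 1536.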

hydrosquall commented 1 month ago

Sounds good, I think the fix is straightforward 👍. The API appears to already have a setting to read the dimensions if the frontend provides them.

https://github.com/enjalot/latent-scope/pull/50
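
For reference, the client-side half of this approach amounts to adding the dimensions to the query string. The actual frontend is JavaScript; the sketch below mirrors the idea in Python. The endpoint path and the `dataset` and `query` parameter names are assumptions, and only the `dimensions` parameter name comes from the discussion above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base, dataset, query, dimensions=None):
    """Build a search URL, including dimensions only when they are known."""
    params = {"dataset": dataset, "query": query}
    if dimensions is not None:
        params["dimensions"] = dimensions
    return f"{base}?{urlencode(params)}"


def read_dimensions(url, default=None):
    """Server-side counterpart: read the optional dimensions parameter."""
    values = parse_qs(urlsplit(url).query).get("dimensions")
    return int(values[0]) if values else default


# "/api/search/nn" is a placeholder path, not a confirmed route.
url = build_search_url("/api/search/nn", "my-dataset", "cats", dimensions=768)
```

`read_dimensions(url)` returns 768 here, and returns the default when the parameter is missing, so older clients that do not send it keep their current behavior.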