argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0
3.83k stars 360 forks

[FEATURE] similarity search through query within the UI #2443

Open msminhas93 opened 1 year ago

msminhas93 commented 1 year ago

Is your feature request related to a problem? Please describe.

The UI has no way to run an embedding search from a text query typed in the search bar, which is limiting. This capability would make bulk annotation much more flexible, since you could search for concepts with a custom text query rather than starting from a fixed sample in the dataset.

Describe the solution you'd like

An option in the UI to run an embedding search from the text query. This could be a dropdown with two options:

  1. word search
  2. embedding search
image

davidberenstein1957 commented 1 year ago

Hi @msminhas93 I would love to see the feature.

We need to pin down what we want to achieve. Users who are able to compute embeddings can already do this via the Python client, so they could also use rg.load("dataset", vector=embedding). However, it might be useful to support deploying an embedding model alongside Argilla for this, like Weaviate does here or Elasticsearch 8.5 does here.

@frascuchon @dvsrepo IMO, this also aligns with https://github.com/argilla-io/argilla/issues/2150

@msminhas93 what would work best for you?

msminhas93 commented 1 year ago

Thank you for responding! I think the Python client is awesome, but for a workflow of rapid searches on custom text inputs followed by bulk annotation with a few deselections, a UI that supports embedding search would be extremely powerful. Also, domain experts can be non-technical, which would limit their ability to run such queries.

I imagine this working much like the new "search similar" feature. However, on the backend, instead of storing only the embeddings, we would also store the encoder, possibly as some kind of config. This could be as simple as the encoder name, or an embed_text function or method (on a class that subclasses some default base handling other housekeeping) that accepts text as input and returns embeddings.
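To make the idea concrete, here is a minimal sketch of what such an encoder config could look like. Everything in it is hypothetical and not part of Argilla: the `TextEncoder` base class, the `ENCODER_REGISTRY`, the `encoder_from_config` helper, and the toy `HashingEncoder`, which only stands in for a real embedding model.

```python
import hashlib
import math
from abc import ABC, abstractmethod
from typing import Dict, List


class TextEncoder(ABC):
    """Hypothetical base class: anything that turns text into a vector."""

    name: str  # identifier that would be stored in the dataset config

    @abstractmethod
    def embed_text(self, text: str) -> List[float]:
        """Accept text as input and return its embedding."""


class HashingEncoder(TextEncoder):
    """Toy stand-in for a real model: a hashed, L2-normalized bag of words."""

    name = "toy-hashing-encoder"

    def __init__(self, dim: int = 64) -> None:
        self.dim = dim

    def embed_text(self, text: str) -> List[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.md5(token.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # Empty text yields an all-zero vector instead of dividing by zero.
        return [v / norm for v in vec] if norm else vec


# Hypothetical registry: the backend would store only the encoder name
# and resolve it to a live encoder when a query comes in.
ENCODER_REGISTRY: Dict[str, TextEncoder] = {}


def register_encoder(encoder: TextEncoder) -> TextEncoder:
    ENCODER_REGISTRY[encoder.name] = encoder
    return encoder


def encoder_from_config(config: Dict[str, str]) -> TextEncoder:
    return ENCODER_REGISTRY[config["encoder"]]


encoder = register_encoder(HashingEncoder())
query_vector = encoder_from_config({"encoder": "toy-hashing-encoder"}).embed_text(
    "lost credit card"
)
```

Storing only the name keeps the config serializable; a real setup would still need the named model to be available wherever the query gets encoded.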

image

So when we press Enter with embedding search enabled, the callback runs the same logic as the find-similar method, but with the vector of the encoded input text.

An additional slider or UI component to filter the similarity score based on the input threshold would be useful too.
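The flow described above (encode the typed query, reuse the find-similar logic, then filter by a threshold) could be sketched roughly as follows. This is an illustration under stated assumptions, not Argilla's implementation: `search_by_text`, `toy_embed`, the fixed `VOCAB`, and the in-memory `stored` dict are all hypothetical, and cosine similarity is assumed as the score.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_by_text(
    query: str,
    embed_text: Callable[[str], Vector],
    stored: Dict[str, Vector],
    min_score: float = 0.0,
    limit: int = 10,
) -> List[Tuple[str, float]]:
    """Encode the query, score every stored vector, keep hits above the threshold."""
    query_vector = embed_text(query)
    scored = [(rid, cosine_similarity(query_vector, vec)) for rid, vec in stored.items()]
    hits = [(rid, score) for rid, score in scored if score >= min_score]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:limit]


# Toy encoder over a fixed vocabulary, standing in for a real embedding model.
VOCAB = ["card", "lost", "refund", "payment"]


def toy_embed(text: str) -> List[float]:
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


stored = {
    "r1": toy_embed("lost card"),
    "r2": toy_embed("refund payment"),
    "r3": toy_embed("lost payment card"),
}

# min_score plays the role of the slider value: r1 and r3 pass, r2 is dropped.
hits = search_by_text("lost card", toy_embed, stored, min_score=0.5)
```

A real backend would delegate the scoring to the vector index rather than looping over records; the threshold would then be applied to the scores the index returns.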

davidberenstein1957 commented 1 year ago

@msminhas93 Thanks

> An additional slider or UI component to filter the similarity score based on the input threshold would be useful too.

Great suggestion! Could you mention that suggestion here too?

davidberenstein1957 commented 1 year ago

@msminhas93 Better still, could you add a UI-specific issue for this and tag @Amelie-V?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 30 days since being marked as stale.

davidberenstein1957 commented 8 months ago

Revisited some old issues as proposed by Damien Tanner.

davidberenstein1957 commented 7 months ago

Potentially use BM25 as proposed here https://github.com/argilla-io/argilla/issues/2150
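For context, BM25 ranks records by term overlap with the query rather than by embedding similarity, so it needs no encoder at query time. Below is a minimal pure-Python sketch of Okapi BM25 scoring with whitespace tokenization; the `bm25_scores` function and the default `k1` and `b` values are illustrative assumptions and are not taken from the linked issue.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, documents: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = ["lost my credit card", "refund for a payment", "card payment failed"]
scores = bm25_scores("lost card", docs)
```

Here the first document matches both query terms, the third matches only "card", and the second matches neither, so it scores zero.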

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 90 days with no activity.