code-kern-ai / refinery

The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
https://www.kern.ai
Apache License 2.0

Suggesting lookup list entries #255

Open jhoetter opened 1 year ago

jhoetter commented 1 year ago

Is your feature request related to a problem? Please describe.
I want to quickly extend my lookup lists with further values, including values from records that I haven't labeled yet.

Describe the solution you'd like
With a token-based embedding, we should be able to compute n-grams (see below for more context) and run a similarity search against the entries we already have. That way, we could find synonyms and related terms in the corpus at hand, which could be very helpful.

Again, this could be triggered on demand: pressing a button in the lookup list would run the similarity search and create suggestions.
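To make the idea concrete, here is a minimal sketch of how such suggestions could be computed. It is not refinery's implementation: `suggest_entries`, `candidate_ngrams`, and `char_trigram_vector` are hypothetical names, and the character-trigram vector is only a stand-in for the token-based embedding the project would actually use. Any function that maps a string to a sparse vector could be passed as `embed`.

```python
import math
from collections import Counter


def char_trigram_vector(text):
    """Stand-in embedding (assumption): counts of character trigrams.

    A real setup would use the project's token-based embedding instead.
    """
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def candidate_ngrams(records, max_n=2):
    """Collect all word n-grams (n = 1..max_n) from the given records."""
    seen = set()
    for record in records:
        tokens = record.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
    return seen


def suggest_entries(lookup_list, records, embed=char_trigram_vector,
                    top_k=5, max_n=2):
    """Rank n-grams from the records by similarity to existing entries."""
    existing = {entry.lower() for entry in lookup_list}
    entry_vectors = [embed(entry) for entry in lookup_list]
    scored = []
    for candidate in candidate_ngrams(records, max_n):
        if candidate.lower() in existing:
            continue  # already in the lookup list, nothing to suggest
        vector = embed(candidate)
        # Score a candidate by its closest existing entry.
        score = max((cosine(vector, ev) for ev in entry_vectors), default=0.0)
        scored.append((score, candidate))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(candidate, round(score, 3)) for score, candidate in scored[:top_k]]


suggestions = suggest_entries(
    ["homework"], ["please turn in your homeworks"], max_n=1
)
```

The button in the lookup list would then call something like `suggest_entries` with the current entries and the unlabeled records, and present the top results for the user to accept or reject.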

Describe alternatives you've considered
-

Additional context
Google search result for n-grams:

An n-gram is a sequence of n words: a 2-gram (which we'll call a bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (a trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”.
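The definition above can be sketched in a few lines of Python. The helper name `word_ngrams` is hypothetical, and whitespace splitting is an assumption standing in for whatever tokenizer the project uses.

```python
def word_ngrams(text, n):
    """Return all contiguous n-word sequences in text, in order."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("please turn your homework", 2)
# ['please turn', 'turn your', 'your homework']
trigrams = word_ngrams("please turn your homework", 3)
# ['please turn your', 'turn your homework']
```

If the text has fewer than n words, the function returns an empty list.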