code-kern-ai / refinery

The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
https://www.kern.ai
Apache License 2.0
1.39k stars 66 forks

Detecting precision of values in lookup list #254

Open jhoetter opened 1 year ago

jhoetter commented 1 year ago

Is your feature request related to a problem? Please describe. I love lookup lists, but I typically have just one lookup list per label. When I collect all values in a single list, a few of them can actually hurt performance. For example, I recently labeled the word "riot" as negative and then built a labeling function that looks for the values in that list. Because of "riot", it also matched words such as "patriot" (just an example).

In general, lookup lists don't always have 100% precision. Adding words can make them worse, especially very short words that also occur inside other words with a different meaning.
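
To illustrate the "riot" / "patriot" collision, here is a minimal sketch. It assumes the labeling function does plain case-insensitive substring matching, which may not be how a given labeling function is actually written. Both helper names are hypothetical and not part of refinery.

```python
import re


def substring_hit(text: str, term: str) -> bool:
    """Naive check: the term may appear anywhere, even inside another word."""
    return term.lower() in text.lower()


def whole_word_hit(text: str, term: str) -> bool:
    """Stricter check: the term must be delimited by word boundaries."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


sentence = "She is a true patriot"
naive = substring_hit(sentence, "riot")    # matches inside "patriot"
strict = whole_word_hit(sentence, "riot")  # does not match
```

Word-boundary matching removes this particular collision, but it does not tell me which values are imprecise for other reasons, which is why I'd still like per-item stats.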

Describe the solution you'd like When I add new values to a lookup list, I'd love to see, at the item level, how precise the association between each value and the given label actually is. For instance, I want to see that "riot" has a precision of 0.5 in my lookup list.

In general, I just want to have some help that tells me if an item in a lookup list shouldn't be in there.

Describe alternatives you've considered Just theoretically, I could add a labeling function for each and every item of the lookup list and thus calculate the stats. It's clear that I don't want to do that, especially because of the I/O.

What I could do, however, is run an analysis of a lookup list on demand (e.g. when I actively request the calculation) that calculates the precision stats for every item in the list individually, given that the label has the same name as the lookup list (or, alternatively, I could enter the label whose stats I want to analyze). The computationally expensive part is not running the stats individually but gathering the data and putting it into the containerized environments. So this should be possible, as far as I know.
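
A rough sketch of what such an on-demand analysis could compute, assuming the manually labeled records are already available as `(text, label)` pairs and that values are matched as case-insensitive substrings. The function name, the data layout, and the example records are made up for illustration; none of this is refinery's actual API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def precision_per_value(
    records: Iterable[Tuple[str, str]],
    lookup_values: List[str],
    target_label: str,
) -> Dict[str, Optional[float]]:
    """For each lookup value: share of matching records that carry target_label."""
    hits = defaultdict(int)     # records in which the value occurs
    correct = defaultdict(int)  # ... and whose manual label equals target_label
    for text, label in records:
        lowered = text.lower()
        for value in lookup_values:
            if value.lower() in lowered:
                hits[value] += 1
                if label == target_label:
                    correct[value] += 1
    # None marks values that never matched, so no precision can be reported.
    return {
        value: (correct[value] / hits[value] if hits[value] else None)
        for value in lookup_values
    }


# Hypothetical labeled records, only to show the shape of the result.
records = [
    ("The riot escalated overnight", "negative"),
    ("She is a true patriot", "positive"),
    ("Terrible service, never again", "negative"),
]
stats = precision_per_value(records, ["riot", "terrible", "awful"], "negative")
```

With these three records, "riot" matches twice but only one match is labeled negative, so its precision comes out as 0.5, while "terrible" gets 1.0 and "awful" gets no value because it never matches. All values are scored in one pass over the data, so the records only need to be gathered once.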

Additional context -