code-kern-ai / refinery

The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
https://www.kern.ai
Apache License 2.0

Deactivate weak supervision labels below threshold #98

Open jhoetter opened 2 years ago

jhoetter commented 2 years ago

Is your feature request related to a problem? Please describe. We've just implemented the confidence distribution chart, which shows how your weak supervision confidence is distributed across the records. What I now want to enable is deactivating weakly supervised labels based on user input in the form of a threshold. E.g., if an entity has a confidence of only 20%, I don't want to use it in NER.

Describe the solution you'd like Not sure if this is a threshold we should inject during computation (e.g. when you execute the weak supervision), or whether it should be something you can regulate afterwards.

The latter leaves more room for playing around with scores, whereas discarding weakly supervised labels during computation based on a threshold is much easier to implement for now.
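To make the two options concrete, here is a minimal sketch. The `WeakLabel` container and `filter_by_confidence` helper are hypothetical names used only for illustration; they are not refinery's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeakLabel:
    """Hypothetical container for one weakly supervised label (not refinery's schema)."""
    record_id: str
    label: str
    confidence: float  # expected to be in [0, 1]


def filter_by_confidence(labels, threshold):
    """Keep only the labels whose confidence is at least `threshold`."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return [wl for wl in labels if wl.confidence >= threshold]


computed = [
    WeakLabel("r1", "person", 0.92),
    WeakLabel("r1", "location", 0.20),
    WeakLabel("r2", "person", 0.55),
]

# Option A: apply the threshold during computation and persist only the survivors.
# Changing the threshold later requires re-running weak supervision.
stored_a = filter_by_confidence(computed, threshold=0.5)

# Option B: persist everything and apply the threshold whenever labels are read.
# The threshold can then be adjusted freely without recomputation.
stored_b = list(computed)
view_strict = filter_by_confidence(stored_b, threshold=0.9)
view_loose = filter_by_confidence(stored_b, threshold=0.1)
```

Option A stores less and touches fewer consumers; option B keeps the raw scores available, at the cost of every reader having to apply the filter.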

Describe alternatives you've considered Just filtering in the data browser, which I don't consider a real alternative. Even though this works for classification cases, it isn't really an option for NER. In NER, you can have multiple entities per text, and thus you can have one super confident entity and one with only ~10% confidence in the same record.
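A small sketch of why record-level filtering falls short for NER. The data layout (a dict of record ids to `(start, end, label, confidence)` tuples) and both function names are assumptions made for this example, not refinery's data model.

```python
# Each record maps to its weakly supervised entities:
# (token_start, token_end, label, confidence)
records = {
    "r1": [(0, 2, "person", 0.95), (5, 6, "location", 0.10)],
    "r2": [(3, 4, "person", 0.60)],
}


def filter_records(records, threshold, aggregate=min):
    """Record-level filter: keep or drop a record as a whole.

    `aggregate` collapses the entity confidences of a record into one score.
    """
    return {
        rid: ents
        for rid, ents in records.items()
        if ents and aggregate(conf for *_, conf in ents) >= threshold
    }


def filter_entities(records, threshold):
    """Entity-level filter: drop only the entities below the threshold."""
    return {
        rid: [ent for ent in ents if ent[3] >= threshold]
        for rid, ents in records.items()
    }


by_min = filter_records(records, 0.5, aggregate=min)  # loses the 0.95 entity of r1
by_max = filter_records(records, 0.5, aggregate=max)  # keeps the 0.10 entity of r1
by_entity = filter_entities(records, 0.5)             # keeps exactly the confident ones
```

Whichever aggregate is used, the record-level filter either throws away a confident entity or lets an unconfident one through; only the entity-level filter separates the two.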

Additional context In general, I very much believe that we should enable users to choose their own strategy for weak supervision synthesis. For instance, in NER I might rather want to exclude overlapping spans (i.e. find the intersection of heuristic labels), and in other cases of span labeling I want to find as much overlap as possible. If we provide an execution environment for weak supervision, we could easily integrate the option to filter below certain thresholds.
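One way to read the two synthesis choices is as pluggable span-merging strategies. The sketch below assumes spans are `(start, end)` token offsets with an exclusive end, and interprets "as much overlap as possible" as taking the union of overlapping spans; the function names are hypothetical.

```python
from typing import Callable, Optional, Tuple

Span = Tuple[int, int]  # token offsets, end exclusive


def intersection(a: Span, b: Span) -> Optional[Span]:
    """Return the part both spans agree on, or None if they do not overlap."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def union(a: Span, b: Span) -> Optional[Span]:
    """Return the smallest span covering both, or None if they do not overlap."""
    if intersection(a, b) is None:
        return None
    return (min(a[0], b[0]), max(a[1], b[1]))


def synthesize(a: Span, b: Span, strategy: Callable[[Span, Span], Optional[Span]]):
    """Merge two heuristic spans with a user-selected strategy."""
    return strategy(a, b)


heuristic_1 = (2, 5)
heuristic_2 = (3, 7)

narrow = synthesize(heuristic_1, heuristic_2, intersection)  # (3, 5)
wide = synthesize(heuristic_1, heuristic_2, union)           # (2, 7)
```

Because the strategy is just a function argument, a confidence threshold could be applied as one more step in the same user-controlled pipeline.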

Could be combined with issue #57.

JWittmeyer commented 1 year ago

With our current setup, storing all weak supervision results will run into some problems (none of them impossible to solve, however):

  1. Since gql is used to provide rlas on the labeling page and we have no option to filter for a min WS confidence, the data needs to either be filtered on the frontend or filtering for gql needs to be implemented (which, if I recall correctly, resulted in some major memory leaks with the last filter).
  2. All display options (e.g. a chart on the overview) would need the option to filter the values. Alternatively, a global filter value needs to be stored.
  3. Export would become more difficult; however, a slice could work as a workaround here.
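The "global filter value" idea from point 2 could look roughly like the sketch below: one stored threshold and one helper that every consumer (chart, export, labeling view) calls. The setting name, the dict layout of an rla, and all function names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical project-level setting, stored once instead of per view.
GLOBAL_MIN_WS_CONFIDENCE = 0.5


def visible(rlas, min_confidence=None):
    """Return the rlas at or above the global (or an overriding) threshold."""
    limit = GLOBAL_MIN_WS_CONFIDENCE if min_confidence is None else min_confidence
    return [r for r in rlas if r["confidence"] >= limit]


def confidence_histogram(rlas, bins=10):
    """Count rlas per confidence bucket, e.g. for a distribution chart."""
    counts = Counter(min(int(r["confidence"] * bins), bins - 1) for r in rlas)
    return [counts.get(i, 0) for i in range(bins)]


def export_rows(rlas):
    """Export only the labels that pass the global threshold."""
    return [(r["record_id"], r["label"]) for r in visible(rlas)]


rlas = [
    {"record_id": "r1", "label": "person", "confidence": 0.92},
    {"record_id": "r1", "label": "location", "confidence": 0.20},
    {"record_id": "r2", "label": "person", "confidence": 0.55},
]

chart = confidence_histogram(visible(rlas))
rows = export_rows(rlas)
```

Routing every consumer through the same helper keeps the chart, the export, and the labeling page consistent, though it does not by itself address the gql filtering concern from point 1.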