Open dcecchini opened 1 year ago
My personal takeaways:
Even though there's nothing groundbreaking in the repo or paper, I do think it's really interesting to have an approach in which the model is evaluated against itself.
I agree, some of the tests are very simple, but they're also easy to implement and fast to run. So maybe we could add something like the toxicity test as a quick check without any dependency on an external library to run an ML model...
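Just to illustrate the idea, here's a minimal sketch of what a dependency-free toxicity check could look like: a whole-word match against a small blocklist, no ML model needed. The blocklist, function name, and threshold-free design here are all hypothetical, not taken from the BYOD repo:

```python
import re

# Hypothetical placeholder blocklist; a real test would use a curated list.
TOXIC_TERMS = {"idiot", "stupid", "hate"}

def is_toxic(text: str) -> bool:
    """Return True if any blocklisted term appears as a whole word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(tok in TOXIC_TERMS for tok in tokens)

print(is_toxic("You are an idiot"))  # True
print(is_toxic("What a nice day"))   # False
```

Obviously crude compared to a classifier, but it runs instantly and pulls in nothing beyond the standard library, which fits the "quick test" use case.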
Let's make a list of what is worth bringing into nlptest and add those items to the roadmap.
I just found a paper about self-evaluation; it would be interesting to read it and check whether we can implement it:
https://arxiv.org/abs/2306.13651
Explore the BYOD repository for additional tests or datasets to add to nlptest.
Examples: