FelipeLuz / dotnet-bad-word-detector

.NET library that uses machine learning to detect bad words (profanity) within a string.
Apache License 2.0

.NET Bad Word Detector

This is a fast and robust library that detects offensive language within text strings. It currently supports only English; more languages will be added soon.

How It Works

This library uses an ML.NET logistic regression model trained on thousands of human-labeled words. The trained model is embedded in the library as a resource and is consulted on every prediction.
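To make the idea concrete, here is a minimal sketch of how a logistic regression model turns a feature vector into a probability. Everything in it is an assumption for illustration: the `LogisticSketch` class, the weights, the bias and the features are hypothetical, and the source does not say how the library converts words into features.

```csharp
using System;

// Illustrative only: hypothetical weights and features, not the library's trained model.
public static class LogisticSketch
{
    // The sigmoid squashes any real-valued score into a probability in (0, 1).
    public static double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

    // Probability = sigmoid(bias + sum of weights[i] * features[i]).
    public static double Predict(double[] weights, double bias, double[] features)
    {
        if (weights.Length != features.Length)
            throw new ArgumentException("weights and features must have the same length");

        double z = bias;
        for (int i = 0; i < weights.Length; i++)
            z += weights[i] * features[i];

        return Sigmoid(z);
    }
}
```

For example, `LogisticSketch.Predict(new[] { 2.0, -1.0 }, -0.5, new[] { 1.0, 0.0 })` evaluates the sigmoid at 1.5, which is about 0.82. A classifier would then compare that probability against a threshold such as 0.5.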

Why use this library?

Until now, .NET profanity detection libraries have relied on hard-coded lists of bad words; ProfanityDetector, for instance, keeps its word list in memory. This approach has obvious problems. List-based libraries may be performant, but they are not comprehensive: they are easily defeated by misspellings and by people's creativity in swapping letters for other characters, which produces new strings that are still read as the original curse word (e.g. house written as h0us3).
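The weakness of exact-match lists is easy to reproduce. The sketch below is a hypothetical list-based filter written for this illustration (it is not the code of any existing package), and it uses the harmless placeholder "house" from the example above as its only blocked word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical list-based filter, for illustration only.
public static class ListBasedFilter
{
    // Placeholder entry standing in for a real hard-coded word list.
    private static readonly HashSet<string> Blocked =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "house" };

    private static readonly char[] Separators = { ' ', '\t', '\n', ',', '.', '!', '?' };

    // True when any whitespace- or punctuation-separated token is in the list.
    public static bool IsBlocked(string text) =>
        text.Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Any(word => Blocked.Contains(word));
}

public static class ListBasedFilterDemo
{
    public static void Main()
    {
        Console.WriteLine(ListBasedFilter.IsBlocked("my house")); // True: exact match
        Console.WriteLine(ListBasedFilter.IsBlocked("my h0us3")); // False: the variant slips through
    }
}
```

Catching the second input with a list would mean enumerating every possible substitution by hand, which is the gap a trained model is meant to close.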

Performance

In a single prediction this library was 618 times faster than the most downloaded .NET package for detecting profanity. For 100 successive predictions it was around 24 times faster.

| Package | 1 prediction | 10 predictions | 100 predictions |
| --- | --- | --- | --- |
| .NET Bad Word Detector | 0.0462 ms | 1.5508 ms | 4.1887 ms |
| ProfanityDetector | 28.5823 ms | 42.4606 ms | 102.0750 ms |

PC specs: Dell Inspiron 13, 8th-gen Intel Core i7, 16 GB RAM.
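The README does not say how these timings were collected. One simple way to take comparable measurements on your own machine is a `Stopwatch` loop like the sketch below. The `PredictionTimer` class and the stand-in predicate are hypothetical; pass `detector.IsProfane` instead to time the real library.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper; not the harness used to produce the published figures.
public static class PredictionTimer
{
    // Returns the wall-clock time, in milliseconds, taken by `count` successive calls.
    public static double TimeMs(Func<string, bool> predict, string input, int count)
    {
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
            predict(input);
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        // Stand-in predicate so the sketch runs without the package installed.
        Func<string, bool> predict = text => text.Contains("foo");

        foreach (var count in new[] { 1, 10, 100 })
            Console.WriteLine($"{count} predictions: {TimeMs(predict, "foo bar", count):F4} ms");
    }
}
```

A loop like this is noisy and includes JIT warm-up in the first call, so results will vary between runs; a dedicated tool such as BenchmarkDotNet gives more rigorous numbers.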

How to install

dotnet add package DotnetBadWordDetector

How to use it

var detector = new ProfanityDetector();

if (detector.IsProfane("foo bar"))
{
    // do something
}

It is strongly suggested to keep the detector loaded in memory for the lifetime of the application to improve performance; it uses very little memory (less than 100 KB).
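One way to keep a single instance alive is a lazily initialized static holder, sketched below. The `SharedDetector` class is hypothetical, and the `ProfanityDetector` class declared here is only a stand-in so the sketch compiles on its own; in a real project, delete the stand-in and use the package's type. The source does not state whether `IsProfane` is safe to call from several threads at once, so treat concurrent use as an assumption to verify.

```csharp
using System;

// Stand-in for the library's detector so this sketch is self-contained.
// In a real project, remove this class and reference the package type instead.
public sealed class ProfanityDetector
{
    public bool IsProfane(string text) => false; // placeholder logic
}

// Hypothetical holder that creates the detector once and reuses it.
public static class SharedDetector
{
    // Lazy<T> constructs the instance on first access and is thread-safe by default.
    private static readonly Lazy<ProfanityDetector> Holder =
        new Lazy<ProfanityDetector>(() => new ProfanityDetector());

    public static ProfanityDetector Instance => Holder.Value;
}
```

Callers then write `SharedDetector.Instance.IsProfane(text)` rather than constructing a new detector for each check.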

Accuracy, AUC and F1 score

Model quality metrics evaluation
--------------------------------
Accuracy: 98.43%
Auc: 99.49%
F1Score: 97.25%

Caveat

This library is not perfect: it is not 100% accurate, and it is context-free, so it cannot detect profane phrases composed of individually inoffensive words. People also disagree on what counts as profane.