RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

Simple spam filter (Naive Bayes) #148

Open · GeorgeGardiner opened this issue 3 years ago

GeorgeGardiner commented 3 years ago

I'm hoping to implement a simple email spam filter using RubixML + Naive Bayes, but the only example I can find that deals with text / bag-of-words is the IMDB sentiment analysis example, which is a pretty tough introduction!

Would somebody be able to help me with the simpler case?

I'm thinking the Enron email dataset could make this a nice example project, and I'm more than happy to create it if someone could assist me with the nuts and bolts of building a dataset that uses the email body text as samples.

https://www.kaggle.com/wanderfj/enron-spam
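For the nuts-and-bolts part, a minimal sketch of building parallel samples/labels arrays from that kind of download. The `enron/spam` and `enron/ham` directory names and the one-email-per-`.txt`-file layout are assumptions about how the archive is organized, not something the dataset guarantees:

```php
<?php

// Hypothetical loader: walk a directory of plain-text emails and append
// each non-empty body as a single-column sample with the given label.

function loadEmails(string $dir, string $label, array &$samples, array &$labels): void
{
    foreach (glob($dir . '/*.txt') ?: [] as $path) {
        $text = file_get_contents($path);

        if ($text !== false && trim($text) !== '') {
            $samples[] = [$text]; // one text column per sample, ready for a vectorizer
            $labels[]  = $label;
        }
    }
}

$samples = $labels = [];

loadEmails('enron/spam', 'spam', $samples, $labels);
loadEmails('enron/ham', 'ham', $samples, $labels);

// $dataset = new Rubix\ML\Datasets\Labeled($samples, $labels);
```

From there the arrays can be handed to a `Labeled` dataset and piped through a transformer pipeline like the one below.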

andrewdalpino commented 3 years ago

Hey @GeorgeGardiner great question ... summarizing our convo from the Telegram Channel (https://t.me/RubixML) ...

You can use the transformer pipeline from the Sentiment example with Gaussian Naive Bayes under the hood instead of the more complex MLP if you want. The code would look something like this ...

use Rubix\ML\Pipeline;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Transformers\ZScaleStandardizer;
use Rubix\ML\Classifiers\GaussianNB;

$estimator = new Pipeline([
    new TextNormalizer(),
    new WordCountVectorizer(10000, 2, 10000, new NGram(1, 2)),
    new TfIdfTransformer(),
    new ZScaleStandardizer(),
], new GaussianNB());

I'm not certain you need the Z Scale Standardizer, but I left it in since it may help shape the data into something more Gaussian-like. I would use cross-validation to determine whether it's needed. Note that you can also try BM25-weighted term counts instead of TF-IDF using the BM25 Transformer in the Extras package.
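As a side note on what the TF-IDF step contributes: conceptually it reweights raw term counts by how rare each term is across the corpus, so common filler words stop dominating the feature vectors. A simplified plain-PHP illustration of the idea (the library's actual TfIdfTransformer uses its own smoothing and dampening internally):

```php
<?php

// Simplified TF-IDF: weight = term frequency * inverse document frequency.
// Illustration only; not the library's implementation.

/**
 * @param array<array<string, int>> $docs term counts per document
 * @return array<array<string, float>> TF-IDF weighted documents
 */
function tfIdf(array $docs): array
{
    $n = count($docs);

    // Document frequency: in how many documents does each term appear?
    $df = [];
    foreach ($docs as $counts) {
        foreach (array_keys($counts) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $weighted = [];
    foreach ($docs as $i => $counts) {
        foreach ($counts as $term => $tf) {
            // Add 1 so terms present in every document keep a small weight.
            $idf = 1.0 + log($n / $df[$term]);

            $weighted[$i][$term] = $tf * $idf;
        }
    }

    return $weighted;
}

$docs = [
    ['free' => 2, 'prize' => 1],
    ['meeting' => 1, 'free' => 1],
];

print_r(tfIdf($docs));
```

Here 'prize' (which appears in only one document) ends up weighted higher per occurrence than 'free' (which appears in both).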

The Enron spam dataset (the pre-labeled one) seems like a great dataset to practice with.

Here's the link to the article posted in the channel for reference ...

https://towardsdatascience.com/how-to-build-and-apply-naive-bayes-classification-for-spam-filtering-2b8d3308501

Let me know if you have any more questions!

blaaat commented 3 years ago

Hi @andrewdalpino, I used your code, thanks!

The predict method seems to work great, but the proba method just returns 0.0 for each label. I'm new to Rubix ML, but it appears that https://github.com/RubixML/ML/blob/95ec40d2a5925c39ad8c23dd5668aa77f9a78aa7/src/functions.php#L50 returns INF (due to the very large-magnitude results of jointLogLikelihood).

Is this expected? Is there a solution?
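For context on where the INF comes from: exponentiating large-magnitude log-likelihoods overflows (or underflows) double precision. The standard fix is the log-sum-exp trick, which subtracts the maximum before exponentiating so the ratios survive. A plain-PHP sketch of the numerical idea, not the library's code:

```php
<?php

// Naive normalization of log-likelihoods overflows: exp(800) is INF in
// double precision, and exp(-1000) underflows to 0.0. Subtracting the
// maximum first caps the largest exponent at exp(0) = 1 while keeping
// the ratios between classes intact.

/**
 * Convert joint log-likelihoods to normalized probabilities.
 *
 * @param float[] $logLikelihoods one entry per class label
 * @return float[] probabilities that sum to 1
 */
function logLikelihoodsToProba(array $logLikelihoods): array
{
    $max = max($logLikelihoods);

    $exp = array_map(fn (float $l): float => exp($l - $max), $logLikelihoods);

    $total = array_sum($exp);

    return array_map(fn (float $e): float => $e / $total, $exp);
}

// exp(-1000) and exp(-1002) both underflow to 0.0 on their own, but the
// ratio between the two classes is recovered correctly here.
print_r(logLikelihoodsToProba([-1000.0, -1002.0]));
```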

andrey-helldar commented 1 year ago

Hello!

I have a problem detecting spam and obscenity in messages.

Perhaps I'm just not good at searching for the machine learning information I need. Please tell me what I should read to implement this idea.

Situation:

I am the administrator of a web developer chat, and I want to train a neural network to detect obscene language and spam (at the moment we have 5 categories: spam, flood, obscene, toxicity, and meaningless messages).

Each category is trained and tested separately, so I took the obscenity category, since it has the most messages:

Is Obscene: 662
Is Not Obscene: 3599

I train the network like this (I use the Laravel framework, but to keep the code short I will post only the relevant functionality):

<?php

namespace App\Neural\Datasets;

use Rubix\ML\Datasets\Labeled as BaseLabeled;

class Labeled extends BaseLabeled
{
    public function push(mixed $label, mixed $sample): void
    {
        $this->labels[]  = $label;
        $this->samples[] = is_array($sample) ? $sample : [$sample];
    }
}
<?php

declare(strict_types=1);

namespace App\Neural\Estimators;

use App\Models\Message;
use App\Neural\Datasets;
use App\Neural\Transformers\TextNormalizer;
use Illuminate\Database\Eloquent\Collection;
use Rubix\ML\Classifiers\MultilayerPerceptron;
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Estimator;
use Rubix\ML\NeuralNet\ActivationFunctions\LeakyReLU;
use Rubix\ML\NeuralNet\Layers\Activation;
use Rubix\ML\NeuralNet\Layers\BatchNorm;
use Rubix\ML\NeuralNet\Layers\Dense;
use Rubix\ML\NeuralNet\Layers\PReLU;
use Rubix\ML\NeuralNet\Optimizers\AdaMax;
use Rubix\ML\PersistentModel;
use Rubix\ML\Persisters\Filesystem;
use Rubix\ML\Pipeline;
use Rubix\ML\Tokenizers\NGram;
use Rubix\ML\Transformers\BM25Transformer;
use Rubix\ML\Transformers\RegexFilter;
use Rubix\ML\Transformers\WordCountVectorizer;

class Obscene
{
    protected string $labelPositive = 'positive';

    protected string $labelNegative = 'negative';

    public function isNegative(string $text): bool
    {
        $model = PersistentModel::load($this->filesystem());

        return in_array($this->labelNegative, $model->predict(new Unlabeled([$text])));
    }

    public function train(): void
    {
        $estimator = $this->estimator();

        $estimator->train($this->getDataset());

        $estimator->save();
    }

    protected function getDataset(): Labeled
    {
        $label = new Datasets\Labeled();

        Message::query()->chunk(1000, fn (Collection $items) => $items->each(function (Message $message) use (&$label) {
            $type = $message->is_obscene ? $this->labelNegative : $this->labelPositive;

            $label->push($type, $message->text);
        }));

        return $label;
    }

    protected function estimator(): Estimator
    {
        return new PersistentModel(
            new Pipeline([
                new RegexFilter([
                    RegexFilter::GRUBER_1,
                    RegexFilter::GRUBER_2,
                    RegexFilter::EMAIL,
                ]),
                new TextNormalizer(),
                new WordCountVectorizer(10000, 1, 0.4, new NGram(1, 2)),
                new BM25Transformer(),
            ], new MultilayerPerceptron([
                new Dense(100),
                new Activation(new LeakyReLU()),
                new Dense(100),
                new Activation(new LeakyReLU()),
                new Dense(100, 0.0, false),
                new BatchNorm(),
                new Activation(new LeakyReLU()),
                new Dense(50),
                new PReLU(),
                new Dense(50),
                new PReLU(),
            ], 256, new AdaMax(0.0001))),
            $this->filesystem()
        );
    }

    protected function filesystem(): Filesystem
    {
        return new Filesystem(__DIR__ . '/obscene.model', true);
    }
}

And using:

use App\Neural\Estimators\Obscene;

$network = new Obscene();

$network->train();

echo $network->isNegative('что-то непонятное') ? 'is obscene' : 'is not obscene';
// reports "is obscene", but that's wrong, because "что-то непонятное" in English means "something strange"

We have a Russian-language chat of web developers, and training takes place on its messages.

I look through each message and tick the appropriate category checkbox on each one; the neural network is trained on this data.

The second problem is that training each category takes about 20-22 minutes. Is that too long?

What can you advise to read about this?


PS: I tried using GaussianNB. Training takes about 8-10 seconds, but after each training run it always returns either TRUE or FALSE for any text, on the same amount of data. And I don't understand how I can force the network to detect such messages.
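(A likely reason an estimator collapses to one answer on 662 vs 3599 examples is class imbalance: always predicting the majority class already scores about 84% accuracy. One simple counter-measure is downsampling the majority class before training. A plain-PHP illustration of the idea, not a RubixML API; the library's `Labeled` dataset also has stratification helpers worth checking:)

```php
<?php

// Randomly downsample every class to the size of the smallest one so a
// lazy "always predict the majority" solution stops paying off.

/**
 * @param array<int, array{0: mixed, 1: string}> $rows [sample, label] pairs
 * @return array<int, array{0: mixed, 1: string}> class-balanced pairs
 */
function downsampleMajority(array $rows): array
{
    // Group row indices by label.
    $byLabel = [];
    foreach ($rows as $i => [$sample, $label]) {
        $byLabel[$label][] = $i;
    }

    // Size of the smallest class.
    $min = min(array_map('count', $byLabel));

    // Keep a random subset of that size from every class.
    $keep = [];
    foreach ($byLabel as $indices) {
        shuffle($indices);
        $keep = array_merge($keep, array_slice($indices, 0, $min));
    }

    sort($keep);

    return array_values(array_map(fn (int $i) => $rows[$i], $keep));
}

$rows = [
    ['a', 'spam'], ['b', 'ham'], ['c', 'ham'], ['d', 'ham'], ['e', 'spam'],
];

$balanced = downsampleMajority($rows);
// Each label now appears exactly twice (the minority-class count).
```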

UPD: I think I found a possible solution to the problem:

  1. Created a class for converting characters to Latin;
  2. Replaced MultilayerPerceptron with ClassificationTree.

Now the network mostly detects swearing, flooding, and spam correctly.

protected function estimator(): Estimator
{
    return new PersistentModel(
        new Pipeline([
            new RegexFilter([
                RegexFilter::GRUBER_1,
                RegexFilter::GRUBER_2,
                RegexFilter::EMAIL,
            ]),
            new TextNormalizer(),
            new CharsConverter(), // voku\helper\ASCII::to_ascii($value);
            new WordCountVectorizer(10000, 1, 0.8, new NGram(1, 2)),
            new BM25Transformer(),
        ], new ClassificationTree()),
        $this->filesystem()
    );
}
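The CharsConverter class above isn't shown in full; going by its inline comment it wraps voku\helper\ASCII::to_ascii(). A dependency-free sketch of the same transliteration idea using PHP's intl extension (`transliterator_transliterate` requires ext-intl, and the exact Latin spellings it produces depend on the ICU version):

```php
<?php

// Hypothetical stand-in for the CharsConverter step: transliterate
// Cyrillic (and other scripts) to plain ASCII so the vectorizer sees a
// single alphabet. Requires the intl extension.

function toAscii(string $text): string
{
    if (!function_exists('transliterator_transliterate')) {
        return $text; // intl extension not available; pass text through
    }

    $latin = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);

    // Fall back to the original text if transliteration fails.
    return $latin === false ? $text : $latin;
}

echo toAscii('что-то непонятное');
```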

Training time: [screenshot]