KockaAdmiralac / KockaLogger

Parses IRC logs of activity across Fandom, then relays it into a Discord channel, searches for spam/vandalism and more.
GNU General Public License v3.0

newusers: Spam prediction #64

Open KockaAdmiralac opened 1 year ago

KockaAdmiralac commented 1 year ago

Description

Now that the profile classification results are stored in a database, we can use them as a dataset for a machine learning model that predicts whether a profile is spam. We can use the prediction results to mark profiles with a high probability of being spam and, once the model reaches high enough accuracy (or whatever other metric we decide to look at), use it to auto-report spam profiles to SOAP.
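As a rough illustration of why a metric other than accuracy might be worth looking at: when non-spam profiles outnumber spam profiles, a model that never predicts spam can still score a high accuracy. The sketch below computes precision and recall for the spam class next to accuracy. The function name and the data are made up for illustration and are not part of KockaLogger.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for the spam class.

    labels and predictions are equal-length sequences of booleans,
    where True means "spam".
    """
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y and p)
    false_pos = sum(1 for y, p in pairs if not y and p)
    false_neg = sum(1 for y, p in pairs if y and not p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up example: 2 spam profiles and 6 non-spam profiles.
labels = [True, True, False, False, False, False, False, False]

# A "model" that never predicts spam gets most profiles right,
# but its precision and recall for spam are both zero.
never_spam = [False] * len(labels)
accuracy = sum(y == p for y, p in zip(labels, never_spam)) / len(labels)
never_spam_scores = precision_recall(labels, never_spam)

# A model that flags one real spam profile and one legitimate profile.
mixed = [True, False, True, False, False, False, False, False]
mixed_scores = precision_recall(labels, mixed)
```

For auto-reporting, precision (how many flagged profiles really are spam) is probably the number to watch, since every false positive becomes a wrong report.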

Proposed solution

This task is for tracking the initial implementation of a machine learning model that can be trained on the existing database and achieve good enough results. The procedure is as follows:

  1. Data Collection: Wait for the dataset to grow large enough. As of writing this, there are about 1000 classified spam profiles and about 6000 classified non-spam profiles, and the system has been running since August 17 (25 days), which works out to roughly 40 spam profiles and 240 non-spam profiles per day. At this rate, the dataset should reach about 10000 spam profiles in roughly eight months. (I'm not sure if we really need to wait that long.)
  2. Feature Extraction: Decide which features from the dataset to use in the model. Regardless of whether we use a neural network for the model or not, most of the profile data we have is in string form, which needs to be transformed into numeric features before being fed into the model.
  3. Training: Create a model and train it on the dataset. Try several different approaches and parameters and see which work best.
  4. Integration: Load the trained model into KockaLogger and show prediction results in the reports channel, putting a mark on those predicted as likely spam (above a certain threshold of certainty).
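Steps 2 to 4 could be prototyped along these lines. This is only a minimal sketch, not the chosen approach. It assumes each profile is reduced to a single text string, turns that string into character trigrams, trains a multinomial naive Bayes classifier with add-one smoothing, and applies a threshold to the predicted probability. The class name, the threshold value and the training strings are all made up. It also assumes both classes appear in the training data.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping, lowercased character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over character n-grams, add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = Counter(labels)  # number of profiles per class
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, text, label):
        total = sum(self.counts[label].values()) + len(self.vocab)
        # Start from the log prior of the class.
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for gram in char_ngrams(text):
            if gram in self.vocab:  # ignore n-grams never seen in training
                score += math.log((self.counts[label][gram] + 1) / total)
        return score

    def spam_probability(self, text):
        spam = self._log_score(text, True)
        ham = self._log_score(text, False)
        # Normalise the two log scores into a probability without overflow.
        top = max(spam, ham)
        spam_exp = math.exp(spam - top)
        return spam_exp / (spam_exp + math.exp(ham - top))


# Made-up training data; real rows would come from the classification database.
texts = [
    "buy cheap pills online now",
    "cheap loans online fast",
    "i like editing wiki pages",
    "fan of anime and games",
]
labels = [True, True, False, False]
model = NaiveBayes().fit(texts, labels)

THRESHOLD = 0.9  # hypothetical cut-off for marking a profile as likely spam
probability = model.spam_probability("cheap pills online")
likely_spam = probability >= THRESHOLD
```

Character n-grams are just one way to turn strings into features. Word counts, URL presence or account metadata may work better and would need to be compared on the real dataset.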

Notes

I'm not that skilled in machine learning at the time of writing this issue.