dimitrismistriotis / alt-profanity-check

A fast, robust library to check for offensive language in strings, drop-in replacement of "profanity-check".
https://pypi.org/project/alt-profanity-check/
MIT License

Dataset used for training the model is poor #12

Closed dimitrismistriotis closed 1 year ago

dimitrismistriotis commented 2 years ago

Copied from the original issue: https://gitlab.com/dimitrios/alt-profanity-check/-/issues/11.

Initial report from Jonathon Silver:

The built-in dataset is really poor. Much of the stuff marked as profanity is just people politely disagreeing with each other. Other stuff is extremely inoffensive sorts of comments (eg. "lame", "slow", "sick", "weird", "Nazi", "loser") that most people would not consider profanity. The dataset is also littered with wikipedia-isms, user handles, etc. This causes weird side effects like certain names (eg. Beth) being flagged as profanity. If an uncommon word happens to appear in a phrase marked as profanity, that uncommon word is wrongly penalized. For this reason, phrases marked as profanity ought to be honed down just to their crucial profane elements -- that or a much larger dataset used.

Of course the dataset is the secret sauce of the whole thing, and I'm not sure how a better one would be found. It would take endless hours to try to clean this one up. I gather this dataset was originally chosen as a sort of easy source for an initial PoC, but if this project wanted to become more polished and useful it would need to find a better one.

Dimitrios:

Hi Jonathon.

Thanks a lot for taking the time to write this; it got me thinking, and I will also discuss it with @koti3.

Our initial intention was to provide a drop-in replacement for the original package and to always update it with the latest versions of the libraries it depends on. Better accuracy/performance was out of our scope. Later on we decided to add some functionality that would make our lives easier and also make it a "proper" open source library, such as automating package generation with twine and formatting the whole project with black. We are moving the development to GitHub - https://github.com/dimitrismistriotis/alt-profanity-check - and will add some utilities such as GitHub Actions for testing and releasing to PyPI.

You got me thinking, though, about whether we should (a) clean up the input data, and (b) expand the input dataset.

As I see it now, based on your input, this would not break the original "promise": no longer flagging a first name just because the original implementation flagged it is not wrong. Strictly speaking it is not backward compatible, but who cares.

With that in mind we'll try to move to GitHub as soon as possible and then discuss how to move on. One idea would be to start by removing the most common first names from the dataset, building a pipeline for that (maybe with PySpark or something similar), and then adding to it; a sketch of that kind of automated fix follows below.
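
A minimal sketch of such a first-name filter, assuming the training data is a CSV with text and is_offensive columns and that a plain list of common first names is available; all file and column names here are illustrative, not the project's actual ones:

import re

import pandas as pd

# Illustrative paths and column names; the real dataset layout may differ.
data = pd.read_csv("clean_data.csv")  # columns: text, is_offensive
with open("common_first_names.txt", encoding="utf-8") as fh:
    first_names = {line.strip().lower() for line in fh if line.strip()}

def strip_first_names(text: str) -> str:
    # Drop tokens that are common first names so they are not learned as "profane".
    tokens = re.findall(r"\S+", str(text))
    kept = [t for t in tokens if t.strip(".,!?\"'").lower() not in first_names]
    return " ".join(kept)

# Only touch lines labelled as profanity; the "good" lines keep the names,
# so the model still sees them in a non-profane context.
mask = data["is_offensive"] == 1
data.loc[mask, "text"] = data.loc[mask, "text"].map(strip_first_names)
data.to_csv("clean_data_no_first_names.csv", index=False)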

Any thoughts?

Menelaos:

Hi @jsilver33,

Many thanks for your comment and for taking the time to give us some constructive feedback and share your opinion with us.

What @dimitrios mentioned above is 100% correct. We tried to keep the project alive by following and retraining with the latest versions of the libraries it depends on, mainly to serve our personal technical aspirations and as a way to help us get better as software engineers.

Although we sometimes discussed further improving it or adding new features, we never did. As I see it, enhancing it would require a significant amount of time, which I'm happy to put in since the impact and the added value it brings to the open source community are admittedly high. I'm also not sure whether adding new data would require financial resources as well, for example to have it evaluated through a crowdsourcing platform.

Just sharing my thoughts; I would like to continue this conversation so that I get yours as well.

Jonathon:

Thanks both for your replies and interest in improving this project.

Personally, I started off using a bunch of regexes to clean the dataset (eg. stripping out specific Wikipedia syntax and usernames, uncensoring swear words, removing a few common phrases); a rough sketch of that kind of cleanup follows below.
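
A rough sketch of that kind of regex cleanup, reusing the illustrative CSV layout from the earlier sketch; the patterns and the censored-word mapping are examples rather than the ones actually used:

import re

import pandas as pd

data = pd.read_csv("clean_data.csv")  # illustrative path; columns: text, is_offensive

WIKI_MARKUP = re.compile(r"\[\[.*?\]\]|\{\{.*?\}\}|==+.*?==+")  # wiki links, templates, headings
USER_HANDLE = re.compile(r"\bUser( talk)?:\S+", re.IGNORECASE)  # Wikipedia user references
UNCENSOR = {r"f\*\*k": "fuck", r"sh\*t": "shit"}                # censored spellings -> plain words

def clean(text: str) -> str:
    text = WIKI_MARKUP.sub(" ", str(text))
    text = USER_HANDLE.sub(" ", text)
    for censored, plain in UNCENSOR.items():
        text = re.sub(censored, plain, text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

data["text"] = data["text"].map(clean)
data.to_csv("clean_data_regex_cleaned.csv", index=False)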

But then I noticed that the scoring made by Wikipedia admins didn't really match how I wish to score for "profanity" in my own application. Lots of the stuff they marked as bad was really just people politely disagreeing with them. So I started along the route of rescoring the data: manually looking at all comments that didn't contain any of a large list of profane words and changing 1's to 0's if they were not actually profane in my opinion. In an hour I managed to process about 10% of the file, and then I ran out of time to work on it.

Even this would not touch on the problem of innocent words being included in the bad lines. I ran my own small dataset against the model and identified a couple dozen words being caught that shouldn't have been. I searched for and removed some of those words from the bad lines, but that wouldn't prevent my users from encountering more such words in the future. I suppose one could run a whole dictionary of words through the model to try to catch innocent words being penalized (a sketch of that follows below). Regex could possibly be used to strip innocent words out of bad lines? Stripping them out is probably best, but it would be a fine balance to do that without disturbing any contextual data in those bad lines that is actually important to the model (i.e. taken to the extreme you could just go all the way down to a small list of profane words). But I suppose any innocent words that aren't well represented in the good lines also should not be used in bad lines.
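
A sketch of that dictionary-scan idea, using the library's predict_prob over a plain word list; the word-list path and the 0.5 cut-off are illustrative choices:

from profanity_check import predict_prob

# Illustrative path: any newline-separated word list will do.
with open("english_words.txt", encoding="utf-8") as fh:
    words = [w.strip() for w in fh if w.strip()]

scores = predict_prob(words)

# Words the model considers "probably profane"; review the output manually
# to spot innocent words that the training data has wrongly penalized.
for word, score in sorted(zip(words, scores), key=lambda ws: ws[1], reverse=True):
    if score >= 0.5:
        print(f"{word}\t{score:.2f}")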

Ultimately, it would be a massive task to clean up this dataset. I don't know if it's worthwhile, whether there are better datasets out there already, or where you'd find them... That's all my thoughts, as someone who just looked into this a couple of days ago and is not yet sure if it will make it into production.

Dimitrios:

Although I have not thoroughly checked, I agree with the observation that

Lots of the stuff they marked as bad was really just people politely disagreeing with them.

is something that is becoming more and more common.

I do not know if it would be manageable to check the whole dataset, which is why I would opt for fixes that can be automated even if sub-optimal, like removing all first names. Manual edits would of course be welcome. We then have to decide whether two versions of the model should be trained, or only one with the amended data - which seems more reasonable. Either way, once we integrate with GitHub fully - we are close - we can start looking into this.

dimitrismistriotis commented 2 years ago

Continuing here.

dimitrismistriotis commented 2 years ago

Also from Peter Willemsen:

Hey there,

When using just the word 'eat', it is flagged as offensive.

>>> from profanity_check import predict, predict_prob
>>> predict(['eat'])
array([1])

I used alt-profanity-check 1.1.1

peterwilli commented 2 years ago

So retraining the model fixes everything? I currently have a workaround I call the profanity filter filter:

from profanity_check import predict

def text_is_unsafe(text: str) -> bool:
    # Some words trigger false positives like 'eat', 'lame' etc.
    # Instead of retraining the whole model I opted for simply ignoring them.
    # See https://gitlab.com/dimitrios/alt-profanity-check/-/issues/12
    filter_words = [
        'eat',
        'lame',
        'loser',
    ]
    for word in filter_words:
        text = text.replace(word, '')
    return predict([text])[0] == 1

dimitrismistriotis commented 2 years ago

Ping: is @jsilver33 the same person here as on GitLab?

dimitrismistriotis commented 2 years ago

@peterwilli wrote:

So retraining the model fixes everything? I currently have a workaround I call the profanity filter filter:

This is what should be avoided, although it is necessary for some applications at this point. We should train the model with additional and better data, or fix the current dataset; a rough sketch of what retraining could look like follows below.
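
For context, a minimal sketch of what retraining on an amended dataset could look like, in the style the original profanity-check is understood to use (a CountVectorizer feeding a calibrated LinearSVC); the CSV path, column names, and hyperparameters are assumptions, not this repository's actual training script:

import joblib
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Illustrative dataset layout: one text column, one 0/1 label column.
data = pd.read_csv("clean_data_amended.csv")
texts = data["text"].astype(str)
y = data["is_offensive"]

vectorizer = CountVectorizer(stop_words="english", min_df=0.0001)
X = vectorizer.fit_transform(texts)

# Calibration makes predict_prob-style probabilities available on top of the SVM.
model = CalibratedClassifierCV(LinearSVC(class_weight="balanced", dual=False, tol=1e-2, max_iter=100_000))
model.fit(X, y)

joblib.dump(vectorizer, "vectorizer.joblib")
joblib.dump(model, "model.joblib")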

Jm15itch commented 1 year ago

We could get better data from public datasets such as: https://hatespeechdata.com/#English-header

Each dataset has roughly 25,000 entries, I would say; maybe we could combine them by merging similar tags together. For example, some of them tag items as [(Personal attack, Not)] while others use something like [Multi-thematic (Abusive, Hateful, Normal, Spam)]. A sketch of such a merge follows below.
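
A small sketch of that kind of merge, mapping each corpus's own tag scheme onto a single binary label; the file names and tag values are placeholders for whichever datasets from that list end up being used:

import pandas as pd

# Placeholder inputs: each corpus keeps its own tag scheme.
attacks = pd.read_csv("personal_attacks.csv")  # columns: text, tag in {"Personal attack", "Not"}
multi = pd.read_csv("multi_thematic.csv")      # columns: text, tag in {"Abusive", "Hateful", "Normal", "Spam"}

# Map every source-specific tag onto one shared 0/1 label.
attacks["is_offensive"] = attacks["tag"].map({"Personal attack": 1, "Not": 0})
multi["is_offensive"] = multi["tag"].map({"Abusive": 1, "Hateful": 1, "Normal": 0, "Spam": 0})

combined = pd.concat(
    [attacks[["text", "is_offensive"]], multi[["text", "is_offensive"]]],
    ignore_index=True,
).dropna()
combined.to_csv("combined_dataset.csv", index=False)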

menkotoglou commented 1 year ago

Hey @Jm15itch, and thank you for your interest in the project.

Profane doesn’t necessarily mean hateful, so we’re unsure if these datasets would best serve our needs here. We have not checked thoroughly, but did you find a subset in there that applies only to profanity?