Alpha4615 / gibberish-detection

A nodejs module to determine if a string seems to be gibberish.
MIT License

False positives #11

Open nabilfreeman opened 6 days ago

nabilfreeman commented 6 days ago

FYI, I tried implementing this to detect gibberish emails.

Using john.doe@gmail.com and extracting the text before the @ symbol...

john.doe is detected as gibberish by this library.

Alpha4615 commented 6 days ago

Hey there @nabilfreeman

Unfortunately, I am not sure how best to optimize this library for short strings, especially unspaced strings like the ones you find in email addresses. Shorter strings give less accurate results simply because the library does not have enough of your test data to work with. There is not enough variance to meet the threshold shipped with the default model.

For example, when I test john.doe using the default model, it is sanitized to John Doe, which gives a score of 14947.42857, whereas the minimum threshold value for the model is 15745.99. Because the score is less than the threshold, the library declares it gibberish.

Testing "Hello, my name is John Doe" (which of course is something you'd never see in an email address) returns an average score of 32883.708333, which is above the 15745.99 threshold, so it is determined to be not gibberish.
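To make the comparison concrete, here is a minimal sketch of the decision described above, using only the numbers quoted in this thread. The `classify` helper is hypothetical and is not part of the library's API; it only illustrates that a score below the threshold is treated as gibberish.

```javascript
// Threshold quoted in this thread for the default model.
const DEFAULT_THRESHOLD = 15745.99;

// Hypothetical helper (not the library's API): a score below the
// threshold is classified as gibberish, anything else is not.
function classify(score, threshold = DEFAULT_THRESHOLD) {
  return score < threshold ? "gibberish" : "not gibberish";
}

console.log(classify(14947.42857));  // "John Doe" -> gibberish
console.log(classify(32883.708333)); // "Hello, my name is John Doe" -> not gibberish
```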

I am really not sure how we could go about solving this. We could potentially apply a different, more lenient threshold when a string is shorter, but then you'd have the opposite issue: false negatives. That could very well be tolerable, since you might only be concerned with catching the majority of spam sign-ups.
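A rough sketch of what a length-dependent threshold could look like follows. The lenient value (14000) and the length cutoff (12 characters) are made-up numbers for illustration only; they are not tuned and do not exist in the library. Only the 15745.99 default comes from this thread.

```javascript
const DEFAULT_THRESHOLD = 15745.99; // default model threshold quoted in this thread
const LENIENT_THRESHOLD = 14000;    // hypothetical value, not tuned
const SHORT_STRING_LENGTH = 12;     // hypothetical cutoff, not tuned

// Pick the threshold based on the length of the sanitized string.
function thresholdFor(text) {
  return text.length <= SHORT_STRING_LENGTH ? LENIENT_THRESHOLD : DEFAULT_THRESHOLD;
}

// Returns true when the score falls below the threshold chosen for this string.
function isGibberishScore(text, score) {
  return score < thresholdFor(text);
}

console.log(isGibberishScore("John Doe", 14947.42857));   // false: passes the lenient threshold
console.log(isGibberishScore("XxSwimVBallFanxX", 9449));  // true: too long for the lenient threshold
```

Any short random string scoring between the two thresholds would also slip through, which is the false-negative trade-off mentioned above.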

But generally, I think I'd advise against using this library on username strings, especially emails. Users tend to have email addresses like XxSwimVBallFanxX[@aol.com] (which returns a score of 9449 with the current model), and I think that would lead to a lot of strings that can't be tested reliably with a Markov-chain algorithm like this one.

I actually developed this library because a client was having issues with spambots registering in their application, signing up with first_name and last_name values made of random letters. After I developed this and inserted it into the registration process, those registrations permanently stopped, and I've yet to have any false positives reported to me. But in that application, I used this test only on the first name and last name, not the email address, and I only rejected the registration if BOTH fields failed the gibberish test.

So my recommendation, if you're testing for automated signups (I'm only presuming here), is to test the first name and last name and only react if both appear to be gibberish.
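A minimal sketch of that "reject only if both fail" policy is below. The detector is passed in as a function so the sketch makes no assumptions about this library's API; `shouldRejectSignup` and `toyDetector` are hypothetical names, and the toy detector (no vowels means gibberish) is only a stand-in so the example runs on its own.

```javascript
// Reject a registration only when BOTH name fields look like gibberish.
// `isGibberish` is any function that takes a string and returns a boolean.
function shouldRejectSignup(firstName, lastName, isGibberish) {
  return isGibberish(firstName) && isGibberish(lastName);
}

// Stand-in detector for demonstration only: flags strings with no vowels.
// Swap in a real gibberish check here.
const toyDetector = (text) => !/[aeiou]/i.test(text);

console.log(shouldRejectSignup("John", "Doe", toyDetector));     // false
console.log(shouldRejectSignup("xkcdqz", "Doe", toyDetector));   // false: only one field fails
console.log(shouldRejectSignup("xkcdqz", "bvnmw", toyDetector)); // true: both fields fail
```

Requiring both fields to fail trades some missed spam for fewer rejected real users, since a single unusual name no longer blocks a signup.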

Let me know your thoughts and maybe we can figure something out! I might be willing to consider creating a "lenient threshold" for shorter test strings, but that likely still wouldn't solve your problem.