adityabansod / known-spam-emails

Tiny library to check email addresses against known spam lists
3 stars 2 forks

A few invalid email addresses #2

Open pdehaan opened 9 years ago

pdehaan commented 9 years ago

I have special eyes...

" salfi_mohamed@hotmail.com" doesn't look like a valid email address. (1149)
" chaw9i_kamal@hotmail.com" doesn't look like a valid email address. (1154)
" masoud_kamal@hotmail.com" doesn't look like a valid email address. (1156)
" bassam_ali75@hotmail.com" doesn't look like a valid email address. (1162)
" khatri.hammou@gmail.com" doesn't look like a valid email address. (1168)
" bibaa.manal@gmail.com" doesn't look like a valid email address. (1171)
" jabir.imad@live.com" doesn't look like a valid email address. (1179)
" wahbi.halim@hotmail.com" doesn't look like a valid email address. (1310)
"siham–shakira@hotmail.com" doesn't look like a valid email address. (1454)
"houria.chaji@hotmail.combasma_darif66@hotmail.com" doesn't look like a valid email address. (1549)
"m" doesn't look like a valid email address. (2301)
"wahid" doesn't look like a valid email address. (2325)

(The numbers in parentheses are approximate line numbers.)

And here's my magical linting code:

'use strict';

var fs = require('fs');
var path = require('path');

var isEmail = require('isemail');

var spamListDir = path.join(__dirname, 'spam_lists');

fs.readdir(spamListDir, function (err, files) {
  if (err) {
    throw err;
  }
  files.forEach(function (file) {
    var data = fs.readFileSync(path.join(spamListDir, file), 'utf8');
    var emails = data.split('\n');
    console.log('BEFORE: %d', emails.length);
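    // Drop comment lines (starting with # or !) and empty lines.
    emails = emails.filter(function (email) {
      return !(/^(#|!|$)/.test(email));
    });
    console.log('AFTER: %d', emails.length);
    // idx is the position in the filtered array, so it only approximates the line number.
    emails.forEach(function (email, idx) {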
      if (!isEmail(email)) {
        console.log('"%s" doesn\'t look like a valid email address. (%d)', email, idx);
      }
    });
  });
});

Note: You'll need to run npm i isemail -D to install the isemail module.
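If exact line numbers would be more useful than approximate ones, one option is to record each line's position before the comment lines are filtered out. This is only a minimal sketch: findInvalid and looksLikeEmail are hypothetical names, and looksLikeEmail is a loose stand-in so the snippet runs without dependencies. In practice you would pass in isEmail from the isemail module instead.

```javascript
'use strict';

// Hypothetical helper: returns the invalid entries with their 1-based line numbers.
// isValid is any function (string) -> boolean, e.g. isEmail from isemail.
function findInvalid(data, isValid) {
  return data.split('\n')
    .map(function (text, idx) {
      // Record the position before anything is filtered out.
      return { line: idx + 1, text: text };
    })
    .filter(function (entry) {
      // Skip comment lines (# or !) and blank lines.
      return !(/^(#|!|\s*$)/.test(entry.text));
    })
    .filter(function (entry) {
      return !isValid(entry.text);
    });
}

// Stand-in validator for illustration only; much looser than isemail.
function looksLikeEmail(text) {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(text);
}

var sample = '# comment\nfoo@example.com\n bar@example.com\nwahid\n';
findInvalid(sample, looksLikeEmail).forEach(function (entry) {
  console.log('"%s" doesn\'t look like a valid email address. (line %d)', entry.text, entry.line);
});
```

Because the position is captured before filtering, the reported number matches the line in the file even when comment lines come first.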

There are a few interesting results:

  1. Some email addresses have leading/trailing whitespace (easy to fix, just use trim())
  2. Some aren't emails at all.
  3. One is missing a line break, so two addresses run together.

All of these are easy to fix locally, but if you're [manually] scraping the list from the remote blogspot site, local fixes may be moot (unless you can get it changed upstream).
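The three kinds of bad entries can be told apart with a rough check on whitespace and the number of @ signs. The classify function below is a hypothetical helper, and its heuristics are assumptions made for illustration, not what isemail does. It only sorts entries into buckets, so an entry it labels "ok" may still be invalid (the address containing an en dash, for example).

```javascript
'use strict';

// Hypothetical classifier for the three kinds of bad entries.
function classify(line) {
  var atCount = (line.match(/@/g) || []).length;
  if (atCount > 1) {
    return 'missing line break'; // two addresses run together
  }
  if (atCount === 0) {
    return 'not an email';
  }
  if (line !== line.trim()) {
    return 'whitespace'; // fixable with trim()
  }
  return 'ok'; // exactly one @ and no surrounding whitespace; not a full validation
}

var samples = [
  ' jabir.imad@live.com',
  'wahid',
  'houria.chaji@hotmail.combasma_darif66@hotmail.com'
];
samples.forEach(function (line) {
  console.log('%s: "%s"', classify(line), line);
});
```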

pdehaan commented 9 years ago

I handled the email whitespace trim() problems in #3. Still not sure what to do about the invalid emails in the list. I can fix them, but not sure what your long-term strategy is for keeping your static files and the remote list in sync.

Technically, you could add the isemail module to package.json and only push the [trimmed] address onto the known-good list if it is actually a valid email address, ignoring everything that isn't:

default:
  email = email.trim();
  if (isEmail(email)) {
    lists.push(email);
  }
  break;
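For context, here is a self-contained sketch of the same idea outside the switch statement. buildList and looksLikeEmail are hypothetical names, the comment handling (# and !) is assumed from the linting script above, and the validator is injected so the snippet runs without dependencies. In the library you would pass isEmail from isemail.

```javascript
'use strict';

// Hypothetical helper: splits raw list text into accepted and rejected entries.
function buildList(data, isValid) {
  var result = { accepted: [], rejected: [] };
  data.split('\n').forEach(function (line) {
    var email = line.trim();
    if (email === '' || /^(#|!)/.test(email)) {
      return; // blank line or comment
    }
    if (isValid(email)) {
      result.accepted.push(email);
    } else {
      result.rejected.push(email); // kept so they can be reported upstream
    }
  });
  return result;
}

// Stand-in validator for illustration only; much looser than isemail.
function looksLikeEmail(text) {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(text);
}

var lists = buildList('# header\n a@example.com \nwahid\n', looksLikeEmail);
console.log('accepted:', lists.accepted);
console.log('rejected:', lists.rejected);
```

Keeping the rejected entries around, rather than silently dropping them, makes it easier to see what still needs fixing in the remote list.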

There is still no good fix for "houria.chaji@hotmail.combasma_darif66@hotmail.com", since the entry is already malformed at the source link.

I think you may be safer just blocking all emails using a variant of this regex /@hotmail\.com$/i.