SpamExperts / pyzor

Pyzor is a Python implementation of a networked spam-blocking system that uses message signatures to identify spam.
GNU General Public License v2.0

Adjust the digest algorithm to get some unique data when the normalized message is empty #3

Open alexkiro opened 10 years ago

alexkiro commented 10 years ago

There are various situations where the message ends up empty after normalization. For example, this happens when the message is short and/or contains only links.

In this case we would still want to attempt to create a unique signature for the messages. This is, however, difficult because we don't have too much to go on.
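
To illustrate how a short, link-only message can normalize to nothing, here is a minimal sketch. It is not pyzor's actual code: the `URL_PATTERN`, `WHITESPACE_PATTERN` and `MIN_LINE_LENGTH` names and values are assumptions chosen for the example, and the real rules in digest.py may differ.

```python
import re

# Hypothetical normalization rules, for illustration only.
URL_PATTERN = re.compile(r"[a-z]+:\S+", re.IGNORECASE)
WHITESPACE_PATTERN = re.compile(r"\s")
MIN_LINE_LENGTH = 8  # assumed: shorter lines are discarded


def normalize_lines(body):
    """Return the lines that survive a simplified normalization pass."""
    kept = []
    for line in body.splitlines():
        line = URL_PATTERN.sub("", line)         # drop links
        line = WHITESPACE_PATTERN.sub("", line)  # drop all whitespace
        if len(line) >= MIN_LINE_LENGTH:
            kept.append(line)
    return kept


short_message = "Hi!\nhttp://example.com/offer\nBye"
print(normalize_lines(short_message))  # prints []
```

Every line is either a link or too short, so nothing is left to hash.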

gryphius commented 10 years ago

We're also seeing digest collisions from messages that have an attachment and no body text. The digester seems to use only the last line of base64/padding data ("AAAAAA==").

According to the digest.py code comments, pyzor should hash the whole message if we have 4 or fewer (normalized) lines:

    # If a message is this many lines or less, then we digest the whole
    # message.
    atomic_num_lines = 4

But handle_atomic is only called on the normalized lines (when the "collision" has already happened). Would it make sense to feed the full raw (undecoded) message body instead?
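
A rough sketch of that idea, under stated assumptions: `sketch_digest` is a hypothetical helper rather than part of pyzor, SHA-1 is assumed as the hash, and the branch for longer messages is simplified to hashing all normalized lines.

```python
import hashlib

ATOMIC_NUM_LINES = 4  # mirrors atomic_num_lines from the snippet above


def sketch_digest(normalized_lines, raw_body):
    """Hypothetical digest: hash the raw body when few normalized lines remain."""
    if len(normalized_lines) <= ATOMIC_NUM_LINES:
        data = raw_body  # undecoded bytes, as proposed
    else:
        data = "".join(normalized_lines).encode("utf-8")  # simplified
    return hashlib.sha1(data).hexdigest()


# Two attachment-only messages whose normalized form is the same single line.
normalized = ["AAAAAA=="]
first = sketch_digest(normalized, b"Zmlyc3QgYXR0YWNobWVudA==\nAAAAAA==")
second = sketch_digest(normalized, b"c2Vjb25kIGF0dGFjaG1lbnQ=\nAAAAAA==")
print(first != second)  # prints True
```

Hashing only the normalized line would give both messages the same digest, while the raw bodies still differ.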

tomas-mazak commented 9 years ago

Another related issue is the HTML parser used to strip HTML from messages. It isn't error tolerant, so it raises an exception on malformed HTML. This exception is then ignored by the digest routine and an empty string is returned. What about using the simpler, more efficient and much more robust code by Medeiros instead of HTMLParser?

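    def remove_html_markup(s):
        """Strip HTML tags from s; never raises on malformed markup."""
        tag = False    # True while inside <...>
        quote = False  # True while inside a quoted attribute value
        out = ""

        for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif (c == '"' or c == "'") and tag:
                quote = not quote
            elif not tag:
                out = out + c

        return out
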
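
For reference, here is how that function behaves on a few inputs, including markup with unclosed tags. The function is repeated so the example runs on its own; the sample strings are made up for illustration.

```python
def remove_html_markup(s):
    """Copy of the function proposed above, repeated for a runnable example."""
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c
    return out


print(remove_html_markup("<b>bold</b> text"))        # prints: bold text
print(remove_html_markup("<p>unclosed <b>tag"))      # prints: unclosed tag
print(remove_html_markup('<a href="x>y">link</a>'))  # prints: link
```

The unclosed tags in the second input do not raise an exception; the text outside the tags is returned.
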
alexkiro commented 9 years ago

What about using more simple, more efficient and much more robust code by Medeiros instead of HTMLParser?

It's a bit difficult to say what could or could not be useful in this situation. If we use something simple like that, a lot of irrelevant data could slip into the digest, and it could be easy to make that data unique for each message. This could also mean that a lot of different digests get reported as spam, leading to a lot of false positives.

On the other hand the empty string isn't useful at all, but won't cause any issues.

Could you give an example where this might be useful?

Schroeffu commented 6 years ago

Would really love to see an update about this problem. I just activated pyzor and had a huge number of false positives (orders and bills as PDF attachments in otherwise empty mails, daily business at work), until I local_whitelist'ed the digest da39a3ee5e6b4b0d3255bfef95601890afd80709

Pyzor really should not hit mails with an empty body. Normal end users will deactivate pyzor rather than whitelist the digest because of the false positives.
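
The whitelisted digest is the SHA-1 hash of empty input, which can be checked with the standard library. The `is_empty_digest` guard below is a hypothetical helper showing one way a caller might skip such digests; it is not an existing pyzor function.

```python
import hashlib

# SHA-1 of zero bytes; matches the digest whitelisted above.
EMPTY_DIGEST = hashlib.sha1(b"").hexdigest()
print(EMPTY_DIGEST)  # prints da39a3ee5e6b4b0d3255bfef95601890afd80709


def is_empty_digest(digest):
    """Hypothetical guard: True when the digest was computed over no data."""
    return digest.lower() == EMPTY_DIGEST
```

Any message whose normalized content is empty produces this same value, which would explain why unrelated attachment-only mails all matched one digest.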