Open alexkiro opened 10 years ago
We're also seeing digest collisions for messages that have an attachment and no body text. The digester seems to use only the last line of the base64/padding data ("AAAAAA==").
According to the comments in digest.py, pyzor should hash the whole message if it has 4 or fewer (normalized) lines:
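A minimal sketch of how such a collision can arise, assuming (as described above, not verified against pyzor's code) that only the last base64 line ends up in the hash. The payloads and the helper `last_base64_line` are made up for illustration:

```python
import base64
import hashlib

# Two different hypothetical attachments that both end in four zero bytes.
payload_a = b"A" * 57 + b"\x00" * 4
payload_b = b"B" * 57 + b"\x00" * 4


def last_base64_line(payload):
    """Return the final line of the MIME-style base64 encoding of payload."""
    # base64.encodebytes emits one 76-character line per 57 input bytes.
    return base64.encodebytes(payload).decode("ascii").splitlines()[-1]


line_a = last_base64_line(payload_a)
line_b = last_base64_line(payload_b)
print(line_a, line_b)  # AAAAAA== AAAAAA==

# Hashing only that last line gives the same value for both attachments.
digest_a = hashlib.sha1(line_a.encode("ascii")).hexdigest()
digest_b = hashlib.sha1(line_b.encode("ascii")).hexdigest()
print(digest_a == digest_b)  # True
```

Any two attachments whose final encoded line is zero padding would look identical under that assumption, regardless of the rest of the file.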
# If a message is this many lines or less, then we digest the whole
# message.
atomic_num_lines = 4
But handle_atomic is only called on the normalized lines (by which point the "collision" has already happened). Would it make sense to feed it the full raw (undecoded) message body instead?
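The proposal could look roughly like the sketch below. This is not pyzor's code: `sketch_digest` and its parameters are hypothetical, and hashing every normalized line in the long-message branch is a simplification chosen only to keep the example short.

```python
import hashlib

# Mirrors the atomic_num_lines value quoted above.
ATOMIC_NUM_LINES = 4


def sketch_digest(normalized_lines, raw_body):
    """Hypothetical helper: fall back to the raw body for short messages.

    normalized_lines: list of str left after normalization
    raw_body: bytes of the undecoded message body
    """
    h = hashlib.sha1()
    if len(normalized_lines) <= ATOMIC_NUM_LINES:
        # Too little normalized text to go on: digest the raw bytes instead.
        h.update(raw_body)
    else:
        # Simplification: hash all normalized lines.
        for line in normalized_lines:
            h.update(line.encode("utf-8"))
    return h.hexdigest()


# Same single normalized line, different raw bodies: digests now differ.
d1 = sketch_digest(["AAAAAA=="], b"raw body of message one")
d2 = sketch_digest(["AAAAAA=="], b"raw body of message two")
print(d1 != d2)  # True
```

The trade-off is that raw bodies contain per-message noise (boundaries, encodings), so near-identical spam may no longer share a digest.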
Another connected issue is the HTML parser used to strip HTML from messages. It isn't error tolerant, so it raises an exception on malformed HTML. This exception is then ignored by the digest routine and an empty string is returned. What about using simpler, more efficient, and much more robust code by Medeiros instead of HTMLParser?
def remove_html_markup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c
    return out
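A quick check of how this function behaves on malformed input. The function is repeated here (using a list join instead of string concatenation, otherwise the same logic) so the snippet runs on its own; the sample strings are made up for illustration.

```python
def remove_html_markup(s):
    tag = False      # currently inside <...>
    quote = False    # currently inside a quoted attribute value
    out = []
    for c in s:
        if c == "<" and not quote:
            tag = True
        elif c == ">" and not quote:
            tag = False
        elif c in ('"', "'") and tag:
            quote = not quote
        elif not tag:
            out.append(c)
    return "".join(out)


# Unclosed tags are no problem: no exception, text is kept.
print(repr(remove_html_markup("<b>unclosed <i>tags")))        # 'unclosed tags'

# A '>' inside a quoted attribute does not end the tag.
print(repr(remove_html_markup('<a href="x>y">link</a>')))     # 'link'

# Limitation: an apostrophe inside a double-quoted attribute confuses the
# single quote flag, and the rest of the text is swallowed.
print(repr(remove_html_markup('<a title="it\'s">link</a>')))  # ''
```

So it never raises, but it can still return an empty string on some inputs, and it does not decode entities or skip script/style content.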
> What about using simpler, more efficient, and much more robust code by Medeiros instead of HTMLParser?
It's a bit difficult to say what would or would not be useful in this situation. If we use something that simple, a lot of irrelevant data could slip into the digest, and it would then be easy to make the digest unique for each message. This could also mean that many different digests get reported as spam, leading to a lot of false positives.
On the other hand the empty string isn't useful at all, but won't cause any issues.
Could you give an example where this might be useful?
Would really love to see an update on this problem. I just activated pyzor and got a huge number of false positives (orders and bills as PDF attachments in otherwise empty mails, daily business at work) until I added the digest da39a3ee5e6b4b0d3255bfef95601890afd80709 to local_whitelist.
Pyzor really should not hit mails with an empty body. Normal end users will deactivate pyzor because of the false positives rather than maintain a whitelist.
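For reference, the whitelisted digest above is exactly the SHA-1 of zero bytes of input, which fits the picture of an empty string being hashed after normalization. A small check using only the standard library:

```python
import hashlib

# SHA-1 over empty input.
empty_sha1 = hashlib.sha1(b"").hexdigest()
print(empty_sha1)  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```

Every message that normalizes to nothing therefore shares this one digest, so a single spam report against it affects all of them.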
There are various situations where after normalization the message ends up empty. For example this happens when the message is short and/or only contains links.
In this case we would still want to attempt to create a unique signature for the messages. This is, however, difficult because we don't have too much to go on.