Open gpoo opened 8 years ago
Yes, this is a known bug (see db40e3a6184d31e5ef8ccc7d3c5cd6a9b973b813). This is an error that comes from the current implementation of mbox
in the Python's standard library.
https://docs.python.org/3.5/library/mailbox.html#mbox https://github.com/python/cpython/blob/master/Lib/mailbox.py#L859
In mlstats
we use another class to parse mbox files but it was removed in Python 3. I thought to copy its implementation to perceval
following what you said here but it has a lot of dependencies with other modules, so I choose the fastest solution.
Any idea is welcome.
The same issue that in 2.x series for mbox. I thought you had solved it here :-)
I think the actual issue is not checking if the next line is a header. That would reduce the odds of having issues. Although, still it may fail in a discussion about email, if somebody pastes header + body as part of the text ;-)
What about inspecting messages right after being identified as such by the mailbox library, looking for those mandatory headers? If those headers are not found, the message could be safely added to the body of the "previous" message, since it was not really a message.
I guess in the following code in perceval/backends/mbox.py
for message in self.parse_mbox(tmp_path):
if not self._validate_message(message):
continue
yield message
it would be a matter of, instead of continuing, adding to the body of the previous message.
Granted, this would force to defer the yield until the next message is analyzed, which could make the code a bit more complex...
If you want I can have it a try to see if I can produce a PR for this...
What about inspecting messages right after being identified as such by the mailbox library, looking for those mandatory headers? If those headers are not found, the message could be safely added to the body of the "previous" message, since it was not really a message.
The problem is there are messages that do not include some mandatory headers, such as Message-ID
, but they are messages. This means you'll be joining two messages in one.
Granted, this would force to defer the yield until the next message is analyzed, which could make the code a bit more complex... If you want I can have it a try to see if I can produce a PR for this...
That's my concern but if you find a clean way to do it, go ahead. Take into account that this is a problem related to the mbox
format itself and for messages that are not multipart, so it would make sense to add this behaviour to the parser instead to the code you pasted.
However, if the next line is not a header (^\w: something
), we know that it is not a new message. If the next line looks like a header and it is not, well, you would discard it anyway with the current behavior.
It will fail in other cases, but not in the ones that people start a sentence with From
.
And yes, the implementation of mbox in Python is naive, slow and bogus. It is unfortunate there is not a good mbox parser in Python like in other languages.
I wonder how many mboxes out there do not have the fomat:
^From\s[some text]\s\s[Date]$
I took all my compressed files to compare false positives.
$ find .mlstats/compressed -type f | wc -l
16194
$ find .mlstats/compressed -type f -exec zgrep '^From ' {} \; > from-sample.txt
From some reason I did not investigate, zgrep's output included some lines without '^From'
, so I cleaned them up:
$ egrep '^From ' from-sample.txt > from-sample-s.txt
$ wc -l from-sample-s.txt
2604383 from-sample-s.txt
From there, I explored which lines did not end on a year and did not have a time. I removed the ones with @, nobody, MAILER-DAEMON and ' at ', because those were repetitive. And I checked them manually. I did not find a single line that would not be a beginning of email.
$ egrep '[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$' from-sample-s.txt | egrep -v '(@|nobody|MAILER-DAEMON| at )' | wc -l
454
So, I could say, that any line that does not match '[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$'
is text. So, the next thing to check is how many false positives mbox
will give:
$ egrep -v '[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$' from-sample-s.txt | wc -l
3186
Not that many, to be honest. But those would be the double, because it will truncate messages where these lines belong to.
Well, I narrowed the list to check it manually and I put them in a file errors.txt
. I am pasting some of them:
$ sort errors.txt | uniq -c | sort -n
6 From http://www.osnews.com/story.php?news_id=421&page=5 :
7 From the prior art department: compare with RFC 2026.
8 From 10.81.104.254 icmp_seq=1 Frag needed and DF set (mtu = 1450)
10 From config.log:
10 From the changelog:
12 From Hagakure, The Book of the Samurai
15 From 1 to the square root of 3 | directly, unless asked |
47 From Darkness, Lead me towards the Light,
47 From Death, Lead me to Life Eternal
47 From Untruth, lead me to the Truth,
54 From WE to YOU, with love:
That said:
'^From .*[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$'
is potentially just text.'^[A-Za-z\-]: '
)You can check with your own datasets.
Thanks @gpoo. I will try the same with some sets.
In the meantime, do you know any other good mbox parser in other languages, like C, Java, Perl whatever? I'm curious to know how they fix this issue because I think it's a problem more related to the mbox format itself than to the parser. If someone fixed this in other languages we can apply the same techniques or even send a patch to the standard library to fix it.
The mbox backend parses anything starting with
From
as a new message. Therefore, the following message from OpenStack will be taken as 2 messages:The main issue is that leaves the message truncated. In this example, the last paragraph and signature will be lost.
If the purpose is to only parse metadata, the approach is ok. Although there would not be reason to store the body of the message.