chaoss / grimoirelab-perceval

Send Sir Perceval on a quest to retrieve and gather data from software repositories.
http://perceval.readthedocs.io/
GNU General Public License v3.0
290 stars 177 forks source link

mbox takes any sentence starting by `^From ` as a new message #20

Open gpoo opened 8 years ago

gpoo commented 8 years ago

The mbox backend parses anything starting with From as a new message. Therefore, the following message from OpenStack will be taken as 2 messages:

From eric at cloudscaling.com  Thu Aug  9 18:39:17 2012
From: eric at cloudscaling.com (Eric Windisch)
Date: Thu, 9 Aug 2012 14:39:17 -0400
Subject: [openstack-dev] [Openstack] Making the RPC backend a required
 configuration parameter
In-Reply-To: <5023F1C7.90205@redhat.com>
References: <606206A0BADB4E13BABB52D7E06E7804@cloudscaling.com>
 <5023F1C7.90205@redhat.com>
Message-ID: <ADD68378AEFB47FA89AA79B555F0F1A1@cloudscaling.com>

>  
> I also don't understand why having a default that doesn't work for
> anyone makes any sense.
>  
I would hope that a localhost only installation with a username and password of 'guest' include a very small number of anyones. Who is really using a completely stock, default configuration successfully, and do they really care?  Everyone else is using configuration management of sorts, at which point this discussion is moot. Even devstack changes this configuration.

If someone really is installing Nova from scratch and using a default configuration? Does setuptools also install RabbitMQ for you?  No?  Right, you need to read documentation and recognize that RabbitMQ needs to be installed.  Sure, once it is installed, no configuration is required; Unless you're /actually/ going to use it, of course.

From everything I've seen, the general recommendation on the mailing list for those installing Nova on a single node is to use devstack. In that case, the configuration is prompt-driven, and whatever changes need to be made, can be made.

Regards,
Eric Windisch

The main issue is that leaves the message truncated. In this example, the last paragraph and signature will be lost.

If the purpose is to only parse metadata, the approach is ok. Although there would not be reason to store the body of the message.

sduenas commented 8 years ago

Yes, this is a known bug (see db40e3a6184d31e5ef8ccc7d3c5cd6a9b973b813). This is an error that comes from the current implementation of mbox in the Python's standard library.

https://docs.python.org/3.5/library/mailbox.html#mbox https://github.com/python/cpython/blob/master/Lib/mailbox.py#L859

In mlstats we use another class to parse mbox files but it was removed in Python 3. I thought to copy its implementation to perceval following what you said here but it has a lot of dependencies with other modules, so I choose the fastest solution.

Any idea is welcome.

gpoo commented 8 years ago

The same issue that in 2.x series for mbox. I thought you had solved it here :-)

I think the actual issue is not checking if the next line is a header. That would reduce the odds of having issues. Although, still it may fail in a discussion about email, if somebody pastes header + body as part of the text ;-)

jgbarah commented 8 years ago

What about inspecting messages right after being identified as such by the mailbox library, looking for those mandatory headers? If those headers are not found, the message could be safely added to the body of the "previous" message, since it was not really a message.

I guess in the following code in perceval/backends/mbox.py

                 for message in self.parse_mbox(tmp_path):
                    if not self._validate_message(message):
                        continue
                     yield message

it would be a matter of, instead of continuing, adding to the body of the previous message.

Granted, this would force to defer the yield until the next message is analyzed, which could make the code a bit more complex...

If you want I can have it a try to see if I can produce a PR for this...

sduenas commented 8 years ago

What about inspecting messages right after being identified as such by the mailbox library, looking for those mandatory headers? If those headers are not found, the message could be safely added to the body of the "previous" message, since it was not really a message.

The problem is there are messages that do not include some mandatory headers, such as Message-ID, but they are messages. This means you'll be joining two messages in one.

Granted, this would force to defer the yield until the next message is analyzed, which could make the code a bit more complex... If you want I can have it a try to see if I can produce a PR for this...

That's my concern but if you find a clean way to do it, go ahead. Take into account that this is a problem related to the mbox format itself and for messages that are not multipart, so it would make sense to add this behaviour to the parser instead to the code you pasted.

gpoo commented 8 years ago

However, if the next line is not a header (^\w: something), we know that it is not a new message. If the next line looks like a header and it is not, well, you would discard it anyway with the current behavior.

It will fail in other cases, but not in the ones that people start a sentence with From.

And yes, the implementation of mbox in Python is naive, slow and bogus. It is unfortunate there is not a good mbox parser in Python like in other languages.

gpoo commented 8 years ago

I wonder how many mboxes out there do not have the fomat:

^From\s[some text]\s\s[Date]$

gpoo commented 8 years ago

I took all my compressed files to compare false positives.

$ find .mlstats/compressed -type f | wc -l
16194
$ find .mlstats/compressed -type f -exec zgrep '^From ' {} \; > from-sample.txt

From some reason I did not investigate, zgrep's output included some lines without '^From', so I cleaned them up:

$ egrep '^From ' from-sample.txt > from-sample-s.txt
$ wc  -l from-sample-s.txt
2604383 from-sample-s.txt

From there, I explored which lines did not end on a year and did not have a time. I removed the ones with @, nobody, MAILER-DAEMON and ' at ', because those were repetitive. And I checked them manually. I did not find a single line that would not be a beginning of email.

$ egrep '[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$' from-sample-s.txt | egrep -v '(@|nobody|MAILER-DAEMON| at )' | wc -l
454

So, I could say, that any line that does not match '[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$' is text. So, the next thing to check is how many false positives mbox will give:

$ egrep -v '[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$' from-sample-s.txt | wc -l
3186

Not that many, to be honest. But those would be the double, because it will truncate messages where these lines belong to.

Well, I narrowed the list to check it manually and I put them in a file errors.txt. I am pasting some of them:

$ sort errors.txt | uniq -c | sort -n
      6 From http://www.osnews.com/story.php?news_id=421&page=5 :
      7 From the prior art department: compare with RFC 2026.
      8 From 10.81.104.254 icmp_seq=1 Frag needed and DF set (mtu = 1450)
     10 From config.log:
     10 From the changelog:
     12 From Hagakure, The Book of the Samurai
     15 From 1 to the square root of 3        |    directly, unless asked |
     47 From Darkness, Lead me towards the Light,
     47 From Death, Lead me to Life Eternal
     47 From Untruth, lead me to the Truth,
     54 From WE to YOU, with love:

That said:

  1. If the line that does not match '^From .*[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$' is potentially just text.
  2. To reduce the odds, the second line should not look like a header (something like '^[A-Za-z\-]: ')

You can check with your own datasets.

sduenas commented 8 years ago

Thanks @gpoo. I will try the same with some sets.

In the meantime, do you know any other good mbox parser in other languages, like C, Java, Perl whatever? I'm curious to know how they fix this issue because I think it's a problem more related to the mbox format itself than to the parser. If someone fixed this in other languages we can apply the same techniques or even send a patch to the standard library to fix it.