MetricsGrimoire / MailingListStats

Mailing List Stats is a command line based tool used to analyze mboxes
http://metricsgrimoire.github.com/MailingListStats/
GNU General Public License v2.0

Memory error #10

Open canasdiaz opened 10 years ago

canasdiaz commented 10 years ago

$ python mlstats --db-user=root --db-password=root --db-name=mlstats_innodb --db-admin-user=root --db-admin-password=root https://lists.libresoft.es/pipermail/metrics-grimoire/ &> report.log

Traceback (most recent call last):
  File "mlstats", line 37, in <module>
    pymlstats.start()
  File "/home/luis/repos/MailingListStats/pymlstats/__init__.py", line 154, in start
    web_user, web_password)
  File "/home/luis/repos/MailingListStats/pymlstats/main.py", line 145, in __init__
    t,s,np = self.__analyze_mailing_list(mailing_list)
  File "/home/luis/repos/MailingListStats/pymlstats/main.py", line 298, in __analyze_mailing_list
    total, stored, non_parsed = self.__analyze_list_of_files(mailing_list, archives_to_analyze)
  File "/home/luis/repos/MailingListStats/pymlstats/main.py", line 451, in __analyze_list_of_files
    messages, non_parsed_messages = self.mail_parser.get_messages()
  File "/home/luis/repos/MailingListStats/pymlstats/analyzer.py", line 166, in get_messages
    filtered_message['body'])
  File "/home/luis/repos/MailingListStats/pymlstats/analyzer.py", line 265, in make_msgid
    m = hashlib.md5(message.encode('utf-8')).hexdigest()
MemoryError

sduenas commented 10 years ago

It seems there's a problem parsing the file 2007-February.txt from the metrics-grimoire mailing list. When mlstats parses it, it loops forever.

sduenas commented 10 years ago

Dave Neaty reported the same error almost two years ago...

https://bugzilla.libresoft.es/show_bug.cgi?id=325

gpoo commented 10 years ago

Maybe related to #1

mlstats does not handle attachments correctly, or for that matter, it does not handle MIME objects.

IIUC, some points of the rant on MIME parsers in http://jeffreystedfast.blogspot.ca/2013/09/time-for-rant-on-mime-parsers.html apply to mlstats (no, mlstats is not the target of the rant, but some things seem to apply to the way mlstats parses mbox files).
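
For illustration, here is a minimal sketch of MIME-aware body extraction using Python's standard email package. It is not mlstats code: the function name extract_text_body is hypothetical, and it assumes Python 3 and a single raw message passed as bytes. It keeps only text/plain parts and skips anything marked as an attachment, so a large binary attachment never ends up in the text that gets hashed or stored.

```python
import email


def extract_text_body(raw_message: bytes) -> str:
    """Return the text/plain parts of a message, skipping attachments.

    Hypothetical helper for illustration; not part of mlstats.
    """
    msg = email.message_from_bytes(raw_message)
    chunks = []
    for part in msg.walk():
        # Multipart containers carry no payload of their own.
        if part.is_multipart():
            continue
        # Ignore anything that is not plain text.
        if part.get_content_type() != "text/plain":
            continue
        # Ignore plain-text files that were sent as attachments.
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown or misspelled charset: fall back to UTF-8.
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)
```

Whether skipping attachments entirely is the right policy for mlstats is an open design choice; the sketch only shows that the standard library can separate the parts.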

gpoo commented 10 years ago

I have a branch where this problem is partially solved. The diff is here: https://github.com/gpoo/MailingListStats/commit/ad87c8f16bde7c2b6940389a6faad1c640cf93cf

and the branch is https://github.com/gpoo/MailingListStats/tree/strictmbox

I say "partially" because in some messages my branch might drop an extra (empty) line that belongs to the message. I have not looked at it in detail, and I wrote it a couple of months ago, too long ago to remember :-)

gpoo commented 10 years ago

FWIW, in the source code of mailbox, with respect to the old classes, there is the following comment:

# This algorithm, and the way it interacts with _search_start() and
# _search_end() may not be completely correct, because it doesn't check
# that the two characters preceding "From " are \n\n or the beginning of
# the file.  Fixing this would require a more extensive rewrite than is
# necessary.  For convenience, we've added a PortableUnixMailbox class
# which does no checking of the format of the 'From' line.

Even though the algorithm changed later, I don't think it changed in a way that solves this issue. Just to keep it in mind.
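
A minimal sketch of the stricter check described in that comment, assuming Python 3 and raw mbox bytes: a line starting with "From " counts as a message separator only when it is the first line of the file or directly follows an empty line. The function name split_mbox_strict is hypothetical; this is neither the code from the strictmbox branch nor the mailbox module's algorithm.

```python
FROM_PREFIX = b"From "


def split_mbox_strict(data: bytes) -> list:
    """Split raw mbox bytes into a list of messages (hypothetical helper).

    A "From " line is a separator only at the start of the file or
    right after an empty line. Any other "From " line stays in the body.
    """
    messages = []
    current = []
    prev_blank = True  # the start of the file counts as a valid boundary
    for line in data.splitlines(keepends=True):
        if line.startswith(FROM_PREFIX) and prev_blank:
            # Close the previous message, if any, and start a new one.
            if current:
                messages.append(b"".join(current))
            current = [line]
        else:
            current.append(line)
        # Remember whether this line was empty, for the next iteration.
        prev_blank = line.strip(b"\r\n") == b""
    if current:
        messages.append(b"".join(current))
    return messages
```

Note the limits of the sketch: it does not validate the rest of the separator line (address and date), so a body paragraph that starts with "From " after an empty line would still be split, and any junk before the first separator is returned as its own chunk.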