Open canasdiaz opened 10 years ago
It seems there's a problem parsing the file 2007-February.txt from the metrics-grimoire mailing list archive. When mlstats parses it, it loops forever.
Dave Neary reported the same error almost two years ago...
Maybe related to #1
mlstats does not handle attachments correctly, or for that matter it does not handle MIME objects at all.
IIUC, some points of the rant on MIME parsers in http://jeffreystedfast.blogspot.ca/2013/09/time-for-rant-on-mime-parsers.html apply to mlstats (no, mlstats is not the target of the rant, but some things seem to apply to the way mlstats parses mbox files).
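To illustrate what handling MIME objects could look like, here is a minimal sketch that uses the standard library `email` package to walk the MIME tree and keep only the inline plain-text parts. It assumes Python 3 and is not how mlstats does it; the helper name `text_parts` is made up for this example.

```python
import email
from email import policy


def text_parts(raw_bytes):
    """Return the decoded inline text/plain parts of a raw message.

    Hypothetical helper for illustration only, not part of mlstats.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    parts = []
    for part in msg.walk():
        # Multipart containers hold no content of their own.
        if part.is_multipart():
            continue
        # Skip attachments and anything that is not plain text.
        if part.get_content_type() == "text/plain" and not part.is_attachment():
            parts.append(part.get_content())
    return parts
```

Letting the `email` package split the parts means the boundary and transfer-encoding rules are handled by the library rather than by ad hoc string matching.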
I have a branch where this problem is partially solved. The diff is here: https://github.com/gpoo/MailingListStats/commit/ad87c8f16bde7c2b6940389a6faad1c640cf93cf
and the branch is here: https://github.com/gpoo/MailingListStats/tree/strictmbox
I said partially because in some messages my branch might not consider an extra (empty) line that is in the message. I have not looked in detail, and I wrote it a couple of months ago to remember :-)
FWIW, in the source code of mailbox, with respect to the old classes, there is the following comment:
# This algorithm, and the way it interacts with _search_start() and
# _search_end() may not be completely correct, because it doesn't check
# that the two characters preceding "From " are \n\n or the beginning of
# the file. Fixing this would require a more extensive rewrite than is
# necessary. For convenience, we've added a PortableUnixMailbox class
# which does no checking of the format of the 'From' line.
Even though the algorithm changed later, I don't think it changed in a way that solves this issue. Just to keep it in mind.
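For reference, the check that the quoted comment describes (a "From " line only counts as a separator at the beginning of the file or right after a blank line) can be sketched as below. This is only an illustration under that assumption, not the code in the strictmbox branch nor in the `mailbox` module; `split_mbox_strict` is a hypothetical name.

```python
import io


def split_mbox_strict(stream):
    """Yield raw messages (bytes) from a binary mbox stream.

    A line starting with b"From " begins a new message only when it is
    the first line of the file or the previous line is blank.
    Hypothetical helper for illustration only.
    """
    current = []
    prev_blank = True  # the beginning of the file counts as a boundary
    for line in stream:
        if line.startswith(b"From ") and prev_blank:
            if current:
                yield b"".join(current)
            current = [line]
        elif current:
            # Lines before the first separator are dropped.
            current.append(line)
        prev_blank = line.strip(b"\r\n") == b""
    if current:
        yield b"".join(current)


sample = (
    b"From a@example.com Thu Feb  1 00:00:00 2007\n"
    b"Subject: one\n"
    b"\n"
    b"body\n"
    b"From the start, this line is body text\n"
    b"\n"
    b"From b@example.com Fri Feb  2 00:00:00 2007\n"
    b"Subject: two\n"
    b"\n"
    b"body two\n"
)
messages = list(split_mbox_strict(io.BytesIO(sample)))
```

Because the stream is read in a single forward pass, the splitter finishes when the input ends, whatever the messages contain. It still treats an unescaped "From " line that follows a blank line inside a body as a separator, which is a limitation of the mbox format itself.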
$ python mlstats --db-user=root --db-password=root --db-name=mlstats_innodb --db-admin-user=root --db-admin-password=root https://lists.libresoft.es/pipermail/metrics-grimoire/ &> report.log