MetricsGrimoire / MailingListStats

Mailing List Stats is a command line based tool used to analyze mboxes
http://metricsgrimoire.github.com/MailingListStats/
GNU General Public License v2.0

Feature Request: Bad dates shouldn't trigger a fatal exception that halts mlstats #63

Closed geekygirldawn closed 8 years ago

geekygirldawn commented 8 years ago

Is there any way we could trap the error for badly formatted dates instead of dumping out of mlstats with a Python error? :)

Maybe check to see if it's a valid date.

Example:

Analyzing /home/dawn/.mlstats/compressed/dir.gmane.org/gmane.linux.network/0
Traceback (most recent call last):
  File "/usr/local/bin/mlstats", line 4, in <module>
    __import__('pkg_resources').run_script('mlstats==0.4', 'mlstats')
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 724, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1657, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/mlstats-0.4-py2.7.egg/EGG-INFO/scripts/mlstats", line 38, in <module>

  File "build/bdist.linux-x86_64/egg/pymlstats/__init__.py", line 175, in start
  File "build/bdist.linux-x86_64/egg/pymlstats/main.py", line 178, in __init__
  File "build/bdist.linux-x86_64/egg/pymlstats/main.py", line 230, in __analyze_mailing_list
  File "build/bdist.linux-x86_64/egg/pymlstats/main.py", line 399, in __analyze_list_of_files
  File "build/bdist.linux-x86_64/egg/pymlstats/db/session.py", line 163, in store_messages
  File "build/bdist.linux-x86_64/egg/pymlstats/db/session.py", line 129, in insert_messages
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 801, in commit
    self.transaction.commit()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 392, in commit
    self._prepare_impl()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 372, in _prepare_impl
    self.session.flush()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 2019, in flush
    self._flush(objects)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 2137, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 2101, in _flush
    flush_context.execute()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
    rec.execute(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
    uow
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/persistence.py", line 174, in save_obj
    mapper, table, insert)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/persistence.py", line 767, in _emit_insert_statements
    execute(statement, multiparams)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 914, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
    context) 
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1344, in _handle_dbapi_exception
    util.reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
    context) 
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 450, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 159, in execute
    query = query % db.literal(args)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 264, in literal
    return self.escape(o, self.encoders)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/times.py", line 87, in DateTime2literal
    return string_literal(format_TIMESTAMP(d),c)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/times.py", line 41, in format_TIMESTAMP
    return d.strftime("%Y-%m-%d %H:%M:%S")
ValueError: year=102 is before 1900; the datetime strftime() methods require year >= 1900

gpoo commented 8 years ago

Do you know which message in particular triggers the issue?
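One way to track down the offending messages is to scan the mbox for `Date` headers that are missing, unparseable, or earlier than the year `strftime()` accepts. The sketch below is not part of mlstats; it uses only the standard `mailbox` and `email.utils` modules, and the names `header_year`, `suspicious_dates`, `scan_mbox` and `MIN_YEAR` are hypothetical.

```python
import mailbox
from email.utils import parsedate_tz

# Assumption: 1900 is the floor, taken from the ValueError in the traceback.
MIN_YEAR = 1900


def header_year(date_header):
    """Return the year found in a Date header, or None if it cannot be parsed."""
    if not date_header:
        return None
    parsed = parsedate_tz(str(date_header))
    return parsed[0] if parsed else None


def suspicious_dates(messages, min_year=MIN_YEAR):
    """Yield (index, Message-ID, Date) for messages whose date looks unusable."""
    for index, message in enumerate(messages):
        date_header = message.get("Date")
        year = header_year(date_header)
        if year is None or year < min_year:
            yield index, message.get("Message-ID"), date_header


def scan_mbox(path):
    """Convenience wrapper: list the suspicious messages in an mbox file."""
    return list(suspicious_dates(mailbox.mbox(path)))
```

Calling `scan_mbox("/path/to/mbox")` on an uncompressed archive would list the index, Message-ID and raw `Date` header of each candidate, which should be enough to look the messages up in the archive.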

geekygirldawn commented 8 years ago

There were actually 15 messages with a bad date (year 0120). I temporarily fixed this by uncompressing the file, doing a search/replace in vi (the correct date was 2002), re-compressing the file, re-running mlstats, and dropping in the new file after mlstats downloaded the bad file but before the analysis kicked in.

Related: is there a way to tell mlstats to not download the files again but just do the analysis on the previously downloaded files?

On Mar 1, 2016, at 19:26, Germán Poo-Caamaño notifications@github.com wrote:

Do you know which message in particular triggers the issue?


gpoo commented 8 years ago

It would be great if you could identify the message (or a range). Then I could try to take a look at the message in gmane (the bandwidth in this coffee shop does not allow me to retrieve the full archive).

Regarding your question: I thought there was an issue open for that, or that I had fixed it somewhere (or maybe not :-)

Anyhow, you can try:

$ python pymlstats/analyzer.py /path/to/mbox

geekygirldawn commented 8 years ago

Sorry, I just got back to a computer where I could ssh into my server. I thought I had a saved copy of the bad file, but it looks like I overwrote it when I was in a hurry to kick off another run before I rushed out the door for a meetup :(

So far, it's still running, which is great!

gpoo commented 8 years ago

I found a couple: http://download.gmane.org/gmane.linux.network/272/274 They look like spam.

That said, I noticed that the limitation is in the MySQLdb package, because it uses strftime instead of isoformat. The former only handles dates from 1900 onward; the latter has no such limitation.
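To illustrate the difference, here is a minimal sketch of a timestamp formatter built on `isoformat()` rather than `strftime()`. It is not MySQLdb's code; `format_timestamp` is a hypothetical helper, and dropping the microseconds and any timezone information is an assumption made so that the output matches the `%Y-%m-%d %H:%M:%S` layout seen in the traceback.

```python
from datetime import datetime


def format_timestamp(value):
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS' without calling strftime()."""
    # Assumption: microseconds and tzinfo are discarded to mirror the
    # "%Y-%m-%d %H:%M:%S" layout used by the driver.
    naive = value.replace(microsecond=0, tzinfo=None)
    return naive.isoformat(" ")


print(format_timestamp(datetime(102, 3, 4, 5, 6, 7)))
```

Because `isoformat()` zero-pads the year to four digits, a date in year 102 is rendered with a leading zero instead of raising an error.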

The issue is not triggered using Sqlite (I have not checked Postgresql).

I think we can verify whether the date is "valid"... just because of MySQL. I don't think I will convince you (or anybody else) not to use MySQL ;-)
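Such a check could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code that ended up in mlstats: `usable_date` and `MIN_YEAR` are hypothetical names, and returning `None` (so the date is stored as NULL) is one possible policy; skipping the message entirely would be another.

```python
from datetime import datetime

# Assumption: 1900 is the lowest year strftime() accepts in Python 2,
# as reported by the ValueError in the traceback.
MIN_YEAR = 1900


def usable_date(value, min_year=MIN_YEAR):
    """Return value if the database driver can format it, otherwise None."""
    if not isinstance(value, datetime):
        return None
    if value.year < min_year:
        return None
    return value
```

A caller would pass each parsed message date through `usable_date()` before handing it to the database session, so that a single spam message with a year such as 102 cannot abort the whole run.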

gpoo commented 8 years ago

Commit https://github.com/MetricsGrimoire/MailingListStats/commit/3375536629e576b26a1d5a44eef8612026708c3c should fix the issue with the date.

I noticed that the condition was not totally right.

That said, I do think that this is a bug in the MySQL module, because it does not allow storing datetimes before 1900, which does not make sense if someone wants to store historical data.