The benchmark suite and evaluation tools haven't been used in a while. It would be nice to have both run-time timing results and performance metrics collected as part of the chatter release process, so we can look back over time and see if and how the classifiers change as we tweak the implementations and update the training data.
This ticket is to build the infrastructure so that it's easy to add a new classifier for an existing task (e.g., POS tagging, chunking) as well as to add new tasks (e.g., Named Entity Recognition), and to generate clear results that report false positives, false negatives, and true positives in a way that matches the behavior of NLTK. That gives us a clear point of comparison: someone should be able to roughly compare chatter's result numbers with those from other toolkits. I feel no particular attachment to NLTK's evaluation details, but I see no reason to invent our own.
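As a rough sketch of the kind of result this could produce, the record and scoring function below tally true positives, false positives, and false negatives against a gold standard and derive precision, recall, and F-measure the way NLTK's set-based chunk scoring does. All names here are hypothetical; this is not chatter's or NLTK's actual API, just an illustration of the shape of the evaluation output.

```haskell
module EvalSketch where

import qualified Data.Set as Set

-- Counts gathered by comparing a classifier's output to a gold standard.
data EvalCounts = EvalCounts
  { truePositives  :: Int
  , falsePositives :: Int
  , falseNegatives :: Int
  } deriving (Show)

-- Compare gold and predicted annotations (e.g. labelled chunk spans or NE
-- spans) as sets, in the spirit of NLTK's set-based chunk scoring.
score :: Ord a => Set.Set a -> Set.Set a -> EvalCounts
score gold predicted = EvalCounts
  { truePositives  = Set.size (Set.intersection gold predicted)
  , falsePositives = Set.size (Set.difference predicted gold)
  , falseNegatives = Set.size (Set.difference gold predicted)
  }

precision, recall, fMeasure :: EvalCounts -> Double
precision c = ratio (truePositives c) (truePositives c + falsePositives c)
recall    c = ratio (truePositives c) (truePositives c + falseNegatives c)
fMeasure  c
  | p + r == 0 = 0
  | otherwise  = 2 * p * r / (p + r)
  where p = precision c
        r = recall c

-- Guard against division by zero when a tally is empty.
ratio :: Int -> Int -> Double
ratio _ 0 = 0
ratio n d = fromIntegral n / fromIntegral d
```

Evaluating a chunker or NE recognizer would then amount to converting the gold and predicted annotations for each sentence into sets of (span, label) pairs, scoring them, and folding the per-sentence counts together before reporting the derived metrics.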