What is its cross-validation (CV) accuracy on its own hand-tagged set?
If we run it on specific subreddits known to be hostile and others known to be nice, does it pass a sanity check?
How accurate is it when we go outside Reddit (NYT comments or tweets)?
Try the various Alexis features to see which ones give the best performance. Try the Veronica word-embedding stuff, etc.
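
A rough sketch of how the first two checks could be wired up, in plain Python. Everything here is a placeholder: `train_keyword_model` is a toy stand-in for the real classifier, the labeled comments and subreddit samples are made up, and all names are hypothetical. The parts meant to carry over are `k_fold_accuracy` (mean held-out accuracy over k folds) and `flagged_rate` (fraction of comments labeled hostile, to compare a hostile subreddit against a nice one).

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]          # (comment text, label); 1 = hostile, 0 = not
Classifier = Callable[[str], int]  # maps comment text to a predicted label


def k_fold_accuracy(data: Sequence[Example],
                    train_fn: Callable[[List[Example]], Classifier],
                    k: int = 5, seed: int = 0) -> float:
    """Mean held-out accuracy over k folds (needs len(data) >= k)."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predict = train_fn(train)
        correct = sum(predict(text) == label for text, label in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


def flagged_rate(comments: Sequence[str], predict: Classifier) -> float:
    """Fraction of comments the classifier labels hostile."""
    return sum(predict(c) for c in comments) / len(comments)


def train_keyword_model(train: List[Example]) -> Classifier:
    """Toy stand-in model: flag a comment if it contains a word that
    appeared only in hostile training comments."""
    hostile_words, other_words = set(), set()
    for text, label in train:
        target = hostile_words if label == 1 else other_words
        target.update(text.lower().split())
    cues = hostile_words - other_words
    return lambda text: int(any(w in cues for w in text.lower().split()))


# Made-up data standing in for the hand-tagged set and subreddit samples.
LABELED: List[Example] = [
    ("you are an idiot", 1), ("shut up idiot", 1),
    ("what a stupid take", 1), ("stupid idiot go away", 1),
    ("nobody asked you moron", 1),
    ("thanks for sharing this", 0), ("great point thanks", 0),
    ("what a lovely photo", 0), ("you are very kind", 0),
    ("congrats on the new job", 0),
]
HOSTILE_SUB_SAMPLE = ["idiot", "so stupid"]
NICE_SUB_SAMPLE = ["thanks a lot", "lovely photo"]

cv_acc = k_fold_accuracy(LABELED, train_keyword_model, k=5)
model = train_keyword_model(LABELED)
hostile_rate = flagged_rate(HOSTILE_SUB_SAMPLE, model)
nice_rate = flagged_rate(NICE_SUB_SAMPLE, model)

print(f"CV accuracy: {cv_acc:.2f}")
print(f"flagged in hostile sample: {hostile_rate:.2f}, in nice sample: {nice_rate:.2f}")
```

The sanity check passes if the flagged rate is clearly higher on the hostile subreddits than on the nice ones. The same `flagged_rate` and plain accuracy comparison could be reused on hand-labeled NYT comments or tweets, and `k_fold_accuracy` could be rerun once per feature set to compare them.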