lumitim closed 8 years ago
A general comment, because it's easier to leave it here than on any sort of line-by-line basis: it's a good idea to find a PEP8 checker and run files through it. (Disclosure: I'm still trying to get mine working as well as I want it to.)
In this case, doing so reveals whitespace at the end of a number of lines or in otherwise blank lines. It also catches a couple of meaningless style points which nevertheless are kind of nice to fix, for the sake of keeping code consistent and readable: missing spaces in i + 1 in line 159, underindentation in line 110, overindentation in line 79. (In that last case, I'd recommend putting parentheses around the two-part "and" condition, which will make the indenting correct, and will somewhat improve the readability, IMHO.)
(flymake, an on-the-fly syntax checker, also reveals an unused import on line 3.)
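To illustrate the parentheses suggestion for line 79, here is a minimal sketch. The function and variable names are hypothetical, since the actual condition isn't quoted in this thread; the point is only the layout. Wrapping the two-part "and" condition in parentheses lets it continue across lines without a backslash, and the extra indentation on the continuation line (one of the layouts PEP 8 lists as acceptable) keeps it visually distinct from the body of the if.

```python
# Hypothetical stand-in for the condition on line 79; the names below
# are illustrative only and do not come from the reviewed code.
def is_candidate(token_count, spam_score, min_tokens=3, max_spam_score=0.5):
    """Return True when both parts of the condition hold."""
    # The parentheses allow the condition to span two lines, and the
    # continuation line is indented further than the block body.
    if (token_count >= min_tokens
            and spam_score < max_spam_score):
        return True
    return False
```

Both halves of the condition stay visible at a glance, which is the readability gain I had in mind.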
This whole operation would probably run orders of magnitude faster if it just used lumi_science to manually run the parts of the pipeline that it actually cares about (tokenization and collocation-finding). It might not even want to tokenize; I wouldn't be surprised if it gets better results using the SpaceSplittingReader or the like, since spam is probably susceptible to that.
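As a rough picture of what "just split on spaces and find collocations" could look like, here is a sketch using only the standard library. It does not use lumi_science or SpaceSplittingReader, whose interfaces aren't shown in this thread; space_split and count_bigrams are hypothetical stand-ins, and counting adjacent pairs is a deliberately crude substitute for real collocation-finding.

```python
from collections import Counter


def space_split(text):
    """Split on runs of whitespace only, with no further tokenization."""
    return text.split()


def count_bigrams(texts):
    """Count adjacent token pairs across all texts.

    This is a simple stand-in for collocation-finding, not the
    pipeline's actual method.
    """
    counts = Counter()
    for text in texts:
        tokens = space_split(text)
        # zip pairs each token with its immediate successor.
        counts.update(zip(tokens, tokens[1:]))
    return counts
```

For example, count_bigrams(["buy cheap pills", "buy cheap watches"]) counts the pair ("buy", "cheap") twice and every other pair once.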
Just leaving this as food for thought...
Closing as obsolete due to code from Tim O that Dan already has elsewhere.
Aaaaaand already messed up with git.