ContinuumIO / topik

A Topic Modeling toolbox
BSD 3-Clause "New" or "Revised" License

Various fixes + logging + refactoring. #79

Closed brianrusso closed 8 years ago

brianrusso commented 8 years ago

- Added numpy 1.9.4 as a requirement (an argpartition bug was showing up in the termite parsing code; it was fixed in np 1.9.4, numpy issue 5524).
- Added requirements for nose and stop_words.
- In fileio/in_document_folder.py: added support to ignore invalid UTF, progressing normally and logging the fact that we encountered an error.
- Added suitable test data (_junk) and a test case to test_in_document_folder.
- Added ConnectionError handling for the elasticsearch tests; if elasticsearch is not running, the tests are simply skipped.
- Corrected tokenizer names in simple_run/cli.py.
- Added stopword support to simple_run/run.py.
- Corrected tokenizer names in simple_run/run.py.
- Added logging in simple_run/run.py.
- Tee the generator in entities.py to avoid exhaustion.
- Added quadgram support and refactored code in ngrams.py.
- Tee the generator in ngrams.py and added some logging.
- Added an appropriate test case for quadgrams and tweaked test data in test_ngrams.py.
- Added a test case using a generator that demonstrates the exhaustion problem.

All tests are now succeeding (NB: the ElasticSearch ones were not tested; there are no changes there aside from exception handling in the tests, though.)

brittainhard commented 8 years ago

@brianrusso are there any particular issues you're trying to solve with these commits? Some context here would be helpful.

brianrusso commented 8 years ago

Some are bugfixes, like the generator teeing. That was causing things not to work: a lot of your code assumes it is given iterators that can be walked more than once, but you actually create a lot of generators, which you then exhaust. For example, one loop iterates over the generator to build foo, and a later loop iterates over the same generator to build bar. The second loop gets nothing, which produces a null corpus error because the generator is already exhausted.
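To make the exhaustion problem concrete, here is a minimal sketch in plain Python. It is not the topik code: `read_docs` is a hypothetical stand-in for a corpus reader, and the two passes are arbitrary illustrations. It shows a second pass over a consumed generator coming back empty, and `itertools.tee` as one way to get two independent iterators.

```python
import itertools


def read_docs():
    """Hypothetical stand-in for a corpus reader that yields documents lazily."""
    for text in ["topic models", "latent dirichlet allocation"]:
        yield text


# Problem: a generator can only be consumed once.
docs = read_docs()
first_pass = [d.upper() for d in docs]
second_pass = [d.split() for d in docs]  # empty: docs is already exhausted

# One fix: split the generator into independent iterators before consuming it.
docs_a, docs_b = itertools.tee(read_docs())
upper = [d.upper() for d in docs_a]
tokens = [d.split() for d in docs_b]
```

Note that `tee` buffers items that one iterator has seen and the other has not, so memory use grows if one pass runs far ahead of the other. For a corpus that fits in memory, materializing it with `list(...)` is a simpler alternative.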

The junk UTF code handling is for some data I am working with now (deTEX'd academic papers - end up with junk characters in them).
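As a rough illustration of the "ignore invalid UTF, but log it" behaviour, the sketch below uses only the standard library. The helper name `decode_document` and its `name` parameter are assumptions made for this example; the actual code in fileio/in_document_folder.py may be structured differently.

```python
import logging

logger = logging.getLogger(__name__)


def decode_document(raw, name="<unknown>"):
    """Decode bytes as UTF-8; on invalid input, log a warning and drop the bad bytes.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        logger.warning(
            "Invalid UTF-8 in %s at byte %d; ignoring undecodable bytes",
            name, err.start,
        )
        return raw.decode("utf-8", errors="ignore")


clean = decode_document(b"caf\xc3\xa9")                   # valid UTF-8
junk = decode_document(b"abc\xffdef", name="paper.txt")   # 0xff is never valid in UTF-8
```

Decoding strictly first means the warning fires only for documents that actually contain bad bytes, so clean files produce no log noise.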

quadgrams is just a feature enhancement.

Most of the rest is just cleanup and minor refactoring.

msarahan commented 8 years ago

Lgtm, but only looking from a smart phone. Thanks @brianrusso. Please merge if you're happy here @brittainhard