Simplified the bookkeeping logic (for now) to always fill the context afresh. This ensures that we retrieve predictions for the exact text passed to the callback.
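To illustrate the simplified bookkeeping, here is a minimal sketch (the class and method names are hypothetical, not the PR's actual code): rather than incrementally updating the context as symbols arrive, the context is discarded and rebuilt from the full text handed to the callback on every request.

```typescript
// Hypothetical illustration of "always fill the context afresh".
class ContextTracker {
  private context: string[] = [];

  // Called on every prediction request with the full text from the callback.
  fillAfresh(text: string): void {
    this.context = [];            // drop any stale incremental state
    for (const ch of text) {
      this.context.push(ch);      // re-enter every symbol from scratch
    }
  }

  current(): string {
    return this.context.join("");
  }
}

const tracker = new ContextTracker();
tracker.fillAfresh("hello");
tracker.fillAfresh("help");       // previous context is fully replaced
console.log(tracker.current());   // "help"
```

The trade-off is O(n) work per prediction instead of O(1) incremental updates, which is exactly what the later speed optimization pass would address.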
Added debugging functions to verify that the interface to PPM is sane (these are enabled via the verbosity flag).
Replaced the tiny set of Enron sentences with two more reasonably sized corpora: Alice's Adventures in Wonderland and The Adventures of Sherlock Holmes from Project Gutenberg. For now these are stored as const strings under third_party/gutenberg; the LICENSE file points there.
The predictor now exposes two APIs: the actual predictor (ppmModelPredict) and a function that retrains the model from scratch using the static training data plus the text supplied by the caller (ppmModelReset).
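A rough sketch of how the two entry points fit together is below. The function names ppmModelPredict and ppmModelReset come from this PR, but the parameter lists, the PpmModel shape, and the stub bodies are illustrative assumptions, not the real signatures.

```typescript
// Minimal stand-in model state for illustration only.
interface PpmModel {
  vocabulary: string[];
  // ...the real model would hold its PPM trie/context state here
}

// Retrains the model from scratch: static Gutenberg training data plus
// whatever text the caller supplies. (Stub body; hypothetical signature.)
function ppmModelReset(model: PpmModel, callerText: string): void {
  // real code would rebuild the model from both text sources
}

// Returns a probability for each vocabulary symbol given `context`.
// (Stub body: uniform distribution as a placeholder.)
function ppmModelPredict(model: PpmModel, context: string): number[] {
  return model.vocabulary.map(() => 1 / model.vocabulary.length);
}

// Expected calling pattern: reset once, then query predictions repeatedly.
const model: PpmModel = { vocabulary: ["a", "b", "c"] };
ppmModelReset(model, "text supplied by the caller");
const probs = ppmModelPredict(model, "th");
console.log(probs.length);        // one probability per vocabulary symbol
```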
Once we are in agreement that this predictor functions as expected, I'll optimize for speed and possibly memory.