add good-turing smoothing to distribution.jl

chucklesoclock commented 9 years ago

Good-Turing is noted in the journals as the best unigram smoother. I have a skeleton of pseudocode written.

chucklesoclock commented 9 years ago

So I haven't made much progress during spring break.

But what I will need for Good-Turing smoothing is the awkwardly named "frequency of frequencies" vector.

That is, if the text is "He uses statistics like a drunken man uses a lamp post—more for support than illumination," (Andrew Lang) then our feature vector fv will be:

he --> 1
uses --> 2
statistics --> 1
like --> 1
a --> 2
drunken --> 1
man --> 1
lamp --> 1
post --> 1
more --> 1
for --> 1
support --> 1
than --> 1
illumination --> 1

But the frequency of frequencies vector will be

keys occurring once --> 12
keys occurring twice --> 2
keys occurring >= thrice --> 0

In general, if r = (a frequency value for keys), then I need an N_r = (how many keys occur r times).
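As a rough illustration of the N_r bookkeeping described above, here is a minimal sketch. It is written in Python only for brevity and is not code from TextMining.jl; the function names `feature_vector` and `frequency_of_frequencies` and the simple letters-only tokenizer are assumptions made for the example. The same idea should port to Julia with a dictionary from Int to Int.

```python
import re
from collections import Counter


def feature_vector(text):
    """Count occurrences of each lowercased word.

    The letters-only tokenizer is an assumption for this example,
    not the tokenizer the package uses.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def frequency_of_frequencies(fv):
    """Map each frequency r to N_r, the number of keys occurring exactly r times."""
    return Counter(fv.values())


text = ("He uses statistics like a drunken man uses a lamp post - "
        "more for support than illumination")

fv = feature_vector(text)
n_r = frequency_of_frequencies(fv)

print(n_r[1])  # keys occurring once: 12
print(n_r[2])  # keys occurring twice: 2
print(n_r[3])  # keys occurring three times: 0 (Counter returns 0 for missing r)
```

Using a counter keyed by r means any frequency that never occurs simply reads as 0, which matches the ">= thrice --> 0" row in the example.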

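Its main benefit is an estimate of the total probability of unseen keys = N_1 / N, where N = (total number of key occurrences, i.e. the sum of r * N_r over all r). For the example above that is 12 / 16 = 0.75.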
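A small sketch of that unseen-key estimate, again in Python for brevity rather than as the package's implementation. It takes N to be the total number of key occurrences (the sum of all counts), which is the usual Good-Turing convention; the function name `unseen_mass` is hypothetical.

```python
def unseen_mass(fv):
    """Good-Turing estimate of the total probability of unseen keys: N_1 / N.

    fv maps each key to its count. N_1 is the number of keys seen exactly
    once, and N is the total number of occurrences (sum of all counts).
    """
    n_1 = sum(1 for count in fv.values() if count == 1)
    n = sum(fv.values())
    if n == 0:
        raise ValueError("cannot estimate from an empty feature vector")
    return n_1 / n


# Feature vector from the Andrew Lang example above.
fv = {
    "he": 1, "uses": 2, "statistics": 1, "like": 1, "a": 2,
    "drunken": 1, "man": 1, "lamp": 1, "post": 1, "more": 1,
    "for": 1, "support": 1, "than": 1, "illumination": 1,
}

p0 = unseen_mass(fv)
print(p0)  # 12 / 16 = 0.75
```

With such a short text most keys occur once, so the estimate reserves a large share of the probability for unseen keys; on a bigger corpus N_1 / N should be much smaller.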

There also might be code out there already that does some of the heavy lifting: http://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation#cite_ref-6

I'll get on it tomorrow

mtabor150 commented 9 years ago

is this done?

chucklesoclock commented 9 years ago

Almost. I need a little help with best-practice implementation.


SLU-TMI / TextMining.jl

add good-turing smoothing to distribution.jl #54