SLU-TMI / TextMining.jl

add good-turing smoothing to distribution.jl #54

Open chucklesoclock opened 9 years ago

chucklesoclock commented 9 years ago

It's noted in all the journals as the best unigram smoother. I have a skeleton of pseudocode written.

chucklesoclock commented 9 years ago

So I haven't made much progress during spring break.

But what I will need for good-turing smoothing is a list of the awkwardly-worded "frequency of frequencies vector".

That is, if the text is "He uses statistics like a drunken man uses a lamp post—more for support than illumination," (Andrew Lang) then our feature vector fv will be:

But the frequency of frequencies vector will be

In general, if r = (a frequency value for keys), then I need an N_r = (how many keys occur r times).
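
A minimal sketch of that counting step, in Python for brevity rather than the package's Julia. The names `feature_vector` and `freq_of_freqs` are hypothetical, and the tokenization (lowercase, split on anything that is not a letter) is an assumption, not necessarily what TextMining.jl does:

```python
import re
from collections import Counter

def feature_vector(text):
    """Map each key (word) to its frequency r. The tokenization is an assumption."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

def freq_of_freqs(fv):
    """Map each frequency r to N_r, the number of keys that occur exactly r times."""
    return Counter(fv.values())

# The Andrew Lang quote from above; \u2014 is the dash between "post" and "more".
text = ("He uses statistics like a drunken man uses a lamp post\u2014more "
        "for support than illumination,")
fv = feature_vector(text)
n_r = freq_of_freqs(fv)
```

With this tokenization the quote has 16 tokens over 14 distinct keys: "uses" and "a" occur twice and everything else once, so N_1 = 12 and N_2 = 2.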

Its main benefit is an output of the probability of unseen keys = N_1 / N, where N = (total number of observed tokens, i.e. the sum of r * N_r over all r).
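
As a rough illustration of that estimate, the sketch below computes the unseen-key probability mass from a frequency of frequencies table. It is Python, not Julia, `unseen_mass` is a made-up name, and it assumes N is the total number of observed tokens (the sum of r * N_r), which is the usual Good-Turing definition. The example counts are illustrative:

```python
def unseen_mass(n_r):
    """Good-Turing estimate of the total probability of unseen keys: N_1 / N.

    n_r maps a frequency r to N_r, the number of keys seen exactly r times.
    N is taken to be the total token count, sum(r * N_r) (an assumption
    about what "N" means in the comment above).
    """
    total = sum(r * count for r, count in n_r.items())
    if total == 0:
        raise ValueError("no observations to estimate from")
    return n_r.get(1, 0) / total

# Illustrative table: 12 keys seen once, 2 keys seen twice, so N = 16.
n_r = {1: 12, 2: 2}
p0 = unseen_mass(n_r)
```

Note that if no key occurs exactly once, this estimate assigns zero mass to unseen keys.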

There also might be code out there already that does some of the heavy lifting: http://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation#cite_ref-6

I'll get on it tomorrow

mtabor150 commented 9 years ago

is this done?

chucklesoclock commented 9 years ago

Almost. I need a little help with best-practice implementation.

On Wed, Apr 8, 2015 at 12:51 PM, Mark Tabor notifications@github.com wrote:

is this done?
