Closed alialsaeedi19 closed 6 years ago
Your understanding of the computation of the conditional probability is correct. In your example, two 3-grams start with bc, so the distribution you want to sample from is {bca: 0.5, bcd: 0.5}, since they both have frequency 1. Your sampler should choose bca approximately half of the time and bcd the other half of the time. If you sample bca, then the next character of the generated string is a (so that it ends in bca).
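To make the sampling step concrete, here is a minimal sketch in Python. The helper names next_char_distribution and sample_next_char are hypothetical (they are not part of the exercise), and the sketch assumes the context has been seen in the training data.

```python
import random

# Toy 3-gram frequencies for the text "abcabcda" (from the example in this thread).
toy_freqs = {"abc": 2, "bca": 1, "cab": 1, "bcd": 1, "cda": 1}


def next_char_distribution(context, freqs):
    """Return {char: probability} for the characters that follow `context`.

    Hypothetical helper: keeps only the n-grams whose first N-1 characters
    equal `context` and normalises their counts.
    """
    counts = {gram[-1]: count for gram, count in freqs.items() if gram[:-1] == context}
    total = sum(counts.values())
    return {ch: count / total for ch, count in counts.items()}


def sample_next_char(context, freqs, rng=random):
    """Draw one character according to the conditional distribution.

    Assumes `context` occurs in `freqs`; an unseen context is not handled here.
    """
    dist = next_char_distribution(context, freqs)
    chars = list(dist)
    weights = [dist[ch] for ch in chars]
    return rng.choices(chars, weights=weights, k=1)[0]


print(next_char_distribution("bc", toy_freqs))
print(sample_next_char("bc", toy_freqs))
```

For the context bc, the distribution contains only a and d, each with probability 0.5, which matches the reasoning above.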
There is a problem when you end up with a string that ends in some (N-1)-gram that did not appear in the training data. You can solve this any way you want (interpolation, sampling from the full distribution), as long as it is not uniform random sampling from the alphabet.
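One possible reading of "sampling the full distribution" is sketched below: if no n-gram starts with the current context, draw an n-gram from the whole dictionary, weighted by frequency, and use its last character. This is only an assumption about what is meant, not the required solution, and the function name sample_next_char_with_fallback is hypothetical.

```python
import random


def sample_next_char_with_fallback(context, freqs, rng=random):
    """Sample the next character, falling back to the full n-gram
    distribution when `context` was never seen as a prefix.

    Hypothetical helper; the fallback strategy is an assumption.
    """
    # N-grams whose first N-1 characters match the context.
    candidates = {gram: count for gram, count in freqs.items() if gram[:-1] == context}
    if not candidates:
        # Unseen context: fall back to all n-grams, still weighted by frequency.
        candidates = freqs
    grams = list(candidates)
    weights = [candidates[gram] for gram in grams]
    # The generated character is the last character of the sampled n-gram.
    return rng.choices(grams, weights=weights, k=1)[0][-1]


toy_freqs = {"abc": 2, "bca": 1, "cab": 1, "bcd": 1, "cda": 1}
print(sample_next_char_with_fallback("da", toy_freqs))  # "da" never starts a 3-gram here
```

With this fallback, frequent characters remain more likely than rare ones, which is the difference from picking uniformly from the alphabet.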
We generate one character at a time, so yes, you should use a loop.
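A rough sketch of such a loop follows. The call shape generate_text(start, length, freqs, n) is taken from the example in this thread, but the meaning of the arguments is an assumption: here length is treated as the total length of the returned string, and an unseen context falls back to sampling from all n-grams.

```python
import random


def generate_text(start, length, freqs, n, rng=random):
    """Extend `start` one character at a time until it is `length` long.

    Assumptions (not confirmed by the exercise text): `length` is the total
    length of the result, and unseen contexts fall back to the full n-gram
    distribution.
    """
    text = start
    while len(text) < length:
        # The last N-1 characters form the context (empty for n == 1).
        context = text[-(n - 1):] if n > 1 else ""
        candidates = {gram: count for gram, count in freqs.items() if gram[:-1] == context}
        if not candidates:
            candidates = freqs  # assumed fallback for unseen contexts
        grams = list(candidates)
        weights = [candidates[gram] for gram in grams]
        # Append the last character of the sampled n-gram.
        text += rng.choices(grams, weights=weights, k=1)[0][-1]
    return text


toy_freqs = {"abc": 2, "bca": 1, "cab": 1, "bcd": 1, "cda": 1}
gen = generate_text("abc", 5, toy_freqs, 3)
print(gen)
```

With the toy dictionary, the first generated character is a or d, and the one after it is then fully determined, so the result is either abcab or abcda.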
Thanks a lot for your help :)
What about freq(bc) on its own, which needs a 2-gram dictionary? How should we compute it, given that we need the original text for that?
I'm not sure I understand your question but you should build the frequency dictionary from Wikipedia articles.
I mean that in P(c|ab) = freq(abc) / freq(ab), we can get freq(abc) easily from the dictionary, but for freq(ab) we need a 2-gram dictionary of the original text (abcabcda).
You build the 3-gram dictionary as well (Exercise 3.1).
You don't actually have to compute the bigram frequencies; you can derive the distribution of the characters after ab by filtering the trigram dictionary.
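As an illustration, the sketch below computes P(c|ab) from the trigram dictionary alone: the denominator is the sum of the counts of all trigrams that start with ab. The function name conditional_probability is hypothetical, and returning 0.0 for an unseen context is an assumption made for the sketch.

```python
def conditional_probability(char, context, trigram_freqs):
    """Estimate P(char | context) using only the trigram dictionary.

    Hypothetical helper. The context count is obtained by summing the
    counts of every trigram that begins with `context`, so no separate
    bigram dictionary is needed.
    """
    context_count = sum(
        count for gram, count in trigram_freqs.items() if gram.startswith(context)
    )
    if context_count == 0:
        return 0.0  # assumed convention for a context that was never seen
    return trigram_freqs.get(context + char, 0) / context_count


toy_freqs = {"abc": 2, "bca": 1, "cab": 1, "bcd": 1, "cda": 1}
print(conditional_probability("c", "ab", toy_freqs))
print(conditional_probability("a", "bc", toy_freqs))
```

For the text abcabcda, ab is always followed by c, while bc is followed by a once and by d once, so the two values are 1.0 and 0.5.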
Oh OK, thanks a lot!
I didn't quite get what this exercise wants. For example, given gen = generate_text("abc", 5, toy_freqs, 3), we take the last N-1 characters, which is "bc".
Then the probability of a character char being generated after "bc" is P(char | bc) = freq(bc + char) / freq(bc).
Then we check each possible char against the given dictionary. In our example the dictionary is {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}, and we check this condition only for "bca". If there is no entry starting with "bc", then we take the whole "abc" and check its frequency in the 4-gram dictionary in the same way.
Correct me if anything is wrong.
But how can we get freq(bc)? From the full text "abcabcda"? The exercise also says that this function returns only one character at a time, so in our example, for "bc", will it return only the a, or the probability of a? And how can the result be 5 characters long if it returns one character at a time? Should we put it in a loop?
Thanks in advance for your help :)