bmeaut / python_nlp_2017_fall

Material for the 2017 fall edition of the course "Introduction to Python and Human Language Technologies" at BME AUT
MIT License

homework 1 exercise 3.2 #4

Closed alialsaeedi19 closed 6 years ago

alialsaeedi19 commented 6 years ago

I didn't quite get what this exercise wants. For example, with the call gen = generate_text("abc", 5, toy_freqs, 3), we take the last N-1 characters, which is "bc".

Then the probability of a character char being generated after "bc" is P(char | bc) = freq(bc + char) / freq(bc).

Then we check this char against each possibility in the given dictionary. In our example the dictionary is {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}, so we only check this condition for "bca". And if we don't have any "bc...", then we take the whole "abc" and check the 4-gram dictionary's frequencies the same way.
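For concreteness, here is a quick numeric check of that formula on the toy dictionary (a sketch only; it takes freq(bc) to be the total count of 3-grams starting with "bc", which is one way to get the denominator):

```python
# Toy 3-gram counts from the example above.
toy_freqs = {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}

# freq(bc): total count of all 3-grams starting with "bc" (here bca and bcd, 1 + 1 = 2).
freq_bc = sum(c for gram, c in toy_freqs.items() if gram.startswith('bc'))

p_a_given_bc = toy_freqs['bca'] / freq_bc   # 1 / 2 = 0.5
p_d_given_bc = toy_freqs['bcd'] / freq_bc   # 1 / 2 = 0.5
```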

correct me if anything is wrong

But how can we get freq(bc)? From the full text "abcabcda"? Also, it says this function returns only one character at a time, so for our "bc" example above, will it return only "a", or the probability of "a", or what? And how can the result be 5 characters long if it returns one character at a time? Should we put it in a loop?

Thanks in advance for your help :)

juditacs commented 6 years ago

Your understanding of the computation of the conditional probability is correct.

In your example two 3-grams start with bc, so the distribution you want to sample from is {bca: 0.5, bcd: 0.5}, since both have frequency 1. Your sampler should choose bca approximately half of the time and bcd the other half. If you sample bca, then the next character of the generated string is a (so that it ends in bca).

There is a problem when you end up with a string that ends in some (N-1)-gram that did not appear in the training data. You can solve this any way you want (interpolation, sampling the full distribution) other than random sampling from the alphabet.

We generate one character at a time, so yes, you should use a loop.
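Putting those pieces together, here is a minimal sketch of such a loop, following the call generate_text("abc", 5, toy_freqs, 3) from the question. The argument order, the meaning of the length argument, and the fallback for unseen contexts (sampling the full N-gram distribution) are assumptions for illustration, not the official solution:

```python
import random

def generate_text(start, length, freqs, N):
    text = start
    while len(text) < length:            # assumes `length` includes the start string
        context = text[-(N - 1):]
        # Distribution over N-grams that continue the current context.
        candidates = {g: c for g, c in freqs.items() if g.startswith(context)}
        if not candidates:
            # Unseen context: fall back to the full N-gram distribution.
            candidates = freqs
        grams = list(candidates)
        weights = list(candidates.values())
        chosen = random.choices(grams, weights=weights, k=1)[0]
        text += chosen[-1]                # append the last character of the sampled N-gram
    return text

toy_freqs = {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}
print(generate_text("abc", 5, toy_freqs, 3))
```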

alialsaeedi19 commented 6 years ago

Thanks a lot for your help :)

What about freq(bc) by itself, which needs a 2-gram dictionary? How should we compute it? It seems we need the original text for that.

juditacs commented 6 years ago

I'm not sure I understand your question, but you should build the frequency dictionary from Wikipedia articles.

alialsaeedi19 commented 6 years ago

I mean that here P(c | ab) = freq(abc) / freq(ab). We can get freq(abc) easily from the dictionary, but for freq(ab) we need a 2-gram dictionary of the original text ("abcabcda").

juditacs commented 6 years ago

You build the 3-gram dictionary as well (Exercise 3.1).

You don't actually have to compute the bigram frequencies; you can derive the distribution of the characters after ab by filtering the trigram dictionary, as in the sketch below.
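A minimal sketch of that filtering trick on the toy counts (the dictionary name is just for illustration):

```python
# Derive the distribution of characters after "ab" straight from the trigram
# counts; the sum of the matching counts plays the role of freq(ab).
trigram_freqs = {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}

after_ab = {g: c for g, c in trigram_freqs.items() if g.startswith('ab')}
total = sum(after_ab.values())                         # acts as freq(ab)
dist = {g[-1]: c / total for g, c in after_ab.items()}
print(dist)                                            # {'c': 1.0} for these counts
```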

alialsaeedi19 commented 6 years ago

Oh okay, thanks a lot!