Open sevinjyolchuyeva opened 7 years ago
Based on your example, let's assume that you have the following 2-grams:
{'ab': 2, 'bc': 2, 'ca': 1, 'cd': 1, 'da': 1}
So if, for example, you have the sequence 'bc' and you want to generate the next character, you just look at the 3-grams that start with 'bc'. These are:
{'bca': 1, 'bcd': 1}
and of course you also have among the 2-grams:
{'bc': 2}
so the probability of generating the character 'a' given the sequence 'bc' is defined as:
P(a|bc) = freq(bca) / freq(bc) = 1 / 2
Similarly the probability of generating the character 'd' given 'bc' is:
P(d|bc) = freq(bcd) / freq(bc) = 1 / 2
So whenever the already generated sequence ends with 'bc', you should generate the character 'a' half the time and the character 'd' the other half.
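The calculation above can be sketched in a few lines of Python. This is only an illustration: `count_ngrams` and `cond_prob` are hypothetical helper names, not the functions from the exercise, and the string "abcabcda" is assumed to be the text behind the counts above.

```python
from collections import Counter

def count_ngrams(text, n):
    """Count all overlapping character n-grams in text (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

text = "abcabcda"  # assumed source of the toy counts
bigrams = count_ngrams(text, 2)
trigrams = count_ngrams(text, 3)

def cond_prob(char, prefix):
    """P(char | prefix) = freq(prefix + char) / freq(prefix)."""
    return trigrams[prefix + char] / bigrams[prefix]

p_a = cond_prob("a", "bc")  # freq('bca') / freq('bc')
p_d = cond_prob("d", "bc")  # freq('bcd') / freq('bc')
print(p_a, p_d)
```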
Thanks for answering. In that situation, should the function output be 'bcd' or 'bca'? Does that mean we had 2-grams and we generated 3-grams? For the exercise, does gen = generate_text("abc", 5, toy_freqs, 3) mean that we should generate 5-grams given 3-grams (or 2-grams)?
Not exactly.
5 is the length of the desired output. It could be much longer than 5 and you should test your solution for larger values such as 200 or 300.
N is the base of the generation. If N=3 and the string ends with 'bc', then the distribution used for sampling the next character is the distribution of all trigrams (3-grams) that start with 'bc'. Changing @DanLszl's example a little, let's assume that we find the following trigrams starting with 'bc':
{'bca': 2, 'bcb': 1, 'bcc': 1}
Then you should generate 'a' with probability 0.5, and 'b' and 'c' with probability 0.25 each (so 'a' is generated 1 out of 2 times, and 'b' and 'c' 1 out of 4 times each, on average). I changed the example to demonstrate that uniform sampling is NOT correct: not all trigrams are equally probable.
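As a rough illustration of frequency-weighted (non-uniform) sampling, here is a small sketch using `random.choices` from the standard library. The helper name `sample_next_char` and the fixed seed are assumptions made for the example, not part of the exercise.

```python
import random
from collections import Counter

# Trigram counts from the modified example above.
trigram_freqs = {'bca': 2, 'bcb': 1, 'bcc': 1}

def sample_next_char(prefix, freqs, rng=random):
    """Pick the next character, weighting each n-gram by its frequency."""
    candidates = [g for g in freqs if g.startswith(prefix)]
    weights = [freqs[g] for g in candidates]
    chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return chosen[-1]  # the last character of the chosen n-gram

rng = random.Random(0)  # seeded only so that repeated runs agree
draws = Counter(sample_next_char("bc", trigram_freqs, rng) for _ in range(10000))
print(draws)  # 'a' should come up about twice as often as 'b' or 'c'
```

Passing the raw counts as weights is enough here; `random.choices` normalizes them internally, so there is no need to convert them to probabilities first.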
You don't need to and shouldn't generate longer n-grams than N.
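Putting the pieces together, one possible shape for the generator is sketched below. It is a sketch under several assumptions, not the reference solution: the argument order follows the generate_text("abc", 5, toy_freqs, 3) call from the question, the length is assumed to include the starting text, the `rng` parameter is added only for reproducibility, and stopping early when no n-gram matches is one choice among several.

```python
import random

def generate_text(start, length, freqs, n, rng=random):
    """Extend start one character at a time until it has `length` characters.

    Assumptions: `length` counts the starting text, and generation stops
    early if no n-gram in `freqs` begins with the current last n-1 characters.
    """
    text = start
    while len(text) < length:
        prefix = text[-(n - 1):] if n > 1 else ""
        candidates = [g for g in freqs if g.startswith(prefix)]
        if not candidates:
            break  # dead end: nothing in freqs continues this prefix
        weights = [freqs[g] for g in candidates]
        text += rng.choices(candidates, weights=weights, k=1)[0][-1]
    return text

toy_freqs = {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}
gen = generate_text("abc", 5, toy_freqs, 3, random.Random(1))
print(gen)  # either 'abcab' or 'abcda', depending on the random draw
```

Note that only n-grams of length N are ever consulted: each step looks at the last N-1 characters and appends a single new one. With these toy counts the text can hit a dead end after 'da', since no trigram starts with it, which is why the sketch stops early in that case.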
Thanks so much.
Could you please continue with Exercise 3.2 (Define a text generator function)?
word = 'abcabcda'
toy_freqs = count_ngram_freqs("abcabcda", 3) gives {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}
How should we use the probability? It is a number in [0, 1], but under what condition do we use it?