Homework 1 - text generation

bmeaut / python_nlp_2017_fall

Material for the 2017 fall edition of the course "Introduction to Python and Human Language Technologies" at BME AUT

MIT License


Homework 1 - text generation #3

Open sevinjyolchuyeva opened 7 years ago

sevinjyolchuyeva commented 7 years ago

Could you please explain how to continue with Exercise 3.2 (Define a text generator function)?

For word = 'abcabcda', toy_freqs = count_ngram_freqs("abcabcda", 3) gives: {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}

How should we use the probabilities? A probability is a number in [0, 1], but under what condition should we apply it?

DanielLaszlo commented 7 years ago

Based on your example, let's assume that you have the following 2-grams: {'ab': 2, 'bc': 2, 'ca': 1, 'cd': 1, 'da': 1}

So if, for example, you have the sequence 'bc' and you want to generate the next character, you just look at the 3-grams that start with 'bc'. These are: {'bca': 1, 'bcd': 1}. Among the 2-grams you also have {'bc': 2}, so the probability of generating the character 'a' given the sequence 'bc' is defined as:

P(a|bc) = freq(bca) / freq(bc) = 1 / 2

Similarly, the probability of generating the character 'd' given 'bc' is:

P(d|bc) = freq(bcd) / freq(bc) = 1 / 2

So whenever the already generated sequence ends with 'bc', you should generate the character 'a' half the time and the character 'd' the other half.
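The ratio above can be checked with a few lines of Python. This is only a minimal sketch: `count_ngrams` and `next_char_probs` are hypothetical helper names made up for illustration, not the functions required by the exercise.

```python
from collections import Counter

def count_ngrams(text, n):
    """Count every contiguous substring of length n in text (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def next_char_probs(text, context):
    """Estimate P(c | context) as freq(context + c) / freq(context)."""
    n = len(context) + 1
    longer = count_ngrams(text, n)       # e.g. 3-gram counts
    shorter = count_ngrams(text, n - 1)  # e.g. 2-gram counts
    return {
        ngram[-1]: count / shorter[context]
        for ngram, count in longer.items()
        if ngram.startswith(context)
    }

probs = next_char_probs("abcabcda", "bc")
print(probs)  # {'a': 0.5, 'd': 0.5}
```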

sevinjyolchuyeva commented 7 years ago

Thanks for answering. In that situation, should the function output be 'bcd' or 'bca'? That is, we had 2-grams and we generated 3-grams. Is that right? For the exercise, does gen = generate_text("abc", 5, toy_freqs, 3) mean that we should generate 5-grams given 3-grams (or 2-grams)?

juditacs commented 7 years ago

Not exactly.

5 is the length of the desired output. It could be much longer than 5 and you should test your solution for larger values such as 200 or 300.

N is the base of the generation. If N=3 and the string ends with bc, then the distribution used for sampling the next character is the distribution of all trigrams (3-grams) that start with bc. Changing @DanLszl's example a little bit, let's assume that we find the following trigrams starting with bc:

{'bca': 2, 'bcb': 1, 'bcc': 1}

In that case, you should generate a with probability 0.5, and b and c with probability 0.25 each (so a is generated 1 out of 2 times, and b and c 1 out of 4 times each, on average). I changed the example to demonstrate that uniform sampling is NOT correct: not all trigrams are equally probable.
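One possible way to sample in proportion to the counts is the standard library's `random.choices`, which accepts a `weights` argument. The sketch below is illustrative only: `sample_next_char` is a hypothetical name, and the seeded `random.Random(0)` is used just to make the experiment repeatable.

```python
import random

# Trigram counts from the example above.
trigram_counts = {'bca': 2, 'bcb': 1, 'bcc': 1}

def sample_next_char(counts, context, rng=random):
    """Pick the next character, weighting each candidate n-gram by its count."""
    candidates = [ng for ng in counts if ng.startswith(context)]
    weights = [counts[ng] for ng in candidates]
    # random.choices normalizes the weights, so raw counts are fine.
    chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return chosen[-1]

rng = random.Random(0)  # seeded only for repeatability
draws = [sample_next_char(trigram_counts, 'bc', rng) for _ in range(10000)]
share_a = draws.count('a') / len(draws)
print(share_a)  # should be close to 0.5
```

Picking uniformly among the three candidates would instead give each character a share of about one third, which is the mistake the example warns about.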

You don't need to and shouldn't generate longer n-grams than N.
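Putting the pieces together, the generation loop might look roughly like the sketch below. It is not the reference solution. It assumes that the second argument is the total length of the returned string (including the starting text), that the extra `rng` parameter is acceptable (it is not part of the exercise's signature), and that generation simply stops early when no n-gram starts with the current context.

```python
import random

def generate_text(starting_text, target_len, frequencies, N, rng=random):
    """Extend starting_text one character at a time using N-gram counts."""
    text = starting_text
    while len(text) < target_len:
        # The context is the last N-1 characters generated so far.
        context = text[-(N - 1):] if N > 1 else ""
        candidates = [ng for ng in frequencies if ng.startswith(context)]
        if not candidates:
            break  # assumption: give up on a dead end
        weights = [frequencies[ng] for ng in candidates]
        text += rng.choices(candidates, weights=weights, k=1)[0][-1]
    return text

toy_freqs = {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}
gen = generate_text("abc", 5, toy_freqs, 3, random.Random(0))
print(gen)  # either 'abcab' or 'abcda'
```

With these toy counts no trigram starts with 'da', so longer outputs can end early; a larger training text makes such dead ends much rarer.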

sevinjyolchuyeva commented 7 years ago

Thank you so much.