bmeaut / python_nlp_2018_spring

MIT License

homework1 #1

Open NadiaHajjej opened 6 years ago

NadiaHajjej commented 6 years ago

I have 2 questions:

Q1 - Regarding the first exercise (edit distance), I would like to be sure that I understood the idea behind the exercise correctly. What would the result be in the case of abc and bca?

Q2 - Regarding the third exercise (text generation), and specifically this point: "If the generated text ends with an (N−1)-gram that does not occur in the training data, generate the next character from the full character or ngram distribution." Would you please explain it to me?

juditacs commented 6 years ago

Q1 - The easiest way of transforming abc to bca is to delete a, then insert a at the end of the string. Both operations cost 1, so the overall edit distance is 2.
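For reference, the unmodified distance can be checked with a short dynamic-programming sketch. The function name `levenshtein` is only illustrative and is not part of the homework skeleton; insertion, deletion and substitution each cost 1 here.

```python
def levenshtein(source: str, target: str) -> int:
    """Plain Levenshtein distance with unit costs (illustrative sketch)."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            # Indicator function: 0 if the characters match, 1 otherwise.
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("abc", "bca"))  # 2
```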

Q2 - Your task here is to generate text one character at a time. You have to condition on the last N-1 characters, but if you haven't encountered that (N-1)-gram in the training data, you don't have a distribution for the next character. There are various ways of solving this, such as drawing from the global n-gram distribution or the unigram distribution. It is up to you which one to use, as long as it makes sense (for example, adding A all the time is not a good solution). I hope that clears it up.
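One possible way to handle the unseen context is sketched below, assuming N >= 2 and a fallback to the unigram (single character) distribution. The names `train`, `sample` and `generate` are hypothetical and do not come from the homework; any other sensible fallback would work just as well.

```python
import random
from collections import Counter, defaultdict


def train(text, n):
    """Count which character follows each (n-1)-character context."""
    context_counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, next_char = text[i:i + n - 1], text[i + n - 1]
        context_counts[context][next_char] += 1
    unigram_counts = Counter(text)  # fallback distribution
    return context_counts, unigram_counts


def sample(counter, rng):
    """Draw one character proportionally to its count."""
    chars = list(counter)
    weights = list(counter.values())
    return rng.choices(chars, weights=weights, k=1)[0]


def generate(seed, length, n, context_counts, unigram_counts, rng=None):
    """Extend `seed` one character at a time up to `length` characters."""
    rng = rng or random.Random()
    text = seed
    while len(text) < length:
        context = text[-(n - 1):]
        if context in context_counts:
            distribution = context_counts[context]
        else:
            # Unseen (n-1)-gram: fall back to the unigram distribution.
            distribution = unigram_counts
        text += sample(distribution, rng)
    return text


context_counts, unigram_counts = train("abababab", 3)
print(generate("ab", 6, 3, context_counts, unigram_counts))
```

With this toy training string every seen context has exactly one continuation, so a seed of `ab` is extended deterministically, while a seed such as `zz` triggers the fallback at every step.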

NadiaHajjej commented 6 years ago

Thank you for the explanation. In the first exercise, however, we are asked to create a modified version of Levenshtein distance which discounts letters that are close to each other on the English keyboard, which is why the distance between s and w is 0.5. I am a little confused: according to the modified version, what should the distance be for abc and bca? Otherwise, could you please give me another example that would help me understand the process better? Thank you again.

juditacs commented 6 years ago

The modified Levenshtein should only affect the indicator function, which checks if two characters are the same (see Wikipedia for more on this). The distance between abc and bca is 2. None of these characters are close to each other on the English keyboard, so the distance in this case is no different from the normal Levenshtein distance.
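A minimal sketch of this idea follows. It assumes a QWERTY layout in which two keys count as close when they are next to each other in a row or touch diagonally across adjacent rows; the exact neighbor definition expected by the homework may differ. `QWERTY_ROWS`, `build_neighbors`, `substitution_cost` and `keyboard_levenshtein` are illustrative names, and only lowercase letters are handled.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows):
    """Return the set of ordered key pairs treated as close (assumption)."""
    position = {ch: (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row)}
    neighbors = set()
    for a, (row_a, col_a) in position.items():
        for b, (row_b, col_b) in position.items():
            if a == b:
                continue
            same_row = row_a == row_b and abs(col_a - col_b) == 1
            # Each lower row is shifted right, so key (r, c) touches
            # keys (r + 1, c - 1) and (r + 1, c).
            b_below = row_b == row_a + 1 and col_b in (col_a - 1, col_a)
            b_above = row_a == row_b + 1 and col_a in (col_b - 1, col_b)
            if same_row or b_below or b_above:
                neighbors.add((a, b))
    return neighbors


NEIGHBORS = build_neighbors(QWERTY_ROWS)


def substitution_cost(a, b):
    """Modified indicator: 0 if equal, 0.5 if close on the keyboard, else 1."""
    if a == b:
        return 0.0
    return 0.5 if (a, b) in NEIGHBORS else 1.0


def keyboard_levenshtein(source, target):
    """Levenshtein distance where only the substitution cost is changed."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j - 1] + substitution_cost(s_char, t_char),
                previous[j] + 1,      # deletion
                current[j - 1] + 1,   # insertion
            ))
        previous = current
    return previous[-1]


print(keyboard_levenshtein("s", "w"))      # 0.5
print(keyboard_levenshtein("abc", "bca"))  # 2.0
```

Insertions and deletions keep their cost of 1, so a pair of strings is only discounted when the cheapest alignment substitutes neighboring keys.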