bmeaut / python_nlp_2017_fall

Material for the 2017 fall edition of the course "Introduction to Python and Human Language Technologies" at BME AUT
MIT License

A question regarding CFG morphological analysis #11

Open dobijan opened 6 years ago

dobijan commented 6 years ago

In the third homework, one of the exercises is about implementing morphological analysis, and it is unclear to me how to proceed. When we parse sentences, the whitespace determines the tag boundaries. But if all we get is a single word, what are the boundaries? Letters? Should we write a CFG grammar that accepts the input one letter at a time? Or are we supposed to break the word into chunks (leg, obb, at, et, etc.) based on some logic, and then parse those? The Tree provided as an example implies the latter, but that chunking is basically almost the parsing itself. Besides, without the underlying automaton, the 'tag' boundaries in the surface form can be ambiguous, so the 'some logic' can be very complex, even a full FST. In short, I am simply not sure what the exercise demands from me.

DavidNemeskey commented 6 years ago

Whitespace doesn't determine the tag boundaries. If you look closely, you will see that we called .split() on the inputs to the parser exactly for that reason. The parse() method expects a sequence of tokens, so for the Arithmetics exercise these calls are equivalent:

```python
aparser.parse('1 - 2 / ( 3 - 4 )'.split())  # list
aparser.parse(['1', '-', '2', '/', '(', '3', '-', '4', ')'])  # same as the previous one, actually
aparser.parse('1-2/(3-4)')  # string
```
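The reason the third call works is worth spelling out: iterating a plain Python string yields its characters, and in this grammar every token happens to be a single character. A quick pure-Python check (no parser needed):

```python
# All three inputs above reduce to the same token sequence:
# split() on the spaced string, an explicit list, and the
# character iteration a parser performs on a bare string.
tokens_from_split = '1 - 2 / ( 3 - 4 )'.split()
tokens_from_list = ['1', '-', '2', '/', '(', '3', '-', '4', ')']
tokens_from_string = list('1-2/(3-4)')

assert tokens_from_split == tokens_from_list == tokens_from_string
print(tokens_from_string)
# → ['1', '-', '2', '/', '(', '3', '-', '4', ')']
```

This only holds because no token is longer than one character; a grammar with multi-character tokens would need real tokenization first.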

So you can see that some parsing is going on. Two hints:

dobijan commented 6 years ago

I decided to consider letters as base tokens and wrote a letter-based grammar. The resulting parse tree was a little cluttered, so I postprocessed it to collapse sibling letter terminals into word parts. It works; I don't know whether this is the best solution, but the alternative seemed far more complex (finding the word parts in the word without an FSA would require numerous, long regexes). This way, at least, only the grammar defines the word parts, and that information is not split between two distinct parts of the program.
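The collapsing step described above can be sketched without any parser library, assuming trees are modeled as (label, child, ...) tuples with one-character strings as leaves. The morph labels and the example word segmentation are invented for illustration, not taken from the homework:

```python
# Collapse runs of single-letter leaves under each node into whole
# morphs, bottom-up. Trees: (label, child, child, ...); leaves: str.
def collapse(tree):
    if isinstance(tree, str):
        return tree
    label, *children = tree
    merged = []
    for child in children:
        child = collapse(child)
        # fuse this leaf with the preceding run of leaves, if any
        if isinstance(child, str) and merged and isinstance(merged[-1], str):
            merged[-1] += child
        else:
            merged.append(child)
    return (label, *merged)

# Hypothetical letter-level analysis of the segmentation leg|nagy|obb|ak|at
letter_tree = ('Word',
               ('Superlative', 'l', 'e', 'g'),
               ('Stem', 'n', 'a', 'g', 'y'),
               ('Comparative', 'o', 'b', 'b'),
               ('Plural', 'a', 'k'),
               ('Accusative', 'a', 't'))

print(collapse(letter_tree))
# → ('Word', ('Superlative', 'leg'), ('Stem', 'nagy'),
#    ('Comparative', 'obb'), ('Plural', 'ak'), ('Accusative', 'at'))
```

The same traversal would work on an nltk.Tree with minor changes, since its nodes also expose a label and an iterable of children.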

I'm curious whether the others took a different approach.