bmeaut / python_nlp_2017_fall

Material for the 2017 fall edition of the course "Introduction to Python and Human Language Technologies" at BME AUT
MIT License
12 stars 13 forks source link

Flag diacritics #14

Open balintdom opened 6 years ago

balintdom commented 6 years ago

Hello, I can't understand what flag diacritics do. Could you give me a simple working example, and a step-by-step explanation about what they are doing? Unfortunately this time, google is not really my friend...

matekatona commented 6 years ago

the example near the end of this page helped me a lot: https://fomafst.github.io/morphtut.html

DavidNemeskey commented 6 years ago

It is best to think of flag diacritics as variable( setting/reading instruction)s in a programming language. FSAs/FSTs are memory-less in the sense that the current state and the current input determines the new state (and the output); previous inputs and states have no influence. This can be a problem, when you need to remember something.

An example would be the language "ba+b|ca+c". Here you must remember what the first symbol was, and only accept the string if it ends with the same symbol. Ideally, you would want to do something like:

Lexicon Root
b [input = b] As ;  ! Store 'b' as the variable 'input'
c [input = c] As ;  ! Store 'c' as the variable 'input'

Lexicon As
a As ;
a End ;

Lexicon End
[input] # ;  ! Load the content of the variable 'input' and expect that again on the (input) tape

Only one problem: FSAs don't have a memory so we don't have any variables we could "save" the input to / "load" it from. So one valid lexc solution would be:

Lexicon Root
b As_after_b ;
c As_after_c ;

Lexicon As_after_b
a As_after_b ;
a End_with_b ;

Lexicon As_after_c
a As_after_c ;
a End_with_c ;

Lexicon End_with_b
b # ;

Lexicon End_with_c
c # ;

You can see the problem already: since you don't have any memory, you have to encode the information that you read b or c into the state space; or in lexc, into the continuations that you use. In this case, you had to split End into the two _Endwith*_ lexica so that you can expect b in one and c in the other. But that's not enough, you also had to split As into two as well, because that is the only way you can encode what you read previously. And if you have not only b and c, but e.g. [a-z], then you would need to split the "ideal" lexica As and End into 26 parts...

So enter flag diacritics. Just think of them as operations on a variable:

Lexicon Root b@P.INPUT.B@ As ; ! Store 'b' as the variable 'input' c@P.INPUT.C@ As ; ! Store 'c' as the variable 'input'

Lexicon As a As ; a End ;

Lexicon End b@R.INPUT.B@ # ; ! if INPUT == B c@R.INPUT.C@ # ; ! if INPUT == C



You can see it is almost as simple as the ideal solution. One important difference is that the `B` in `@P.INPUT.b@` (etc.) has nothing to do with the `b` we read on the tape, because the values of the flag variables live in a different namespace. So we could have called `B` `X` and `C` `Y`, and it would have worked the same.

In this particular case btw. we could have used `@U...@` (unification) instead of both `@P...@`  and `@R...@` -- `U` on a variable is `P` the first time and `R` afterwards.

Hope this helps.
balintdom commented 6 years ago

Yes, it is really helpful. Thank you!