Trouble with epsilon in CFG

vadyushkins commented 4 years ago

$ python3
Python 3.7.5 (default, Apr 19 2020, 20:18:17) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyformlang.cfg import CFG
>>> gr = CFG.from_text('S -> a S b S\nS -> ')
>>> gr.generate_epsilon()
True
>>> new_gr = gr.to_normal_form()
>>> new_gr.generate_epsilon()
False

But I think that language generated by new_gr should be equal to language generated by gr. Including an epsilon.

Aunsiels commented 4 years ago

The transformation to normal form does not work when the language generates epsilon. I will write a warning about it. In this case, it normalise the language without epsilon.

I checked several sources about it, and it seems that the article on Wikipedia about the normal form is not correct. Then, the error propagated. It refers to Hopcroft's textbook introduction to automata theory languages and computation, but the definition is not the same. The original is what is called Chomsky reduced form on Wikipedia. I suggest you read Hopcroft's chapter about it and make your own opinion. I also checked Chomsky original paper, but the notation are too different from what we use now. So it is hard to conclude what really is Chomsky normal form. Generally, Hopcroft is the reference.

If you have another source, I take it. However, as most algorithms are from Hopcroft's book, changing the normal form might have impacts at other places.

gsvgit commented 4 years ago

Hello. @Aunsiels I guess you may be confused the way used for this transformation description. For example, look at the Parsing Techniques by Dick Grune (https://www.google.com/url?sa=t&source=web&rct=j&url=https://dickgrune.com/Books/PTAPG_1st_Edition/BookBody.pdf&ved=2ahUKEwjA05SloIbsAhXSl4sKHc9xBu0QFjAEegQIAhAB&usg=AOvVaw1rCCDoN8wpDI9548A_y3M6) In the section section 4.2.3 which describes transformation to CNF, you can see a step for epsilon rules elimination. But look into section 4.2.3.1 carefully (this section describes details of epsilon rules elimination). Here you can see, that in order to preserve the language the rule s->eps may be required. So, I think that @viabzalov is right. Moreover, main property of any grammar normalization (including CNF and gnf) is language preservation.

Aunsiels commented 4 years ago

Hi! Thank you for this additional source. I read it and it confirms my saying. For the book:

Two of the restrictions that we want to impose on the grammar are obvious by now: no unit rules and no ε-rules.[...] A grammar is in Chomsky Normal Form (CNF), when all rules either have the form A→a, or A→BC, where a is a terminal and A, B, and C are non-terminals. [...] There are no ε-rules in a CNF grammar

For this definition and all formal proofs, they cite Hopcroft and Ullman. Under these conditions, the language cannot generate epsilon anyway. Hopcroft clearly states that:

Theorem 7.16 If G is a CFG whose language contains at least one string other than ε then there is a grammar G1 in Chomsky Normal such that L(G1) = L(G) - {ε}.

I am a bit afraid to change this definition of the normal form and it might impact other algorithms.

To me, the main goal of the normal form is to make other algorithms effective. If the grammar generates epsilon, it is handled by the algorithms using the normal form, as it is done in Hopcroft.

I hope it answers your question.

Aunsiels / pyformlang

Trouble with epsilon in CFG #2