franko / luajit-lang-toolkit

A Lua bytecode compiler written in Lua itself for didactic purposes or for new language implementations

Escape character not preserved in generated Lua #26

Open gnois opened 9 years ago

gnois commented 9 years ago

I figured it's better to file issues here rather than in my fork, because the issue links in the commit history got confused and pointed to the issue with the same number in this repo.

print("\n\r")

generates this Lua code as output:

print("\
\13")

The escapes should be preserved.
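One way a code generator could re-escape a string value so that `\n` and `\r` come out as written is sketched below. The helper name `quote_string` and the escape table are assumptions for illustration, not code from luajit-lang-toolkit.

```lua
-- Hypothetical helper, not part of luajit-lang-toolkit: quote a string value
-- as a Lua literal, using named escapes where Lua defines them.
local named = {
    ["\a"] = "\\a", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n",
    ["\r"] = "\\r", ["\t"] = "\\t", ["\v"] = "\\v",
    ["\\"] = "\\\\", ["\""] = "\\\"",
}

local function quote_string(s)
    -- Rewrite every control character, double quote, and backslash.
    local body = s:gsub('[%c"\\]', function(c)
        -- Fall back to a three-digit decimal escape so that a digit
        -- following it in the string cannot be absorbed into the escape.
        return named[c] or string.format("\\%03d", c:byte())
    end)
    return '"' .. body .. '"'
end

print(quote_string("\n\r"))  --> "\n\r"
```

This works from the already-decoded value, so it can only pick one canonical spelling per character; it cannot tell whether the source said `\n` or `\10`.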

gnois commented 9 years ago

Hi Franko,

Just to let you know, I had to redo the fix from 9342e5e in the lexer instead of the code generator, because I realized the lexer has already transformed \ddd escapes by the time it is done. With the included test case, luajit-lang-toolkit gives

print("alo\n123")
print("\\\n\r\"''\0")
print("\3\vD\t\"'")
print("■\v\\\\\"\\'\f")

while this fix gives code that follows the source more literally:

print("\97lo\10\04923")
print("\\\n\r\"'\'\0")
print("\3\v\x44\t\"\'")
print("\254\v\\\\\"\\\'\f")

Anyway, I am not sure whether this is relevant for luajit-lang-toolkit, because the runtime output is the same either way and the error message deviates from luajit's. For example:

print('\3\v\x4h4\t\"\'')

luajit and luajit-lang-toolkit give

invalid escape sequence near ''♥♂'

while my fix gives

invalid escape sequence near '\3\v\x4

franko commented 9 years ago

Hi,

I've had a look at the commit; the idea is interesting. Basically, in the lexer you just save the string in its original form, without replacing the escape sequences. That makes the code generator trivial, because you just output the string as it was written, with no need to interpret the escape sequences.
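A minimal sketch of that idea, assuming a lexer that scans the literal without decoding it. The function `read_raw_string` is hypothetical and simplified (it ignores long-bracket strings and the `\z` escape); it is not the toolkit's lexer.

```lua
-- Hypothetical sketch: scan a quoted literal starting at `pos` and return
-- its raw source text (delimiters included) plus the position after it.
local function read_raw_string(src, pos)
    local quote = src:sub(pos, pos)
    assert(quote == '"' or quote == "'", "expected a string delimiter")
    local i = pos + 1
    while i <= #src do
        local c = src:sub(i, i)
        if c == "\\" then
            i = i + 2          -- skip the backslash and the escaped character
        elseif c == quote then
            return src:sub(pos, i), i + 1
        elseif c == "\n" then
            error("unfinished string")
        else
            i = i + 1
        end
    end
    error("unfinished string")
end

local raw = read_raw_string([[print("\97lo\10\04923")]], 7)
print(raw)  --> "\97lo\10\04923"
```

A Lua-to-Lua code generator could then emit `raw` unchanged, while any backend that needs the actual bytes would still have to decode it.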

In fact your commit does two things, because you also added a modification to raise an error when an escape sequence is not valid.

I could adopt the approach of storing the string in its original form, but this would require modifying the bytecode generator to interpret the escape sequences.
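For illustration, the decoding step that would have to run somewhere (lexer or bytecode generator) might look roughly like this. The name `decode_escapes` and the exact error text are assumptions; the sketch covers only the common escapes plus `\ddd` and `\xXX`, and its error message is not guaranteed to match luajit's.

```lua
-- Hypothetical sketch: decode the escapes in the body of a string literal
-- (delimiters already stripped), raising on an invalid sequence.
local simple = {
    a = "\a", b = "\b", f = "\f", n = "\n", r = "\r", t = "\t", v = "\v",
    ["\\"] = "\\", ['"'] = '"', ["'"] = "'", ["\n"] = "\n",
}

local function decode_escapes(body)
    local out, i = {}, 1
    while i <= #body do
        local c = body:sub(i, i)
        if c ~= "\\" then
            out[#out + 1] = c
            i = i + 1
        else
            local e = body:sub(i + 1, i + 1)
            local digits = body:match("^%d%d?%d?", i + 1)  -- \ddd, up to 3 digits
            local hex = body:match("^x(%x%x)", i + 1)      -- \xXX, exactly 2 digits
            if simple[e] then
                out[#out + 1] = simple[e]
                i = i + 2
            elseif digits and tonumber(digits) <= 255 then
                out[#out + 1] = string.char(tonumber(digits))
                i = i + 1 + #digits
            elseif hex then
                out[#out + 1] = string.char(tonumber(hex, 16))
                i = i + 4
            else
                -- Report the raw text up to the offending escape.
                error("invalid escape sequence near '" .. body:sub(1, i + 1) .. "'", 0)
            end
        end
    end
    return table.concat(out)
end
```

Calling `decode_escapes` on the body `\3\v\x4h4\t` raises an error, because `\x` is not followed by two hexadecimal digits.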

Like you, I also think that for the language toolkit it is better to interpret the escape sequences in the lexer phase. It is not "natural" to replace escape sequences only in the bytecode generator, at the end of the pipeline. Keeping the string in its original form is probably fine for languages that target only Lua code generation.

I think I'm going to cherry-pick from your commit only the part where you raise an error for invalid escape sequences.

In any case, thank you for sharing that; your focus on quality and clear error messages is really a good thing.