Closed muscar closed 9 years ago
Thanks for the bug report, and useful test case!
Proper unicode support is on the TODO list. I was waiting for Alex (the lexer library) to support unicode. It has done so for a while now, so it is probably time to fix this issue. If you really need this feature urgently then feel free to get hacking; I will gladly accept patches. Otherwise I will try to get around to it when a chunk of spare time becomes available.
No problem :).
I can try to implement it, but I haven't used Alex before. I can give it a go if you can point me in the right direction. I looked at the Lexer.x file in the source tree, and the definition for short string literals ($short_str_char = [^ \n \r ' \" \\]) seems like it should support unicode. I created a small test program with a lexer using this definition and it seems to work fine. The only difference is that I was using the basic Alex wrapper and alexScanTokens. I see that the lexer in the source tree is not using a wrapper, so I guess that's a starting point, but, as I said, any hints as to where to start would be great.
Cool!
I don't have any good pointers other than the Alex documentation. I just did a cursory scan of the docs and it does seem like it should "just work", but then again I haven't thought about it very hard.
This has been on my TODO list for ages, so I really appreciate that you are looking into it. It would be great to get to the bottom of the problem.
Ok, I'll try to see if I can fix this issue :).
Hi @muscar, I have addressed this issue in commit 24fd2a271d69029d876dd03b2c9ffd49c15d22af.
The parser now supports files in UTF8 encoding. I'm not sure if I will bother with other encodings at the moment because the user can easily convert to UTF8 before parsing.
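For anyone with source files in another encoding, the conversion step mentioned above might look roughly like the sketch below. It is not part of the parser or of the commit; the helper name `convert_to_utf8` and the default source encoding of Latin-1 are assumptions made for illustration, and the real encoding of a given file has to be known or detected separately.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode the file at `src` as UTF-8 and write the result to `dst`.

    Hypothetical helper: `source_encoding` is an assumption about the input
    and must match how the file was actually saved.
    """
    # Work on raw bytes so that line endings are preserved exactly.
    raw = Path(src).read_bytes()
    text = raw.decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

A call such as `convert_to_utf8("script_latin1.py", "script_utf8.py")` (file names made up here) would produce a copy that can then be handed to the parser.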
If you get the chance to test this out and you find any problems please report them in this issue.
Also note that there is a separate package for testing the parser: https://github.com/bjpop/language-python-test
The parser doesn't seem to support characters from the Latin-1 supplement unicode range.
Use the following program to test:
Running the program with unicode characters from the Latin-1 supplement doesn't work: