bjpop / language-python

A parser for Python 2.x and 3.x written in Haskell

Latin-1 supplement characters not recognised in string literals #21

Closed: muscar closed this issue 9 years ago

muscar commented 9 years ago

The parser doesn't seem to support characters from the Latin-1 Supplement Unicode block.

Use the following program to test:

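module Main where

import System.Environment (getArgs)
import Language.Python.Version2.Lexer

main :: IO ()
main = do
  args <- getArgs
  case args of
    -- Avoid crashing on (args !! 0) when no argument is supplied.
    [] -> putStrLn "usage: test <python-source>"
    (src : _) ->
      -- The name is qualified to avoid the clash with Prelude's lex.
      -- "<test>" is the file name reported in source locations.
      case Language.Python.Version2.Lexer.lex src "<test>" of
        Left err -> print err
        Right tokens -> mapM_ print tokens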

Running the program on Unicode characters from the Latin-1 Supplement ('¡', '£') fails, while a character outside that block ('™', U+2122) lexes correctly:

> test "u'¡'"
UnexpectedChar '\'' (Sloc {sloc_filename = "<test>", sloc_row = 1, sloc_column = 2})
> test "u'™'"
UnicodeStringToken {token_span = SpanCoLinear {span_filename = "<test>", span_row = 1, span_start_column = 1, span_end_column = 4}, token_literal = "u'\8482'"}
NewlineToken {token_span = SpanPoint {span_filename = "<test>", span_row = 1, span_column = 5}}
> test "u'£'"
UnexpectedChar '\'' (Sloc {sloc_filename = "<test>", sloc_row = 1, sloc_column = 2})
>
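To make the pattern in the transcript easier to see, here is a minimal sketch that prints the code point of each test character and whether it falls in the Latin-1 Supplement block (U+0080 to U+00FF). The helper names `inLatin1Supplement` and `describe` are made up for illustration and are not part of language-python. The sketch only classifies the characters; it does not explain why the lexer rejects them.

```haskell
module Main where

import Data.Char (ord, toUpper)
import Numeric (showHex)

-- | True when the character lies in the Latin-1 Supplement block
-- (U+0080..U+00FF). Hypothetical helper, for illustration only.
inLatin1Supplement :: Char -> Bool
inLatin1Supplement c = c >= '\x80' && c <= '\xFF'

-- | Render a character as "U+XXXX" followed by its block membership.
describe :: Char -> String
describe c = "U+" ++ pad hex ++ " " ++ membership
  where
    hex = map toUpper (showHex (ord c) "")
    -- Left-pad with zeros to at least four hex digits.
    pad s = replicate (4 - length s) '0' ++ s
    membership
      | inLatin1Supplement c = "(Latin-1 Supplement)"
      | otherwise            = "(outside Latin-1 Supplement)"

main :: IO ()
main =
  -- The three characters from the transcript, written as escapes so the
  -- source file's own encoding does not matter: '¡', '™', '£'.
  mapM_ (putStrLn . describe) "\xA1\x2122\xA3"
```

The two failing inputs ('¡' at U+00A1 and '£' at U+00A3) are the ones inside the block; '™' sits at U+2122, which matches the `\8482` shown in the token literal above.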

bjpop commented 9 years ago

Thanks for the bug report, and useful test case!

Proper Unicode support is on the TODO list. I was waiting for Alex (the lexer library) to support Unicode. It has done so for a while now, so it is probably time to fix this issue. If you really need this feature urgently then feel free to get hacking; I will gladly accept patches. Otherwise I will try to get around to it when a chunk of spare time becomes available.

muscar commented 9 years ago

No problem :).

I can try to implement it, but I haven't used Alex before. I can give it a go if you can point me in the right direction. I looked at the Lexer.x file in the source tree, and the definition for short string literals ($short_str_char = [^ \n \r ' \" \\]) seems like it should support Unicode. I created a small test program with a lexer using this definition and it seems to work fine. The only difference is that I was using the basic Alex wrapper and alexScanTokens. I see that the lexer in the source tree is not using a wrapper, so I guess that's a starting point, but, as I said, any hints as to where to start would be great.

bjpop commented 9 years ago

Cool!

I don't have any good pointers other than the Alex documentation. I just did a cursory scan of the docs and it does seem like it should "just work", but then again I haven't thought about it very hard.

This has been on my TODO list for ages, so I really appreciate that you are looking into it. It would be great to get to the bottom of the problem.

muscar commented 9 years ago

Ok, I'll try to see if I can fix this issue :).

bjpop commented 9 years ago

Hi @muscar, I have addressed this issue in commit 24fd2a271d69029d876dd03b2c9ffd49c15d22af.

The parser now supports files in UTF-8 encoding. I'm not sure if I will bother with other encodings at the moment because the user can easily convert to UTF-8 before parsing.
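For anyone who needs the conversion step mentioned above, here is one possible sketch using only `System.IO` from base. It assumes the input file is encoded as ISO-8859-1 (Latin-1); the function name `latin1ToUtf8` is made up for this example, and the source and destination paths must be different files.

```haskell
module Main where

import System.Environment (getArgs)
import System.IO

-- | Read a file as ISO-8859-1 (Latin-1) and write the same text as UTF-8.
-- Hypothetical helper; assumes src and dst are different files.
latin1ToUtf8 :: FilePath -> FilePath -> IO ()
latin1ToUtf8 src dst =
  withFile src ReadMode $ \hIn ->
    withFile dst WriteMode $ \hOut -> do
      hSetEncoding hIn latin1   -- decode input bytes as Latin-1
      hSetEncoding hOut utf8    -- encode output text as UTF-8
      -- The lazy input is fully consumed by hPutStr before either
      -- handle is closed by withFile.
      contents <- hGetContents hIn
      hPutStr hOut contents

main :: IO ()
main = do
  args <- getArgs
  case args of
    [src, dst] -> latin1ToUtf8 src dst
    _          -> hPutStrLn stderr "usage: convert <latin1-input> <utf8-output>"
```

The converted file can then be handed to the parser as usual. For other source encodings, `mkTextEncoding` from `System.IO` can look up an encoding by name in place of `latin1`.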

If you get the chance to test this out and you find any problems please report them in this issue.

Also note that there is a separate package for testing the parser: https://github.com/bjpop/language-python-test