Improve String parser to allow all unicode values

MitchTurner commented 2 years ago

The Plutus Core spec says that strings are allowed to be any Unicode string. The parser currently doesn't support that. For example, my proptest quickly found this innocuous string that broke the parser:

"z2�ட@ઋ𑌷Ⱥ\"¥`\\?𐊧�M'ㄸ·ä�"

Specifically, the quotes in the middle mess it up.

Probably will never come up, but it's good to uphold contracts even if they are edge cases.

SmaugPool commented 1 year ago

Problem

The issue is deciding which escape sequences are valid.

For now, the Aiken UPLC parser does not support any:

rule string() -> String
    = "\"" s:[^ '"']* "\"" { String::from_iter(s) }
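To make the failure concrete, here is a minimal std-only sketch of what that rule does. The function name `parse_string_no_escapes` is hypothetical and this is not the actual peg-generated code; it only mirrors the rule's behavior of reading up to the next double quote with no escape handling.

```rust
/// Hypothetical stand-in for the `string()` rule: consume an opening quote,
/// then every character up to the next `"`, with no escape handling.
/// Returns the parsed contents and the unconsumed remainder.
fn parse_string_no_escapes(input: &str) -> Option<(String, &str)> {
    let rest = input.strip_prefix('"')?;
    // The first `"` ends the string, even if a backslash precedes it.
    let end = rest.find('"')?;
    Some((rest[..end].to_string(), &rest[end + 1..]))
}
```

On the input `"a\"b"` this stops at the escaped quote and yields `a\`, leaving `b"` unconsumed, which is the kind of leftover that makes the surrounding parse fail.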

Plutus Spec

The Plutus Core Spec says in Appendix A.1:

Concrete syntax for strings. Strings are represented as sequences of Unicode characters enclosed in double quotes, and may include standard escape sequences.

However, although some escape sequences are standardized for particular languages, such as C, there is as far as I know no single set of "standard escape sequences".

PlutusTx

PlutusTx conText seems to rely on megaparsec charLiteral:

-- | Parser for string constants. They are wrapped in double quotes.
conText :: Parser T.Text
conText = lexeme . fmap T.pack $ char '\"' *> manyTill Lex.charLiteral (char '\"')

Which implements the Haskell Report grammar rules:

The literal character is parsed according to the grammar rules defined in the Haskell report.

I'm not sure exactly what those rules support, but it seems to be what is described here: https://book.realworldhaskell.org/read/characters-strings-and-escaping-rules.html

That includes quite a lot of uncommon ones, and uses \xHEX for Unicode escape sequences (instead of C's \uHEX or the common \u{HEX}, as in Rust).
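As an illustration of how the two notations differ for a parser, here is a rough std-only sketch that decodes either form. The function name `decode_hex_escape` is hypothetical, and the sketch assumes the Haskell-style form takes a greedy run of hex digits while the braced form is delimited by `}`; it does not cover the rest of the Haskell escape grammar.

```rust
/// Decode one hex escape, given the text that follows the backslash.
/// Accepts Haskell-style `x41` (assumed: greedy run of hex digits) and
/// Rust-style `u{41}`. Returns the character and the unconsumed remainder.
fn decode_hex_escape(after_backslash: &str) -> Option<(char, &str)> {
    let (digits, rest) = if let Some(body) = after_backslash.strip_prefix("u{") {
        // Braced form: the digits end at the closing brace.
        let close = body.find('}')?;
        (&body[..close], &body[close + 1..])
    } else if let Some(body) = after_backslash.strip_prefix('x') {
        // Unbraced form: take hex digits for as long as they continue.
        let len = body
            .find(|c: char| !c.is_ascii_hexdigit())
            .unwrap_or(body.len());
        (&body[..len], &body[len..])
    } else {
        return None;
    };
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let code = u32::from_str_radix(digits, 16).ok()?;
    // Rejects surrogates and values above U+10FFFF.
    Some((char::from_u32(code)?, rest))
}
```

One practical difference shows up immediately: in the unbraced form, `x41B` decodes as U+041B rather than `A` followed by `B`, whereas the braced form makes the boundary explicit.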

Aiken

It may also make sense for the Aiken UPLC compiler to support the same escape sequences as the Aiken language.

For now, Aiken seems to support a few single-character escape sequences in its escape lexer, but no Unicode ones:

let escape = just('\\').ignore_then(
    just('\\')
        .or(just('/'))
        .or(just('"'))
        .or(just('b').to('\x08'))
        .or(just('f').to('\x0C'))
        .or(just('n').to('\n'))
        .or(just('r').to('\r'))
        .or(just('t').to('\t')),
);

I'm also not sure why it supports the odd \/ escape, since / does not require escaping.
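For reference, the same set of single-character escapes can be expressed without a parser combinator library. This is only a sketch: the function name `unescape` and the error messages are hypothetical, and it handles exactly the escapes listed in the lexer above, nothing more.

```rust
/// Replace the single-character escapes listed above with the characters
/// they denote. `body` is the text between the quotes, still escaped.
fn unescape(body: &str) -> Result<String, String> {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // A backslash must be followed by one of the known escape characters.
        let decoded = match chars.next() {
            Some('\\') => '\\',
            Some('/') => '/',
            Some('"') => '"',
            Some('b') => '\x08',
            Some('f') => '\x0C',
            Some('n') => '\n',
            Some('r') => '\r',
            Some('t') => '\t',
            Some(other) => return Err(format!("unknown escape sequence: \\{}", other)),
            None => return Err("dangling backslash at end of string".to_string()),
        };
        out.push(decoded);
    }
    Ok(out)
}
```

Rejecting unknown escapes, rather than passing them through, is a design choice made here for the sketch; whichever set is eventually agreed on would decide that behavior.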

Conclusion

We need to decide on a "standard" set of escape sequences and whether it should match the PlutusTx one, and maybe try to get it made official in the Plutus Core spec.
We need to decide whether we want the same set in the Aiken language and the Aiken UPLC compiler.
We can then implement those, with tests and documentation in the guide.
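Whichever set is chosen, the first implementation step is probably finding where the literal ends without being fooled by an escaped quote. A minimal std-only sketch of that step follows; the function name `split_string_literal` is hypothetical, and it assumes a backslash always escapes exactly the one character after it.

```rust
/// Split a quoted literal off the front of `input`, treating a backslash as
/// escaping whichever character follows it. Returns the raw (still escaped)
/// body and the unconsumed remainder.
fn split_string_literal(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix('"')?;
    let mut chars = rest.char_indices();
    while let Some((i, c)) = chars.next() {
        match c {
            // Skip the escaped character so an escaped quote is not a terminator.
            '\\' => {
                chars.next()?;
            }
            // An unescaped quote closes the literal.
            '"' => return Some((&rest[..i], &rest[i + 1..])),
            _ => {}
        }
    }
    // Ran out of input before finding a closing quote.
    None
}
```

The body returned here is still escaped, so decoding the agreed escape sequences would be a separate pass over it.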

rvcas commented 1 year ago

@SmaugPool

cool that makes sense. Thanks for writing this.
