GaloisInc / cryptol

Cryptol: The Language of Cryptography
https://galoisinc.github.io/cryptol/master/RefMan.html
BSD 3-Clause "New" or "Revised" License
1.14k stars 126 forks source link

Support Unicode Strings #863

Open jpziegler opened 4 years ago

jpziegler commented 4 years ago

Many crypto puzzles, such as substitution ciphers, use unusual characters in their alphabets. (For example, Edgar Allen Poe's Gold-Bug cipher.) It would be useful, for demos and recreational crypto at least, to allow such characters to be used as data in Cryptol.

Currently, Cryptol supports Unicode for symbol names, but not Strings.

It's possible that variable-length encodings would be challenging given Cryptol's type system, but UTF-32 might make sense.

brianhuffman commented 4 years ago

I can think of a few possible designs for this.

  1. Unicode strings denote lists of UTF-8 bytes. This means that a string like "día" would denote the 4-byte string [0x64,0xC3,0xAD,0x61]. Unfortunately single-character literals wouldn't work beyond the 128 ascii codepoints, as those are the only 1-byte UTF-8 points. Also, it seems like a variable-length encoding would not be particularly useful for e.g. substitution ciphers. For these reasons I don't like this choice.

  2. Redefine Char from [8] to a larger fixed width. We could choose [16], which would cover most of unicode, or [32], which would yield UTF-32, as you suggested. One problem is that existing cryptol code written to use the 8-bit Char type will break. Also (and I don't know if this is actually something that people do) currently we can use :readByteArray and :writeByteArray to read/write strings to files, and we'd lose that ability if we got rid of 8-bit strings.

  3. Characters have polymorphic width, just like decimal literals. This would allow ascii strings to have type [n][8] (or even [n][7]), whereas strings containing codepoints beyond the first 256 could fit into wider types like [n][16] or [n][32]. So e.g. a character literal like 'Γ' would mean exactly the same thing as writing the decimal number 915, and we'd have 'Γ' : {a} (Literal 915 a) => a.

I think choice 3 would be really easy to implement, because we basically implement it that way already (except that we add an extra type constraint of [8] to every character) as we can see from this type error message:

Cryptol> 'Γ'

[error] at <interactive>:1:1--1:4:
  Unsolvable constraints:
    • 8 >= 10
        arising from
        use of literal or demoted expression
        at <interactive>:1:1--1:4
    • Reason: It is not the case that 8 >= 10
weaversa commented 4 years ago

To support variant 3: We have done specs that use both 6- and 7-bit ascii, and the conversions were a bit cumbersome.

brianhuffman commented 4 years ago

I made a draft PR (#864) so that people can try out polymorphic character literals, and let us know what you think. It was indeed very easy to implement.

The change breaks a few tests in our regression test suite: I had to add type constraints to string literals in six different tests in order to make them work. (The failures were things like :sat or :check complaining about a non-monomorphic type, or warnings about defaulting to Integer.)

jpziegler commented 4 years ago

An alternative solution could be some kind of printing function or command, such as an enhanced trace.