String refinement types

njordhov commented 4 years ago

Clarity should support declaring string refinement types that refine the native string type from issue #3 with an optional character set. This is similar to the integer refinement types proposed in issue #13, and to the list type refined with an entry type, and it uses a similar syntax. For example, the code below declares a 7-character string that consists of characters from the latin1 character set:

(string 7 latin1)

Refining string types with character sets has multiple benefits, including:

1) It can protect against many exploits involving UTF-8 homoglyphs, by allowing contract developers to restrict the characters used in strings.

2) Limiting the range of characters allows less costly storage of strings, instead of charging based on pessimistic cost estimates that assume 4 bytes for each UTF-8 character.

3) The VM can verify that submitted strings contain only the expected characters, automating the type checking instead of requiring string validation in the code.
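To make points 1 and 3 concrete, here is a minimal sketch in Python (not Clarity, and not a proposed implementation). `check_refined_string` is a hypothetical helper that mimics what a VM might check for a type like `(string 7 latin1)`. It assumes the length is an upper bound, which the proposal does not spell out, and it uses Python's built-in `latin1` codec as a stand-in for the character set test.

```python
def check_refined_string(value: str, max_len: int, charset: str = "latin1") -> bool:
    """Return True if value fits a hypothetical (string max_len charset) type."""
    # Assumption: the declared length is a maximum, counted in characters.
    if len(value) > max_len:
        return False
    try:
        # Encoding fails if any character falls outside the character set.
        value.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


latin_name = "paypal"
# Cyrillic U+0430 renders like Latin "a" but is outside latin1.
spoofed_name = "p\u0430yp\u0430l"

print(check_refined_string(latin_name, 7))
print(check_refined_string(spoofed_name, 7))
```

With a check like this done by the VM, the homoglyph variant is rejected before contract code ever sees it, so the contract needs no hand-written validation loop.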

ISO 8859-1, aka latin1, is a good initial character set to offer for string refinement types. Its 256 characters correspond to the first 256 code points of Unicode, providing a superset of ASCII that can be stored as a single byte per character.
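The storage argument can be checked with a few lines of Python. This is only an illustration: `byte_costs` is a hypothetical helper, and the "pessimistic" figure simply applies the 4 bytes per character estimate mentioned above. It says nothing about how Clarity actually prices storage.

```python
def byte_costs(text: str) -> dict:
    """Compare byte counts for one string under three accounting schemes."""
    return {
        "latin1": len(text.encode("latin1")),  # one byte per character
        "utf8": len(text.encode("utf-8")),     # one to four bytes per character
        "pessimistic": 4 * len(text),          # worst-case estimate, 4 bytes each
    }


# Every byte value 0..255 decodes to the code point with the same number.
all_latin1 = bytes(range(256)).decode("latin1")
code_points = [ord(ch) for ch in all_latin1]

costs = byte_costs("Se\u00f1or")  # "Señor", 5 characters
print(costs)
```

For this 5-character example the latin1 encoding takes 5 bytes, while a worst-case UTF-8 estimate would charge for 20.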

~~Blockstack~~ Stacks is an international community. While ASCII is Anglo-centric, providing only the letters a-z, the latin1 character set provides letters for a wide range of written languages that use Latin-based alphabets. Requiring use of full UTF-8 to go beyond a-z penalizes those supporting other languages, not only with a 4x added cost, but also with the risk of exploits.

Going forward, string refinement types should support other character sets and further refinement options as needed by the community.

psq commented 4 years ago

I would quote the refinement name rather than needing a new keyword for each new refinement:

(string 7 "latin1")

psq commented 4 years ago

Would this allow automatic conversion when going to a superset? I.e., a latin1 string can be used anywhere a UTF-8 one can be. I would think yes.

What about the other way around? Either not allowed, or a runtime failure if some characters do not fit the constraint (my preferred way).
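A rough Python sketch of the two directions, purely to illustrate the semantics being discussed. `widen_to_utf8` and `narrow_to_latin1` are hypothetical names, and the Python exception stands in for whatever runtime failure Clarity would raise.

```python
def widen_to_utf8(latin1_bytes: bytes) -> bytes:
    """latin1 -> UTF-8: always succeeds, since every latin1 character exists in Unicode."""
    return latin1_bytes.decode("latin1").encode("utf-8")


def narrow_to_latin1(utf8_bytes: bytes) -> bytes:
    """UTF-8 -> latin1: raises UnicodeEncodeError if a character is outside latin1."""
    return utf8_bytes.decode("utf-8").encode("latin1")


widened = widen_to_utf8(b"caf\xe9")                      # "café" stored as latin1
narrowed = narrow_to_latin1("caf\u00e9".encode("utf-8"))  # fits, so it succeeds

try:
    narrow_to_latin1("\u20ac".encode("utf-8"))  # euro sign U+20AC is not in latin1
    narrowing_failed = False
except UnicodeEncodeError:
    narrowing_failed = True
```

Widening can be implicit because it cannot fail. Narrowing needs either a static rejection or a runtime check like the one above.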

clarity-lang / reference

String refinement types #19