Raku / old-design-docs

Raku language design documents
https://design.raku.org/
Artistic License 2.0

<:Foo> syntax in regexes ambiguous #118

Open jnthn opened 7 years ago

jnthn commented 7 years ago

S05 defines <:Foo> as follows:

Unicode properties are indicated by use of pair notation in place of a normal rule name:

<:Letter>   # a letter
<:!Letter>  # a non-letter

Properties with arguments are passed as the argument to the pair:

<:East_Asian_Width<Narrow>>
<:!Blk<ASCII>>

The second form is unambiguous. The first, less so. Here's a quote from the Unicode database (in PropertyValueAliases.txt):

NOTE: Property value names are NOT unique across properties. For example:

AL means Arabic Letter for the Bidi_Class property, and AL means Above_Left for the Canonical_Combining_Class property, and AL means Alphabetic for the Line_Break property.

In addition, some property names may be the same as some property value names. For example:

sc means the Script property, and Sc means the General_Category property value Currency_Symbol (Sc)

The combination of property value and property name is, however, unique.

Which raises the question of what <:AL> would mean, or <:Sc>. The one that actually tripped me up is <:space>, which can either be an alias for the WSpace property (per PropertyAliases.txt):

WSpace                   ; White_Space                 ; space

Or a property value name from the Line_Break (lb) property:

lb ; SP                               ; Space

The ambiguity is currently resolved by the order in which we make entries into the lookup hash, which is defined by the order in which we generate the C code in ucd2c.pl, which in turn is randomized due to Perl 5 hash order randomization. So you can get spectest failures, regenerate from the exact same Unicode database version and ucd2c.pl, and "get lucky" next time around. I came upon this by getting "unlucky" when doing the Unicode 9 database version bump, but it's been a problem all along.
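The failure mode can be reproduced in miniature. The Python sketch below is not how ucd2c.pl or MoarVM is written: `build_lookup` and the `ENTRIES` list are hypothetical, and only the two meanings of `space` come from the excerpts above. It assumes names are compared loosely, so both are stored in lowercase. A flat table built by plain insertion keeps whichever entry is written last, so changing the iteration order changes what `space` resolves to.

```python
# Hypothetical flat lookup: property aliases and property value names share one table.
ENTRIES = [
    ("space", "White_Space"),  # alias of the WSpace property (PropertyAliases.txt)
    ("space", "Line_Break"),   # value name SP / Space of the lb property
]


def build_lookup(entries):
    """Insert entries in the given order; a later entry silently overwrites an earlier one."""
    table = {}
    for name, prop in entries:
        table[name] = prop
    return table


forward = build_lookup(ENTRIES)
backward = build_lookup(reversed(ENTRIES))

# Same data, different insertion order, different answer.
print(forward["space"], backward["space"])
```

With a randomized generation order, either outcome can end up in the generated table, which matches the "lucky"/"unlucky" behaviour described above.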

patch commented 7 years ago

I think S05 is lacking intended details. No regex engine allows for arbitrary property values for all properties without the associated names, due to the obvious conflicts. Most regex engines allow standalone General_Category values and some allow standalone Script and/or Block values (which do conflict). Perl 5 supports all three, with a preference for Script over Block when they conflict.

This discussion is happening simultaneously for an active ECMAScript proposal and the current plan is to only support standalone values for General_Category with the option to expand in the future if needed: https://github.com/mathiasbynens/es-regexp-unicode-property-escapes

Also, supporting Script instead of Script_Extensions would be a mistake since the latter is generally what people expect and should be encouraged over Script. I personally think that the General_Category-only route is by far the safest and most straightforward. If an additional property were to be supported, Script_Extensions is the next most useful and does not conflict with General_Category by design.

patch commented 7 years ago

A good description of Script (sc) vs. Script_Extensions (scx): http://unicode.org/reports/tr18/#Script_Property

samcv commented 7 years ago

As part of my Unicode Grant I am having to address this.

From the perspective of implementing it in MoarVM, we are given a name, let's say "Latin", and look up which property is associated with it. In this case it would be the "Script" property.

Currently MoarVM throws all the property values in together and assumes that each value name belongs to exactly one property, which does not work in practice.

As I work on re-implementing this part of the code I need to decide which property values should be resolvable to property names (which is needed for regex without specifying the actual property you are trying to query).

I am going to put together a list of all of the conflicts and we can hopefully decide how we want to go about prioritizing them. At the very least we will know where all the overlaps are, which ones we want to prioritize, and which are inconsequential.
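A list like that can be produced by grouping value names by the properties that use them. The following is a minimal Python sketch rather than the actual tooling: `find_conflicts` and the `PAIRS` sample are hypothetical, with the pairs taken from the Unicode note and the "Latin" example quoted earlier in this thread.

```python
from collections import defaultdict

# A few (property, value name) pairs from the excerpts quoted in this thread.
PAIRS = [
    ("Bidi_Class", "AL"),
    ("Canonical_Combining_Class", "AL"),
    ("Line_Break", "AL"),
    ("Script", "Latin"),
]


def find_conflicts(pairs):
    """Map each value name to the properties using it, keeping only names with several owners."""
    owners = defaultdict(list)
    for prop, value in pairs:
        if prop not in owners[value]:
            owners[value].append(prop)
    return {value: props for value, props in owners.items() if len(props) > 1}


conflicts = find_conflicts(PAIRS)
print(conflicts)
```

Here "Latin" drops out because only Script uses it, while "AL" is reported with all three owning properties.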

samcv commented 7 years ago

# All except <True False T F Yes No Y N> and Script/Block overlaps
L => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type", "Bidi_Class", "Jamo_Short_Name", "Canonical_Combining_Class", "General_Category", "Joining_Type"],
Other => ["Indic_Syllabic_Category", "Grapheme_Cluster_Break", "Word_Break", "Sentence_Break", "General_Category"],
EX => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
Numeric => ["Word_Break", "Line_Break", "Sentence_Break", "Numeric_Type"],
XX => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
CR => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
R => ["Bidi_Class", "Jamo_Short_Name", "Canonical_Combining_Class", "Joining_Type"],
M => ["NFKC_Quick_Check", "Jamo_Short_Name", "General_Category", "NFC_Quick_Check"],
LF => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
Regional_Indicator => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
AL => ["Bidi_Class", "Canonical_Combining_Class", "Line_Break"],
EM => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
NU => ["Word_Break", "Line_Break", "Sentence_Break"],
A => ["East_Asian_Width", "Jamo_Short_Name", "Canonical_Combining_Class"],
E_Base => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
RI => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
B => ["Bidi_Class", "Jamo_Short_Name", "Canonical_Combining_Class"],
ZWJ => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
EB => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
Extend => ["Grapheme_Cluster_Break", "Word_Break", "Sentence_Break"],
None => ["Bidi_Paired_Bracket_Type", "Decomposition_Type", "Numeric_Type"],
S => ["Bidi_Class", "Jamo_Short_Name", "General_Category"],
E_Modifier => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
NA => ["Age", "Hangul_Syllable_Type", "Indic_Positional_Category"],
Format => ["Word_Break", "Sentence_Break", "General_Category"],
C => ["Jamo_Short_Name", "General_Category", "Joining_Type"],
Right => ["Canonical_Combining_Class", "Indic_Positional_Category"],
Unassigned => ["Age", "General_Category"],
Control => ["Grapheme_Cluster_Break", "General_Category"],
Nukta => ["Indic_Syllabic_Category", "Canonical_Combining_Class"],
E => ["Joining_Group", "Jamo_Short_Name"],
Surrogate => ["Line_Break", "General_Category"],
Punctuation => ["Block", "General_Category"],
V => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type"],
Nonspacing_Mark => ["Bidi_Class", "General_Category"],
Number => ["Indic_Syllabic_Category", "General_Category"],
SP => ["Line_Break", "Sentence_Break"],
E_Base_GAZ => ["Grapheme_Cluster_Break", "Word_Break"],
Close_Punctuation => ["Line_Break", "General_Category"],
Unknown => ["Script", "Line_Break"],
GAZ => ["Grapheme_Cluster_Break", "Word_Break"],
LV => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type"],
IS => ["Canonical_Combining_Class", "Line_Break"],
CL => ["Line_Break", "Sentence_Break"],
Open_Punctuation => ["Line_Break", "General_Category"],
Private_Use => ["Block", "General_Category"],
Paragraph_Separator => ["Bidi_Class", "General_Category"],
Pe => ["Joining_Group", "General_Category"],
D => ["Jamo_Short_Name", "Joining_Type"],
Narrow => ["East_Asian_Width", "Decomposition_Type"],
NL => ["Word_Break", "Line_Break"],
Wide => ["East_Asian_Width", "Decomposition_Type"],
Virama => ["Indic_Syllabic_Category", "Canonical_Combining_Class"],
Hebrew_Letter => ["Word_Break", "Line_Break"],
U => ["Jamo_Short_Name", "Joining_Type"],
LE => ["Word_Break", "Sentence_Break"],
Left => ["Canonical_Combining_Class", "Indic_Positional_Category"],
Glue_After_Zwj => ["Grapheme_Cluster_Break", "Word_Break"],
Close => ["Bidi_Paired_Bracket_Type", "Sentence_Break"],
BB => ["Jamo_Short_Name", "Line_Break"],
HL => ["Word_Break", "Line_Break"],
P => ["Jamo_Short_Name", "General_Category"],
Maybe => ["NFKC_Quick_Check", "NFC_Quick_Check"],
EBG => ["Grapheme_Cluster_Break", "Word_Break"],
Combining_Mark => ["Line_Break", "General_Category"],
LVT => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type"],
FO => ["Word_Break", "Sentence_Break"],
H => ["East_Asian_Width", "Jamo_Short_Name"],
Ambiguous => ["East_Asian_Width", "Line_Break"],

Here are all the ones that are Block and Script overlaps:

# Only Script/Block overlap
Malayalam => ["Script", "Block"],
Sundanese => ["Script", "Block"],
Mahajani => ["Script", "Block"],
Pau_Cin_Hau => ["Script", "Block"],
Tibetan => ["Script", "Block"],
Sora_Sompeng => ["Script", "Block"],
Runic => ["Script", "Block"],
Thai => ["Script", "Block"],
Osage => ["Script", "Block"],
Rejang => ["Script", "Block"],
Bassa_Vah => ["Script", "Block"],
Gurmukhi => ["Script", "Block"],
Glagolitic => ["Script", "Block"],
Old_Hungarian => ["Script", "Block"],
Grantha => ["Script", "Block"],
Palmyrene => ["Script", "Block"],
Gothic => ["Script", "Block"],
Lao => ["Script", "Block"],
Nabataean => ["Script", "Block"],
Limbu => ["Script", "Block"],
Old_Persian => ["Script", "Block"],
Phoenician => ["Script", "Block"],
Tai_Le => ["Script", "Block"],
Ol_Chiki => ["Script", "Block"],
Khudawadi => ["Script", "Block"],
Old_Permic => ["Script", "Block"],
Elbasan => ["Script", "Block"],
Duployan => ["Script", "Block"],
Samaritan => ["Script", "Block"],
Syriac => ["Script", "Block"],
Devanagari => ["Script", "Block"],
Greek => ["Script", "Block"],
Lycian => ["Script", "Block"],
Ethiopic => ["Script", "Block"],
Thaana => ["Script", "Block"],
Hatran => ["Script", "Block"],
Siddham => ["Script", "Block"],
Psalter_Pahlavi => ["Script", "Block"],
Kharoshthi => ["Script", "Block"],
Mandaic => ["Script", "Block"],
Newa => ["Script", "Block"],
Kayah_Li => ["Script", "Block"],
Warang_Citi => ["Script", "Block"],
Multani => ["Script", "Block"],
Osmanya => ["Script", "Block"],
Georgian => ["Script", "Block"],
Armenian => ["Script", "Block"],
Sinhala => ["Script", "Block"],
Hiragana => ["Script", "Block"],
Shavian => ["Script", "Block"],
New_Tai_Lue => ["Script", "Block"],
Bamum => ["Script", "Block"],
Cyrillic => ["Script", "Block"],
Old_South_Arabian => ["Script", "Block"],
Myanmar => ["Script", "Block"],
Miao => ["Script", "Block"],
Meroitic_Cursive => ["Script", "Block"],
Tirhuta => ["Script", "Block"],
Coptic => ["Script", "Block"],
Caucasian_Albanian => ["Script", "Block"],
Hanunoo => ["Script", "Block"],
Tamil => ["Script", "Block"],
Avestan => ["Script", "Block"],
Cherokee => ["Script", "Block"],
Inscriptional_Pahlavi => ["Script", "Block"],
Kannada => ["Script", "Block"],
Tifinagh => ["Script", "Block"],
Javanese => ["Script", "Block"],
Inscriptional_Parthian => ["Script", "Block"],
Mro => ["Script", "Block"],
Cham => ["Script", "Block"],
Takri => ["Script", "Block"],
Hangul => ["Script", "Block"],
Old_Turkic => ["Script", "Block"],
Oriya => ["Script", "Block"],
Kaithi => ["Script", "Block"],
Ahom => ["Script", "Block"],
Linear_A => ["Script", "Block"],
Meetei_Mayek => ["Script", "Block"],
Egyptian_Hieroglyphs => ["Script", "Block"],
Ugaritic => ["Script", "Block"],
Buginese => ["Script", "Block"],
Tagalog => ["Script", "Block"],
Anatolian_Hieroglyphs => ["Script", "Block"],
Pahawh_Hmong => ["Script", "Block"],
Tangut => ["Script", "Block"],
Telugu => ["Script", "Block"],
Batak => ["Script", "Block"],
Phags_Pa => ["Script", "Block"],
Vai => ["Script", "Block"],
Mongolian => ["Script", "Block"],
Modi => ["Script", "Block"],
Bhaiksuki => ["Script", "Block"],
Lisu => ["Script", "Block"],
Lydian => ["Script", "Block"],
Brahmi => ["Script", "Block"],
Cuneiform => ["Script", "Block"],
Tai_Viet => ["Script", "Block"],
Syloti_Nagri => ["Script", "Block"],
Chakma => ["Script", "Block"],
Adlam => ["Script", "Block"],
Braille => ["Script", "Block"],
Marchen => ["Script", "Block"],
Deseret => ["Script", "Block"],
Imperial_Aramaic => ["Script", "Block"],
Arabic => ["Script", "Block"],
Khmer => ["Script", "Block"],
Balinese => ["Script", "Block"],
Bengali => ["Script", "Block"],
Bopomofo => ["Script", "Block"],
Tai_Tham => ["Script", "Block"],
Mende_Kikakui => ["Script", "Block"],
Hebrew => ["Script", "Block"],
Meroitic_Hieroglyphs => ["Script", "Block"],
Sharada => ["Script", "Block"],
Khojki => ["Script", "Block"],
Lepcha => ["Script", "Block"],
Saurashtra => ["Script", "Block"],
Tagbanwa => ["Script", "Block"],
Old_Italic => ["Script", "Block"],
Gujarati => ["Script", "Block"],
Carian => ["Script", "Block"],
Old_North_Arabian => ["Script", "Block"],
Ogham => ["Script", "Block"],
Buhid => ["Script", "Block"],
Manichaean => ["Script", "Block"],
Katakana => ["Script", "Block", "Word_Break"],

samcv commented 7 years ago

All of the property names that conflict with values are Bool properties:

«« IDC Conflict with property name [blk]  is a boolean property
«« VS Conflict with property name [blk]  is a boolean property
«« White_Space Conflict with property name [bc]  is a boolean property
«« Alphabetic Conflict with property name [lb]  is a boolean property
«« Hyphen Conflict with property name [lb]  is a boolean property
«« Ideographic Conflict with property name [lb]  is a boolean property
«« Lower Conflict with property name [SB]  is a boolean property
«« STerm Conflict with property name [SB]  is a boolean property
«« Upper Conflict with property name [SB]  is a boolean property

I would like property names to come first in priority:

  1. Property Name (e.g. <:White_Space>, <:Hyphen>)

If we set our preferred properties to be General_Category and Script, then we get 49 property values with overlaps. If we add a third preferred property Grapheme_Cluster_Break we only have 30 remaining.

From here we can resolve Canonical_Combining_Class, and also we should resolve Numeric_Type so that people can use <:Numeric> in their regex (I'm sure that there must already exist code where this is used so we need to make sure this is resolved as well).

Leaving us at a hierarchy of

  1. Property Name (e.g. <:White_Space>, <:Hyphen>)
  2. General_Category
  3. Script
  4. Grapheme_Cluster_Break
  5. Canonical_Combining_Class
  6. Numeric_Type

I am open to adding whichever properties people think most important to the ordered priority list as well.
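To make the proposed ordering concrete, here is a rough Python sketch of how a bare name could be resolved against it. This is only an illustration under assumptions, not the MoarVM implementation: `resolve`, `PROPERTY_NAMES` and `VALUE_OWNERS` are hypothetical names, and the sample data is copied from the conflict lists earlier in this thread.

```python
# Value-owning properties in the proposed priority order.
# Property names themselves are checked before any of these.
PRIORITY = [
    "General_Category",
    "Script",
    "Grapheme_Cluster_Break",
    "Canonical_Combining_Class",
    "Numeric_Type",
]

# Hypothetical sample data, copied from the lists in this thread.
PROPERTY_NAMES = {"White_Space", "Hyphen"}
VALUE_OWNERS = {
    "Hyphen": ["Line_Break"],
    "L": ["Grapheme_Cluster_Break", "Hangul_Syllable_Type", "Bidi_Class",
          "Jamo_Short_Name", "Canonical_Combining_Class", "General_Category",
          "Joining_Type"],
    "Numeric": ["Word_Break", "Line_Break", "Sentence_Break", "Numeric_Type"],
    "NU": ["Word_Break", "Line_Break", "Sentence_Break"],
}


def resolve(name):
    """Return ("property", name), ("value", owning_property), or None if unresolved."""
    if name in PROPERTY_NAMES:
        return ("property", name)
    owners = VALUE_OWNERS.get(name, [])
    if len(owners) == 1:
        return ("value", owners[0])
    for prop in PRIORITY:
        if prop in owners:
            return ("value", prop)
    return None  # unknown name, or still ambiguous under this hierarchy


for name in ("Hyphen", "L", "Numeric", "NU"):
    print(name, resolve(name))
```

In this sketch <:Hyphen> is treated as the boolean property rather than the Line_Break value, <:L> goes to General_Category, <:Numeric> goes to Numeric_Type, and <:NU> stays unresolved because none of its owners appear in the hierarchy.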

The ones with overlap remaining after this point:

NU => ["Word_Break", "Line_Break", "Sentence_Break"],
NA => ["Age", "Hangul_Syllable_Type", "Indic_Positional_Category"],
E => ["Joining_Group", "Jamo_Short_Name"],
SP => ["Line_Break", "Sentence_Break"],
CL => ["Line_Break", "Sentence_Break"],
D => ["Jamo_Short_Name", "Joining_Type"],
Narrow => ["East_Asian_Width", "Decomposition_Type"],
NL => ["Word_Break", "Line_Break"],
Wide => ["East_Asian_Width", "Decomposition_Type"],
Hebrew_Letter => ["Word_Break", "Line_Break"],
U => ["Jamo_Short_Name", "Joining_Type"],
LE => ["Word_Break", "Sentence_Break"],
Close => ["Bidi_Paired_Bracket_Type", "Sentence_Break"],
BB => ["Jamo_Short_Name", "Line_Break"],
HL => ["Word_Break", "Line_Break"],
Maybe => ["NFKC_Quick_Check", "NFC_Quick_Check"],
FO => ["Word_Break", "Sentence_Break"],
H => ["East_Asian_Width", "Jamo_Short_Name"],
Ambiguous => ["East_Asian_Width", "Line_Break"],
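For anyone who wants to try a different hierarchy, the remaining list can be recomputed by dropping every conflict that has at least one owner among the preferred properties. A small Python sketch, with a hypothetical `unresolved` helper and four sample entries copied from the full conflict list:

```python
# Preferred value-owning properties from the hierarchy above.
PREFERRED = {
    "General_Category",
    "Script",
    "Grapheme_Cluster_Break",
    "Canonical_Combining_Class",
    "Numeric_Type",
}

# Four sample entries copied from the full conflict list.
CONFLICTS = {
    "Control": ["Grapheme_Cluster_Break", "General_Category"],
    "Unknown": ["Script", "Line_Break"],
    "SP": ["Line_Break", "Sentence_Break"],
    "Narrow": ["East_Asian_Width", "Decomposition_Type"],
}


def unresolved(conflicts, preferred):
    """Keep only the value names that no preferred property claims."""
    return {
        value: props
        for value, props in conflicts.items()
        if not preferred.intersection(props)
    }


remaining = unresolved(CONFLICTS, PREFERRED)
print(sorted(remaining))
```

Adding a property to `PREFERRED` (Line_Break, say) and re-running shows which names it would settle.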

Any ideas about adding further properties to the hierarchy will be appreciated (even properties with no overlap at present, as of Unicode 9.0, could have one introduced later).