:ignoremark is overzealous

When using :ignoremark, all codepoints beyond the first in a grapheme cluster are ignored. This causes problems with clusters whose second codepoints and beyond are not combining marks.

For example, the following two codes produce matches:

"🇩🇪" ~~ /:ignoremark '🇩🇰'/
"ᄼᆡᇫ" ~~ /:m 'ᄼᆢ'/

But logically, they should not.

I'm not sure whether to consider this a Raku thing (needing to refine the roast), a MoarVM thing (it seems things could be solved by adjusting the ord_getbasechar() function), or a Rakudo thing (it manually handles things using .samemark, for instance). It's probably some combination of the three, but the latter two will depend on the first, hence opening up this problem solving issue.

The question really boils down to this: what should :ignoremark ignore?

The question isn't that simple. What follows is a (perhaps overkill) discussion of the issues. To me, there are are some things that absolutely must be ignored or should never be ignored, from a logical perspective, but there are others that exist in a grey uncertain area for me. In descending order of certainty (imho, of course, ymmv), I get:

Codepoints with the Mn property should be ignored These are the traditional diacritics that are included in roast, and align with virtually all examples of the functionality. Maintaining this is a must.
Codepoints with an L* property should not be ignored
Though they cluster for a variety of reasons, and some people might use Lm in a diacritical-like fashion, these are almost always seen as divisible, equal elements.
Codepoints with an So property should not be ignored.
This includes sequences such as the regional indicators. Similar to L*, if clustered, they are probably equalish elements, though may not be logically divisible.
Codepoints with an Sm property should probably be ignored For instance, this includes the variant selectors, such as U+FE0F to indicate emoji presentation.
Codepoints with an Mc property should probably not be ignored. Most of these are found with Indic scripts and I'll admit to not being super well versed in these or the expectations of users of them. I don't think users would expect 'पि' ~~ /:m पॣ/ to produce a successful match. Conceptually, पि (Lo Mc, /pi/ ) is very much प (Lo, /pa/) followed by as इ (Lo, /i/) where the इ replaces प's inherent vowel. To represent /pa.i/, पइ (Lo Lo) would be used. That Indic scripts have some Mn characters that seem to be used more like Latin's diacritics makes me think this is right. It also seems to be what might be expected in the few non-Indic scripts that use it like Maio where vowels are indicated with Mc, and tone with Mn.
I'm inclined to say Me should be ignored.
It's really a mixed bag of random combiners, but they feel closer generally to Mn than Mc or Sk. There's only 13 of them.

Where stuff gets really … puzzling … is in a sequence like 💇🏾‍♀️, composed of

So - HAIRCUT (U+1F487)
Sk - EMOJI MODIFIER FITZPATRICK TYPE-5 (U+1F3FE)
Cf - ZERO WIDTH JOINER (U+200D)
So - FEMALE SIGN (U+2640)
Mn - VARIATION SELECTOR-16 (U+FE0F)

These are specially defined by Unicode in an emoji list (and, props to @samcv ++, Raku recognizes them as such, so '💇🏾‍♀️'.chars returns 1). You'd think :ignoremark should ignore everything after the Sk (even though Sk elsewhere probably should count on its own), and that would make a lot of sense… until you get 🧑🏽‍❤️‍🧑🏾, which is defined as

So - ADULT (U+1F9D1)
Sk - EMOJI MODIFIER FITZPATRICK TYPE-4 (U+1F3FD)
Cf - ZERO WIDTH JOINER (U+200D)
So - HEAVY BLACK HEART (U+2764)
Mn - VARIATION SELECTOR-16 (U+FE0F)
Cf - ZERO WIDTH JOINER (U+200D)
So - ADULT (U+1F9D1)
Sk - EMOJI MODIFIER FITZPATRICK TYPE-5 (U+1F3FE)

Here, it's more clear that each element joined by the zero width joiner is a distinct element (and in fact, (97,8205,98).chrs.join.chars returns just two, with the ZWJ being counted as a cluster element with the first letter). Is there an expected output sequence for '🧑🏽‍❤️‍🧑🏾'.samemark('á')? Currently, it's 🧑́, and that doesn't feel right, but OTOH, I really have no idea what to expect with it either. Probably either left untouched, 🧑́‍❤́‍🧑́, or 🧑́🏽́‍❤́‍🧑́🏾́.

phew that was a long discussion.

tl;dr. Emoji suck and unnecessarily complicate Unicode. Long live emoticons ^_-

What ignoremark should ignore depends on its function in the grapheme, at least as far as MoarVM is concerned.

See Table 1b here https://www.unicode.org/reports/tr29/tr29-15.html#Default_Grapheme_Cluster_Table

So originally, before prepend character marks were around, MoarVM only supported: base? ( Mark | ZWJ | ZWNJ )+ and extended_base? ( Mark | ZWJ | ZWNJ )+ as the basechar.

However many years ago, I added support for prepend marks, which changed the way graphemes are stored slightly. So instead of always assuming the first codepoint in a grapheme is the base character, it stores a index.

https://github.com/MoarVM/MoarVM/blob/6bf54d784e38268a37e97d702b9b8ad1d3116069/src/strings/ops.c#L1136

return ord_getbasechar(tc, synth->codes[synth->base_index]);

This is the unicode grapheme statement which includes prepend (of course we aren't talking about emoji yet)

( CRLF
| Prepend* ( Hangul-syllable | !Control )
  ( Grapheme_Extend | Spacing_Mark)*
| . )

So, it's easy in the case of prepend* + base + extend*, but then we have hangul which is made up of multiple sounds https://en.wikipedia.org/wiki/Hangul#/media/File:Hangeul_letter_order.svg I don't speak Korean so I don't think I could best speak on how we should handle this. But on first thought i'd think we should match any of the codepoints making up a Hangul grapheme.

So it seems we are looking for "visible major element" (though this is more complicated with Emoji as you pointed out)

On the Emoji part: this is of course more complicated. Since unicode is generally not concerned with how text appears. The fact that '🧑🏽‍❤️‍🧑🏾' as your example looks like three characters but is actually 1.

So what should we do with this?

I think we should have a flag set on graphemes in MoarVM which are hangul or emoji sequences. In these we could do an exhaustive search of the codepoints in the grapheme, whereas if it's a standard cluster (with only 1 base codepoint) we can continue with the current method. (of course it will require changing many functions in MoarVM such as the ignoremark string indexing function.)

Of course this will need to imagine how samemark would work for Hangul and emoji. For emoji I think it's irreconcilable... I would give up, unless anyone has any good suggestions. For Hangul I would defer to a native Korean speaker for possibilities. I don't know replacing any of these with a single argument (as samemark currently only takes a single argument) makes sense.

Edit: By give up on the Emoji part, I mean we shouldn't focus too much on that. Of course we can look into it a bit... but I think it is not worth designing everything around that.

I'd reckon that avoiding anything special on emoji is probably best. They feel sufficiently ~arbitr~ idiosyncratic that manipulation with them is probably best left to modules.¹

I'm not a native Korean speaker, but I did take a semester of it back in college (so standard disclaimers of grains of salt, etc). Its encoding is … special, of course. The only marks that would be semi-expected for it are the old tonal markers U+302D and U+302E, but those apply syllable-wide (so you'd have a sequence like ㅇ ㅣ ㅇ 〮 → 잉〮 ) and are barely used (None of my fonts really support it and I couldn't find any non-image uses of it only). That said, I don't think any Korean would expect '불고기' ~~ /:m '밥'/ to produce a successful match, but it currently does.

Looking at the most current TR-29, clusters are defined as

crlf
| Control
| precore* core postcore*

Where core is

hangul-syllable
| ri-sequence      # regional identifiers (ri
| xpicto-sequence  # bane-of-ex– emojis
| [^Control CR LF]

We effectively currently ignore precore and postcore, but also ignore all but the first codepoint in core. I think a suitable adjustment would be to consider all of core as the base character, which would basically be what you propose — plus treating regional indicators and emoji combinations as a single block.

The only wrinkle might be in postcore which includes both Mc and Mn. I don't think that makes a ton of sense to the typical user of Indic scripts to ignore Mc (they have also Mn characters that apply to the syllable as a whole similar to Hangul's tonal marking) but I don't use those languages. At the same time, many non-Indic languages also use Mc where I don't feel like it should be mixed into core. But, it's probably easiest to ignore Mc and direct users to UAX #29 as our bases, leaving it to modules to provide more tailored, language-specific handling.². OTOH, I feel like erring on the side of of including Mc could avoid large amounts of unexpected information loss might be best,³ but it also loses the simplicity of just saying base = core, marks = pre/postcore.

I'm guessing the idea of a flag for characters could be generalized to clusters with * > 1* base codepoints, yeah? That way if Unicode extends its definition ofcore`, things could be more easily adjusted.

So a module might have options to :ignore-skin-tone, :ignore-gender, :ignore-hair-color for people, but :ignore-color for hearts, etc. Plus, who knows what crazy emoji sequence concoctions Unicode will come up with five years from now. Too unstable for Raku/MoarVM's core to be concerned with it.
To be fair, long term a various module-based solutions will probably be needed to handle all of the oddities that languages have, even for Latin-scripted languages have enough peculiarities that :i and :m break down. (Turkish, being an obvious example for :i, but also languages like Slovak where ch is considered one letter, or Spanish where ñ and n ought to be considered distinct with :m, but u, ü, and ú oughtn't)
Unlike Semitic languages where the absence of vowels doesn't seem to cause problems, I don't know to what extent other languages are readable without vowel indicators. That Urdu requires marking more vowels than Arabic and is basically the same language as Hindi tells me they probably aren't. But from a quick search online, Thai, at least, apparently is fairly readable without them.

Is there an option here to leave :ignoremark as is (as a best effort), and implement various gradations of 'zealousness' using a numerical indicator? So for example, something like:

> #hypothetical
> "🇩🇪" ~~ /:ignoremark '🇩🇰'/;
Unicode U+D83C U+DDE9 U+D83C U+DDEA

...remains unchanged and matches Unicode U+D83C U+DDE9 U+D83C U+DDEA, but:

> "🇩🇪" ~~ /:ignoremark1 '🇩🇰'/;
Unicode U+D83C U+DDE9 U+D83C

> "🇩🇪" ~~ /:ignoremark2 '🇩🇰'/;
Unicode U+D83C U+DDE9

> "🇩🇪" ~~ /:ignoremark3 '🇩🇰'/;
Unicode U+D83C

...behaves by successively stripping off one Unicode codepoint?

Is there an option here to leave :ignoremark as is (as a best effort), and implement various gradations of 'zealousness' using a numerical indicator? So for example, something […that…] behaves by successively stripping off one Unicode codepoint?

Well, I guess first it's important to see what's actually going on with :ignoremark. The idea, of course, is to compare only base characters. For most western scripts, this works splendidly by only considering the first codepoint (in decomposed form) and ignoring subsequent ones (as historically multi code point sequences were virtually always a single letter + one or more marks). So if I do:

'á' ~~ /:ignoremark â/

It first decomposes á _(U+00E1) into a _U+0061 plus ´ _(U+0301) and â into a _U+0061 plus ^ _(U+0302), and and then compares only the first letter, or a. They match and the match uses the original string á _(U+00E1). For Latin / Greek / Cyrillic / Arabic / Hebrew this is all you need.

"🇩🇪" ~~ /:ignoremark3 '🇩🇰'/;

The flag sequences follows the same logic, and each begins with the regional indicator D which on its own is fairly nonsensical. Nonetheless, each begins with the same initial regional indicator, and so it matches, producing 🇩🇪 _{(1F1E9 1F1EA)}. Ditto for Korean, where 가 _{(U+AC00 ≍ U+1100 U+1161)} will produce a successful match with 꽩_{(U+AF69 ≍ U+1100 U+1100 U+1160 U+1164 U+11AB U+11BD)} as both have the decomposed initial codepoint U+1100.

Using :ignoremark[1..9] I don't think would make a lot of sense in either system, because the number of marks/combing elements are highly variable, and both retaining or stripping X codepoints make little sense for those. The Brahmic systems — based on consultation with native speakers — function essentially like Korean, although the uniprops are quite different. Emoji sequences are ... even messier still since they can be effectively two base characters with separate "marks", but conjoined (if one person is marked with skinton and gender, and the other only gender, do you strip skintone from one and gender from the other or do you strip just the gender from the second when stripping a single character?).

That said, the idea of having different levels of ignoring is not an unreasonable one. But it's almost certainly better left to module space (and with some of the regex/token modules I've done, I wouldn't be surprised if I made one of for it).

Admittedly this isn't my area of expertise, but recently I had occasion to parse some Tanach Hebrew on SO:

https://stackoverflow.com/a/66540269/7270649

I was curious when I noticed the title and first line of Genesis differed in character markings. Here :ignoremark was my friend, and was able to match the two words:

say "1:\t", $/ if "בראשית" ~~ m/ בְּרֵאשִׁית /;
say "2:\t", $/ if "בראשית" ~~ m:ignoremark/ בְּרֵאשִׁית /; #OUTPUT:  ｢בראשית｣
say "3:\t", $/ if "בְּרֵאשִׁית" ~~ m/ בראשית /;
say "4:\t", $/ if "בְּרֵאשִׁית" ~~ m:ignoremark/ בראשית /; #OUTPUT:  ｢בְּרֵאשִׁית｣

Unfortunately, that's where the trail goes cold. I wanted to look at the codepoints stripped off by :ignoremark but have been unable to do so. No doubt you'll have a solution in a jiffy, but the question isn't one of solving this particular problem. The question is whether low-level tools exist in general to onboard linguistic researchers and get them using the Raku programming language (as opposed to some other programming language).

EDITED BELOW: So whatever you do decide, it seems that there could be a need for :adverb -level augmentation here. One thought now (taking your excellent summary into account) is that the base character has to be better defined, with recognition that the base character isn't always the first character. Also :ignoremark could be augmented to something like :ignoremark[-1,1], where no postfix means "match against the base character only (again, not necessarily the first character), 1 means ignore any suffixing characters, and -1 means ignore any prefixing characters (if any). EDITED ABOVE.

And for some sort of finer "sequential stripping" (element-wise decomposition), maybe what's really needed is a different adverb, such as :decompose[ (-∞..-1).Seq, 0, (1..∞).Seq ], or something similar. Because :ignoremark indicates a negative action ("ignore"), :decompose could indicate a positive action, as in ":decompose[ (-∞..-1).Seq, 0] means keep only prefix and base characters".

[edited to use ∞ symbol]

Codepoints with an Mc property should probably not be ignored. Most of these are found with Indic scripts and I'll admit to not being super well versed in these or the expectations of users of them. I don't think users would expect 'पि' ~~ /:m पॣ/ to produce a successful match. Conceptually, पि (Lo Mc, /pi/ ) is very much प (Lo, /pa/) followed by as इ (Lo, /i/) where the इ replaces प's inherent vowel. To represent /pa.i/, पइ (Lo Lo) would be used. That Indic scripts have some Mn characters that seem to be used more like Latin's diacritics makes me think this is right. It also seems to be what might be expected in the few non-Indic scripts that use it like Maio where vowels are indicated with Mc, and tone with Mn.

As a (novice) student of Sanskrit, I think I can confirm your intuition: the vowel diacritics should not be ignored by :ignoremark.

The Mn diacritics for Devanagari include the anusvara ं, but not visarga ः, which is Mc like the vowel diacritics. This is tricky, as they have a very similar function. Obviously marks change the sounds and meanings of words in all languages that use them, or there would be no point in having them. :ignoremark essentially constructs a somewhat arbitrary set of meaning distinctions it’s okay to ignore, without knowledge of the context in which those meaning differences are allegedly going to be unimportant. This includes differences like Italian è (is) and e (and), German Apfel (apple) and Äpfel (apples) … but in the latter case /:ignoremark Apfel/ would under no circumstances match Aepfel, linguistically equivalent to Äpfel.

I think it would be okay for :ignoremark to ignore the anusvara but not the visarga. It’s arguably less disruptive to Sanskrit at least (I don’t know Hindi or other languages with Indic scripts) than the above examples from Italian and German.

But I’m increasingly wondering whether :ignoremark is a good idea at all. I suppose it’s too late to take it out entirely? Or could a fully-flexibly customizable version be added where you can actually choose which codepoints to ignore down to the level of the individual property?

I’m increasingly wondering whether :ignoremark is a good idea at all.

Have you explored its discussion over the years?

A couple obvious places to look are a google for ignoremark and a search of Liz's IRC logs eg oldest first, #perl6.

Maybe review those and summarize what you find in this issue?

I suppose it’s too late to take it out entirely?

I'd say Raku philosophy is that nothing's too late if it's the right thing to do.

Or could a fully-flexibly customizable version be added?

Aiui :ignoremark is currently just True or False. It could presumably be turned into False or a truthy value to enable more nuance.

Aiui, in general, at least in the near term (next few years), this sort of thing is supposed to be the province of userland modules. That is to say, the available options for the truthy value, and the meaning of those options, would be determined by one or more userland modules.

That said, I'm just hand waving, and if doing this means it has to be an nqp module, or a MoarVM modules, well, as a starting point it would be good to nail down what kind of userland module would be needed to pull this off in a reasonable manner.

It may be that doing this reasonably cleanly also requires completion of the overhaul of the Rakudo frontend, including plugging userland modules into the compiler, which is, if I understand correctly, part of RakuAST. This might be a year or three away.

[let users] choose which codepoints to ignore down to the level of the individual property?

Ideally it would be whatever userland modules choose to make available as options.

At least, this is my understanding of the overall situation for Raku(do) for at least much of this decade.

Raku / problem-solving

:ignoremark is overzealous #276

So what should we do with this?