Better display of Unicode surrogate pairs on non-Unicode regex flavors (Feature Request)

jhmaster2000 commented 1 year ago

Making a new issue for this since I can't re-open the original.

I understand that this is how the regexes work under the hood, but it still isn't intuitive to hover over the second surrogate of a character that is rendered as a single glyph. The hitbox for the second surrogate's hover is literally 1 pixel wide.

What if hovering over a surrogate-pair character displayed both surrogates instead? For example, hovering over the character 𐌐 in a non-Unicode regex flavor could show a unified hover like this:

Text - matches literally (case-sensitive) the characters: � with index 55296₁₀ (D800₁₆ or 154000₈) and � with index 57104₁₀ (DF10₁₆ or 157420₈)

This way, the regex decomposition of the displayed Unicode character is easy to see, and users unaware of Unicode semantics are not misled into thinking their character is equal to only the first surrogate shown in the current hover.
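For reference, the two values in the mock hover above follow from the standard UTF-16 encoding of a code point above U+FFFF. A minimal sketch of that arithmetic follows; `toSurrogatePair` and `describeUnit` are hypothetical names for illustration, not Regex101 code:

```typescript
// Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint: number): [number, number] {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;
  const highUnit = 0xd800 + (offset >> 10);  // top 10 bits of the offset
  const lowUnit = 0xdc00 + (offset & 0x3ff); // bottom 10 bits of the offset
  return [highUnit, lowUnit];
}

// Format one code unit as decimal, hex and octal, like the hover text does.
function describeUnit(unit: number): string {
  return `${unit} (${unit.toString(16).toUpperCase()} hex, ${unit.toString(8)} octal)`;
}

const [highHalf, lowHalf] = toSurrogatePair(0x10310); // 𐌐
console.log(describeUnit(highHalf)); // 55296 (D800 hex, 154000 octal)
console.log(describeUnit(lowHalf));  // 57104 (DF10 hex, 157420 octal)
```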

Originally posted in https://github.com/firasdib/Regex101/issues/2061#issuecomment-1565036374

firasdib commented 1 year ago

I agree that the hitbox is not ideal; this is a side effect of the new CodeMirror version being used.

The question is how correct we want to be with regard to how the regex is actually interpreted, versus what we see, and I am torn on this.

jhmaster2000 commented 1 year ago

Well, here are some possible solutions I can think of:

1. Only display split surrogates on non-Unicode flavors, and only the actual character on Unicode flavors (as originally suggested at the top of this issue).
2. Same as 1 as the default behavior, but add a setting, "Force Unicode character display on hover of non-Unicode regex flavors", which makes the hover behave as it does on Unicode flavors regardless of flavor. (Off by default.)
3. When a surrogate-pair character is inserted into the regex field, automatically convert it to the split-surrogate form at the text level, so no special hover is needed. This might confuse people unaware of Unicode semantics, who may think it's a bug. It's not an ideal solution and not what I'd recommend; I'm listing it as a possibility since some places take this approach, but generally as a fallback for lack of Unicode display support, which isn't quite the case here.
4. A setting-less alternative to 1+2: keep Unicode flavors as is, but on non-Unicode flavors behave as if the setting from 2 were enabled (thus displaying the proper character as if it were a Unicode flavor). Because it isn't a Unicode flavor, attach a secondary message to the hover explaining how that regex flavor will actually interpret it, for example:

Text - the character 𐌐 with index 66320₁₀ (10310₁₆ or 201420₈)
-----------------------------------------------
This regex flavor doesn't support Unicode, so it will match this character as � with index 55296₁₀ (D800₁₆ or 154000₈) followed by � with index 57104₁₀ (DF10₁₆ or 157420₈) literally (case-sensitive)

One problem I can imagine is that this is very long, so it might be too verbose to fit on smaller screens, or to look nice on any screen. A rework of the character hovers could make this option work better by making them more concise, with a design similar to the match hovers in the test string. Proofs of concept below:

- Regular non-special characters (U+FFFF and below, regex-flavor-agnostic)

  Character: K (U+004B)
  ----------------------------
  Case: <Insensitive/Sensitive>
  Matches: Literally
- High non-special characters (above U+FFFF)
  - Non-Unicode regex flavors

    Character: 𐌐 (U+10310)
    ----------------------------
    Case: <Insensitive/Sensitive>
    Matches: � (U+D800) and � (U+DF10)
  - Unicode regex flavors: Same as regular characters, only the U+ would have 5-6 digits

Note that I purposely removed the decimal and octal "indexes" (officially called code points, by the way). Octal is pretty much obsolete, decimal isn't used much either, especially outside the ASCII range, and the three are even less commonly used together. If deemed really necessary, decimal and octal could be added as an option that changes what the parentheses next to the character show. At least, that's how I would do it.
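To make the compact format concrete, here is a rough sketch of how its lines could be generated. Everything in it is hypothetical (the `uPlus` and `hoverLines` names, the parameters, and the exact wording of the "Matches" line); it is an illustration of the proposal, not how Regex101 builds its hovers:

```typescript
// Format a number in U+ notation, padded to at least 4 hex digits.
function uPlus(value: number): string {
  let hex = value.toString(16).toUpperCase();
  while (hex.length < 4) hex = "0" + hex;
  return "U+" + hex;
}

// Build the hover lines for one displayed character (assumed to be either a
// single UTF-16 code unit or one well-formed surrogate pair).
function hoverLines(ch: string, unicodeFlavor: boolean, caseSensitive: boolean): string[] {
  const first = ch.charCodeAt(0);
  const second = ch.length > 1 ? ch.charCodeAt(1) : NaN;
  const isPair =
    first >= 0xd800 && first <= 0xdbff && second >= 0xdc00 && second <= 0xdfff;
  // Recombine the pair into the code point shown next to the character.
  const codePoint = isPair
    ? 0x10000 + ((first - 0xd800) << 10) + (second - 0xdc00)
    : first;
  // Only non-Unicode flavors need to spell out the two surrogates.
  const matches = isPair && !unicodeFlavor
    ? `${uPlus(first)} and ${uPlus(second)}`
    : "Literally";
  return [
    `Character: ${ch} (${uPlus(codePoint)})`,
    `Case: ${caseSensitive ? "Sensitive" : "Insensitive"}`,
    `Matches: ${matches}`,
  ];
}

console.log(hoverLines("K", false, true).join("\n"));
console.log(hoverLines("𐌐", false, true).join("\n"));
```

Under these assumptions, the second call produces the lines "Character: 𐌐 (U+10310)", "Case: Sensitive" and "Matches: U+D800 and U+DF10", while passing `true` for `unicodeFlavor` yields "Matches: Literally" instead.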

All the solutions above also assume that the hover hitbox behaves consistently, as it does on Unicode regex flavors, regardless of the flavor. That means no 1px trailing hitbox for the second surrogate, as it shouldn't be necessary with one of these fixes in place.

Personally, I prefer solution 4 with the full hover format rework, as it's the most robust one (in my opinion). It's also the most complex one to implement, so I brought up 1, 2 and 3 as simpler alternatives.

firasdib commented 1 year ago

I believe this is a generally complex problem to solve.

Given /./g and the string 𐌐, the following matches are acquired: [ "\ud800", "\udf10" ]

However, visually, on your screen, you only see one character, namely 𐌐. The question then becomes, should you see what you input, or what you're matching? The match information will show the output directly from the engine, which will show the surrogates, but CodeMirror has no notion of this, and will show the combined character, because that's all it knows.
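The mismatch between what is displayed and what is matched can be reproduced directly in JavaScript/TypeScript. This is only a small illustration using the built-in `RegExp`, with variable names chosen for the example:

```typescript
const sample = "𐌐"; // U+10310, stored as the two UTF-16 code units D800 DF10

// Without the u flag, "." consumes one code unit at a time.
const unitMatches = sample.match(new RegExp(".", "g")) || [];
// With the u flag, "." consumes one whole code point.
const pointMatches = sample.match(new RegExp(".", "gu")) || [];

console.log(sample.length);       // 2
console.log(unitMatches.length);  // 2 (the lone surrogates \ud800 and \udf10)
console.log(pointMatches.length); // 1 (the full character)
```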

The same problem goes for the regex 𐌐?, which is technically 3 characters (UTF-16 code units), but CodeMirror is only aware of 2, and thus can't color it in properly.
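A short sketch of why this matters for 𐌐? specifically: without the u flag the quantifier applies only to the second surrogate, so the two modes accept different inputs. The variable names are illustrative only:

```typescript
// The same pattern text, compiled without and with the u flag.
const codeUnitMode = new RegExp("^𐌐?$");       // parsed as \ud800 followed by \udf10?
const codePointMode = new RegExp("^𐌐?$", "u"); // parsed as an optional U+10310

console.log(codeUnitMode.test("\ud800"));  // true: only the low surrogate is optional
console.log(codeUnitMode.test(""));        // false: the high surrogate is still required
console.log(codePointMode.test("\ud800")); // false: a lone surrogate is not 𐌐
console.log(codePointMode.test(""));       // true: the whole character is optional
```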

I don't think there can be a good solution until someone makes CodeMirror surrogate-pair-aware.

firasdib commented 1 year ago

Had another moment to think about it. The only solution I can think of is to inject U+200C (zero-width non-joiner) in the text when non-Unicode mode is enabled, but it generally seems like a bad idea.

Treating all surrogate pairs as one character will mess up the explanations of expressions such as 𐌐?.

jhmaster2000 commented 1 year ago

Is the side explanation view tied to the hovers? I don't think there's anything the side explanation panel needs to change, or even could change, with regard to Unicode. All my suggestions were made with only the hover displays in mind.

firasdib commented 1 year ago

They are all connected, as they use the same source of data. The regex lexer is what supplies the entire interface with the information you see, including the hover.

firasdib commented 1 year ago

I have an idea for solving this, which will make support for this complete, but requires a lot of work and database changes. I will try to get to it soon, but it might be a while before I have a chance to test it all and deploy the changes.

firasdib commented 1 year ago

This is now live.

Better display of Unicode surrogate pairs on non-Unicode regex flavors (Feature Request) #2082