enso-org / enso

Enso Analytics is a self-service data prep and analysis platform designed for data teams.
https://ensoanalytics.com
Apache License 2.0

Code editing becomes unstable if there are emojis like 🗝️ or 😀 but not 1️⃣ #10678

Open vitvakatu opened 3 months ago

vitvakatu commented 3 months ago

The issue was initially found while investigating a documentation panel bug, but it also happens in the regular code editor. It seems to be caused by incorrect handling of Unicode inside the engine. An example that reproduces it in our tests:

diff --git a/lib/scala/text-buffer/src/test/scala/org/enso/text/editing/EditorOpsSpec.scala b/lib/scala/text-buffer/src/test/scala/org/enso/text/editing/EditorOpsSpec.scala
index 38cdb26b65..0949215686 100644
--- a/lib/scala/text-buffer/src/test/scala/org/enso/text/editing/EditorOpsSpec.scala
+++ b/lib/scala/text-buffer/src/test/scala/org/enso/text/editing/EditorOpsSpec.scala
@@ -11,7 +11,7 @@ class EditorOpsSpec extends AnyFlatSpec with Matchers with EitherValues {

   "An editor" should "be able to apply multiple diffs" in {
     //given
-    val signaturePosition = Range(Position(2, 12), Position(2, 13))
+    val signaturePosition = Range(Position(2, 13), Position(2, 14))
     val signatureDiff     = TextEdit(signaturePosition, "arg")
     val bodyPosition      = Range(Position(2, 23), Position(2, 24))
     val bodyDiff          = TextEdit(bodyPosition, "arg")
@@ -21,7 +21,7 @@ class EditorOpsSpec extends AnyFlatSpec with Matchers with EitherValues {
     //then
     result.map(_.toString) mustBe Right("""
                                           |main =
-                                          |    apply = arg f -> f arg
+                                          |    apply = 🗝arg f -> f arg
                                           |    adder = a b -> a + b
                                           |    plusOne = apply (f = adder 1)
                                           |    result = plusOne 10
diff --git a/lib/scala/text-buffer/src/test/scala/org/enso/text/editing/TestData.scala b/lib/scala/text-buffer/src/test/scala/org/enso/text/editing/TestData.scala
index 75a76e0d79..c197d343f5 100644
--- a/lib/scala/text-buffer/src/test/scala/org/enso/text/editing/TestData.scala
+++ b/lib/scala/text-buffer/src/test/scala/org/enso/text/editing/TestData.scala
@@ -7,7 +7,7 @@ object TestData {
   val code =
     """
       |main =
-      |    apply = v f -> f v
+      |    apply =🗝v f -> f v
       |    adder = a b -> a + b
       |    plusOne = apply (f = adder 1)
       |    result = plusOne 10

My idea is to add the emoji https://unicodeplus.com/U+1F5DD to the code and then insert some text directly after it. The emoji takes 2 UTF-16 code units (2x2 bytes). I changed the offset the way we interpret it in the GUI: since the emoji takes 2 code units and I replaced a single space character with it, I shifted the edit by one code unit (13 instead of 12). However, the engine code seems to count this emoji as a single unit, so the text gets inserted one character later, at an incorrect index. So it seems to me that the engine code works on Unicode code points, not UTF-16 code units.
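The mismatch described above can be reproduced with plain Java strings. This is a minimal sketch, not the engine's code: the class and method names are hypothetical, and only standard `java.lang.String` calls are used. It applies the same `13..14` range to the modified test line twice, once reading the range as UTF-16 code units (the GUI's interpretation) and once as code points (the suspected engine interpretation).

```java
final class CodeUnitsVsCodePoints {
    // The "apply" line from TestData with the space replaced by U+1F5DD (OLD KEY).
    // The emoji is written as a surrogate pair so the source file stays ASCII.
    static final String LINE = "    apply =\uD83D\uDDDDv f -> f v";

    /** Replaces the range [start, end), with both bounds counted in UTF-16 code units. */
    static String replaceByCodeUnits(String s, int start, int end, String text) {
        return s.substring(0, start) + text + s.substring(end);
    }

    /** Replaces the range [start, end), with both bounds counted in Unicode code points. */
    static String replaceByCodePoints(String s, int start, int end, String text) {
        // offsetByCodePoints translates a code point count into a code unit index.
        int from = s.offsetByCodePoints(0, start);
        int to = s.offsetByCodePoints(0, end);
        return s.substring(0, from) + text + s.substring(to);
    }

    public static void main(String[] args) {
        System.out.println(LINE.length());                         // code units
        System.out.println(LINE.codePointCount(0, LINE.length())); // code points
        System.out.println(replaceByCodeUnits(LINE, 13, 14, "arg"));
        System.out.println(replaceByCodePoints(LINE, 13, 14, "arg"));
    }
}
```

With the code-unit reading, the `v` after the key is replaced by `arg`. With the code-point reading, the same range lands one character later and replaces the following space instead.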

Internal discussion available at https://discord.com/channels/401396655599124480/1266028137175584768/1266028139037982720

hubertp commented 3 months ago

I updated the description (better to have tickets self-contained).

4e6 commented 3 months ago

The old key emoji 🗝️ (\uD83D\uDDDD) posted here has a variation selector \uFE0F that tells how to render the value. Java treats the string \uD83D\uDDDD\uFE0F as two code points: \uD83D\uDDDD and \uFE0F. If you make a text edit that adds foo after the key (at Position(0,1)), the result will be \uD83D\uDDDDfoo\uFE0F.
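A small sketch of that behaviour, using only standard `java.lang` calls. The class and method names are made up for illustration. It shows the three-code-unit, two-code-point layout of the string, and how an insert at code point offset 1 separates the key from its variation selector.

```java
final class VariationSelectorDemo {
    // U+1F5DD (OLD KEY) as a surrogate pair, followed by U+FE0F (VARIATION SELECTOR-16).
    static final String KEY = "\uD83D\uDDDD\uFE0F";

    /** Inserts text at an offset counted in Unicode code points. */
    static String insertAtCodePoint(String s, int codePointOffset, String text) {
        int unitOffset = s.offsetByCodePoints(0, codePointOffset);
        return new StringBuilder(s).insert(unitOffset, text).toString();
    }

    public static void main(String[] args) {
        System.out.println(KEY.length());                        // 3 code units
        System.out.println(KEY.codePointCount(0, KEY.length())); // 2 code points
        // The selector ends up after "foo", detached from the key it modified.
        String edited = insertAtCodePoint(KEY, 1, "foo");
        System.out.println(edited.equals("\uD83D\uDDDDfoo\uFE0F")); // true
    }
}
```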

The fix would be to strip variation selectors from the user input. Although the result won't be the exact same byte sequence, it will look the same.
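One way the stripping idea could look, as a hedged sketch rather than the fix that was eventually implemented. The helper names are hypothetical. The ranges U+FE00..U+FE0F and U+E0100..U+E01EF are the Unicode variation selector blocks. Whether the stripped text renders identically depends on the font and platform, so this is an assumption worth checking for sequences such as keycaps.

```java
final class VariationSelectors {
    /** True for VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF). */
    static boolean isVariationSelector(int codePoint) {
        return (codePoint >= 0xFE00 && codePoint <= 0xFE0F)
            || (codePoint >= 0xE0100 && codePoint <= 0xE01EF);
    }

    /** Returns a copy of the input with every variation selector removed. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> !isVariationSelector(cp))
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String key = "\uD83D\uDDDD\uFE0F";
        // After stripping, only the surrogate pair of U+1F5DD remains.
        System.out.println(strip(key).equals("\uD83D\uDDDD")); // true
    }
}
```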

4e6 commented 3 months ago

I'll see if I can use the icu UTF16 to detect the emoji position properly.

(Note to self) Check how JS treats the \uD83D\uDDDD\uFE0F string.

enso-bot[bot] commented 3 months ago

Dmitry Bushev reports a new STANDUP for yesterday (2024-08-01):

Progress: [10678] Started working on the issue. Managed to find the case when an emoji with a modifier is treated as two symbols in a Java string. Created a test case reproducing the issue. It should be finished by 2024-08-07.

Next Day: Next day I will be working on the #10678 task. Continue working on the task

enso-bot[bot] commented 3 months ago

Dmitry Bushev reports a new STANDUP for today (2024-08-02):

Progress: [10678] Playing with the ICU library to see if it can correctly detect the position of an emoji with a modifier. [10735] Updated the SBT build to accommodate the changed npm configuration. Fixed the ydoc-server-polyglot esbuild. It should be finished by 2024-08-07.

Next Day: Next day I will be working on the #10678 task. Continue working on the task

enso-bot[bot] commented 2 months ago

Dmitry Bushev reports a new STANDUP for yesterday (2024-08-05):

Progress: [10678] Playing with ICU iteration capabilities to detect emojis correctly. Implemented a draft version of an iterator capable of iterating over emojis. Started testing. It should be finished by 2024-08-07.

Next Day: Next day I will be working on the #10678 task. Continue working on the task

enso-bot[bot] commented 2 months ago

Dmitry Bushev reports a new STANDUP for today (2024-08-06):

Progress: [10678] Looking into the string implementation in JS. Negotiated with the GUI team to work with code units and not with code points. Started updating the text editing logic to support code units. It should be finished by 2024-08-07.

Next Day: Next day I will be working on the #10678 task. Continue working on the task

enso-bot[bot] commented 2 months ago

Dmitry Bushev reports a new STANDUP for yesterday (2024-08-07):

Progress: [10678] Implemented text editor support for ranges measured in Unicode code units. Updated tests. Created the PR. It should be finished by 2024-08-07.

Next Day: Next day I will be working on the #10678 task. Continue working on the task
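For readers who want a concrete picture of ranges measured in code units, here is a minimal Java sketch. It is not the code from the PR: `CodeUnitTextEdit`, `toOffset`, and `applyEdit` are hypothetical names, and it assumes zero-based `(line, character)` positions with `\n` line separators, mirroring `Position` and `Range` in the test above.

```java
final class CodeUnitTextEdit {
    /**
     * Converts a zero-based (line, character) position into an index into text.
     * The character component is counted in UTF-16 code units, which is also
     * the unit of Java's String indices, so no code point translation is needed.
     */
    static int toOffset(String text, int line, int character) {
        int offset = 0;
        for (int i = 0; i < line; i++) {
            int newline = text.indexOf('\n', offset);
            if (newline < 0) {
                throw new IndexOutOfBoundsException("No such line: " + line);
            }
            offset = newline + 1;
        }
        return offset + character;
    }

    /** Replaces the range between the two positions with the replacement text. */
    static String applyEdit(String text,
                            int startLine, int startChar,
                            int endLine, int endChar,
                            String replacement) {
        int from = toOffset(text, startLine, startChar);
        int to = toOffset(text, endLine, endChar);
        return text.substring(0, from) + replacement + text.substring(to);
    }

    public static void main(String[] args) {
        String code = "\nmain =\n    apply =\uD83D\uDDDDv f -> f v\n    adder = a b -> a + b";
        // Same range as in the updated test: Range(Position(2, 13), Position(2, 14)).
        System.out.println(applyEdit(code, 2, 13, 2, 14, "arg"));
    }
}
```

Because the GUI (JavaScript strings) and the JVM both index strings by UTF-16 code units, agreeing on code units means positions can be passed through without conversion on either side.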