handle ZWJ and emoji sequences, don't break identifiers within graphemes

stevengj commented 8 months ago

Closes https://github.com/JuliaLang/julia/issues/40071 — never break an identifier within a grapheme boundary. (Identifiers and graphemes may still only start within the pre-existing allowed set of characters.)

This allows us to accept identifiers like 🏳️‍🌈 == 🏳️‍[ZWJ]🌈 which contain the U+200D "Zero-width joiner" (ZWJ) character, which was not previously allowed since it's in category Cf (Other, format). However, to prevent incomplete graphemes ending with an invisible ZWJ character, we disallow ZWJ at the end of an identifier; similarly for U+200C "Zero-width non-joiner" (ZWNJ).

stevengj commented 8 months ago

I'm not sure if we also need to update the Scheme parser as well? A similar patch for the Scheme parser would be

diff --git a/src/flisp/julia_extensions.c b/src/flisp/julia_extensions.c
index f29e397275..fa04339b5d 100644
--- a/src/flisp/julia_extensions.c
+++ b/src/flisp/julia_extensions.c
@@ -178,6 +178,7 @@ JL_DLLEXPORT int jl_op_suffix_char(uint32_t wc)
 static int never_id_char(uint32_t wc)
 {
      utf8proc_category_t cat = utf8proc_category((utf8proc_int32_t) wc);
+     if (wc == 0x200c || wc == 0x200d) return 0; // ZWNJ/ZWJ is allowed in emoji, despite being in category Cf
      return (
           // spaces and control characters:
           (cat >= UTF8PROC_CATEGORY_ZS && cat <= UTF8PROC_CATEGORY_CS) ||
@@ -329,6 +330,8 @@ value_t fl_accum_julia_symbol(fl_context_t *fl_ctx, value_t *args, uint32_t narg
     if (!iscprim(args[0]) || ((cprim_t*)ptr(args[0]))->type != fl_ctx->wchartype)
         type_error(fl_ctx, "accum-julia-symbol", "wchar", args[0]);
     uint32_t wc = *(uint32_t*)cp_data((cprim_t*)ptr(args[0])); // peek the first character we'll read
+    uint32_t prev_wc = wc;
+    utf8proc_int32_t grapheme_state = 0;
     ios_t str;
     int allascii = 1;
     ios_mem(&str, 0);
@@ -345,9 +348,10 @@ value_t fl_accum_julia_symbol(fl_context_t *fl_ctx, value_t *args, uint32_t narg
         }
         allascii &= (wc <= 0x7f);
         ios_pututf8(&str, wc);
+        prev_wc = wc;
         if (safe_peekutf8(fl_ctx, s, &wc) == IOS_EOF)
             break;
-    } while (jl_id_char(wc));
+    } while (jl_id_char(wc) || !utf8proc_grapheme_break_stateful(prev_wc, wc, &grapheme_state));
     ios_pututf8(&str, 0);
     return symbol(fl_ctx, allascii ? str.buf : normalize(fl_ctx, str.buf));
 }

although this doesn't implement the rule of forbidding ZWJ at the end of an identifier.

c42f commented 8 months ago

Cool this looks sensible. I worry about the extra cost of calling into the C code per character for such a relatively hot (presumably) code path. What does the script in test/benchmark.jl say about performance impact?

Personally I'm not worried about keeping the scheme parser perfectly up to date. Primarily it only needs to be able to parse Base until we figure out the bootstrapping story and can eventually delete it.

(Side note... we really should bring Base.is_id_char() into JuliaSyntax at some point --- this library does aim to be installable as a normal package on both old and new versions of Julia and behave the same everywhere...)

codecov[bot] commented 8 months ago

Codecov Report

Merging #372 (75a1fc2) into main (1b048aa) will decrease coverage by 0.01%. Report is 2 commits behind head on main. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #372      +/-   ##
==========================================
- Coverage   96.58%   96.57%   -0.01%     
==========================================
  Files          14       14              
  Lines        4160     4179      +19     
==========================================
+ Hits         4018     4036      +18     
- Misses        142      143       +1

Files	Coverage Δ
src/tokenize.jl	`99.08% <100.00%> (+0.01%)`	:arrow_up:

... and 2 files with indirect coverage changes

stevengj commented 8 months ago

Here are the results of test/benchmark.jl:

main branch, minimum times:

ParseStream: 205.944 ms
GreenNode: 214.803 ms
SyntaxNode: 403.870 ms
Expr: 472.283 ms
flisp: 3.644 s

sgj/ZWJ branch:

ParseStream: 179.254 ms
GreenNode: 241.707 ms
SyntaxNode: 427.242 ms
Expr: 496.298 ms
flisp: 3.652 s

It looks like a 15% slowdown on ParseStream, a 10% slowdown on GreenNode, and a 5% slowdown on SyntaxNode?

stevengj commented 8 months ago

As a more realistic data point, here is the cost to parse all of base (Julia 1.9.2) with the following code:

using JuliaSyntax, BenchmarkTools
include(dirname(pathof(JuliaSyntax)) * "/../test/test_utils.jl")

function bench_parse_all_in_path(basedir)
    for filepath in find_source_in_path(basedir)
        text = read(filepath, String)
        stream = ParseStream(text)
        parse!(stream)
        ex = build_tree(Expr, stream, filename=filepath)
    end
end

p = joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "base")
p = isdir(p) ? p : joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "src", "base") 

@btime bench_parse_all_in_path($p)

which gives:

main branch: 543.924 ms (2117659 allocations: 139.46 MiB)
sgj/ZWJ branch: 558.836 ms (2719985 allocations: 148.77 MiB) i.e. a slowdown of < 3%.

My conclusion is that the slowdown in realistic workloads will be negligible, especially since parsing is usually only a very small portion of load time.

c42f commented 8 months ago

I don't love these slowdowns - we could live with them of course, but it's very significant for such a small change and it does show this is a relatively hot code path (no surprise I guess).

Luckily there's probably a super simple optimization: if you throw in an isascii() before you hit the slow path of isgraphemebreak!() it's probably almost free for most code?

c42f commented 8 months ago

Another thing - while 3% is somewhat insignificant in building Base, and the slowdown for GreenNode and SyntaxNode may seem irrelevant it's not exactly: it's quite relevant to surface-level tooling which works across a whole registry - for example package analyzer.

Keno commented 8 months ago

Just hardcore the answer for ascii characters? That feels like is should solve 99% of the perf issues.

stevengj commented 8 months ago

Following the suggestion by @c42f and @Keno, I added an optimization for the common case of ASCII identifiers. The minimum times are now (on a different computer than above):

sgj/ZWJ branch:

ParseStream: 114.662 ms ms
GreenNode: 134.884 ms
SyntaxNode: 247.248 ms
Expr: 300.598 ms
flisp: 2.306 s

main branch:

ParseStream: 115.894 ms
GreenNode: 136.842 ms
SyntaxNode: 252.208 ms
Expr: 304.795 ms
flisp: 2.342 s

Voilá, no performance penalty! (It's actually very slighly faster, since we no longer call is_identifier_char at all for ASCII chars, avoiding a lookup of the Unicode category.)

stevengj commented 8 months ago

I suppose a julia/NEWS.md item will be added when the JuliaSyntax version is bumped in the julia repo? Is there a place to make a note of that in the meantime?

JuliaLang / JuliaSyntax.jl

handle ZWJ and emoji sequences, don't break identifiers within graphemes #372

Codecov Report