JuliaLang / JuliaSyntax.jl

The Julia compiler frontend
Other
267 stars 32 forks source link

re-implement Base Unicode features using utf8proc_jll #381

Open stevengj opened 8 months ago

stevengj commented 8 months ago

Closes #377.

stevengj commented 8 months ago

(It's maybe 1% faster on tests/benchmark.jl compared to main.)

codecov[bot] commented 8 months ago

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (cfdac5f) 96.57% compared to head (1506668) 96.64%. Report is 8 commits behind head on main.

:exclamation: Current head 1506668 differs from pull request most recent head c820724. Consider uploading reports for the commit c820724 to get more accurate results

Files Patch % Lines
src/unicode.jl 98.11% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #381 +/- ## ========================================== + Coverage 96.57% 96.64% +0.06% ========================================== Files 14 15 +1 Lines 4179 4198 +19 ========================================== + Hits 4036 4057 +21 + Misses 143 141 -2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

stevengj commented 8 months ago

Looks like utf8proc_jll version 2.9.0 doesn't work with Julia < 1.6 due to changes in BinaryBuilder or something?

How badly do you want to support ancient Julia versions?

c42f commented 8 months ago

How badly do you want to support ancient Julia versions

Personally? I'd prefer to keep supporting them but I don't have a super strong opinion as long as we support 1.6 and above. But tooling maintainers did want it to work on older versions I think. @davidanthoff @pfitzseb ?

If we can't do it easily, can we continue to support older versions by calling into Base.is_id_char() for now?

stevengj commented 8 months ago

If we can't do it easily, can we continue to support older versions by calling into Base.is_id_char() for now?

That will silently give different results on older versions of Julia, which also seems bad?

(If you want to go that route, it would be easier to simply lower the version requirement for utf8proc_jll?)

Though I think JLL's were not supported at all before Julia 1.3 or so? I really don't see how to maintain compatibility with Julia 1.0 without keeping a separate branch around with a different Project.toml that omits the utf8proc_jll requirement.

c42f commented 8 months ago

I really don't see how to maintain compatibility with Julia 1.0 without keeping a separate branch around with a different Project.toml that omits the utf8proc_jll requirement.

Oof. Yes of course - the lack of optional dependencies in early Julia versions makes this really awkward. Hmm.

Now I'm looking at porting the necessary parts of utf8proc.c to Julia. The library doesn't look large or scary and I'm tempted to give it a go.

pfitzseb commented 8 months ago

For the VS Code extension, any non-Julia dependencies are a little annoying because we're trying to vendor everything. If porting utf8proc to Julia is feasible, then that'd be ideal I think.

stevengj commented 8 months ago

Now I'm looking at porting the necessary parts of utf8proc.c to Julia. The library doesn't look large or scary and I'm tempted to give it a go.

Yes, it shouldn't be too bad if you write a little script to convert the data tables in utf8proc_data.c (from C arrays to Julia arrays), something like:

utf8proc_data = read("utf8proc_data.c", String)
jl_data = replace(utf8proc_data,
    # arrays
    ",\n"=> ", ", "};"=>"]",
    r"static const [A-Za-z0-9_]+ ([A-Za-z0-9_]+)\[\] = \{\n" => s"const \1 = [ ",

   # utf8proc_property constructors
   "{" => "Property(", "}," => "),",

    # constants
    "UINT16_MAX" => typemax(UInt16))
write("utf8proc_data.jl", jl_data)

along with copying the definitions of UTF8PROC_CATEGORY_CC etcetera.

Note that Julia does not support bitfield structs, so you'd have to emulate them for re-defining utf8proc_property_t unless you want to waste a lot of memory for full-width fields.

That being said, this seems like a lot of effort (and an ongoing maintenance headache since utf8proc has to be updated every year for new Unicode versions) for very little gain. You don't really gain much from Julia's flexibility since the unicode stuff is all going to be concretely typed, and it's not clear to me that there is much to be gained (for the functions we use) by e.g. inlining or constant propagation. It's an awful lot of hassle to go to in order to avoid "a little annoyance" for vscode, and to make it easier to support julia < 1.6 versions that fewer and fewer people care about.

pfitzseb commented 8 months ago

It's an awful lot of hassle to go to in order to avoid "a little annoyance" for vscode, and to make it easier to support julia < 1.6 versions that fewer and fewer people care about.

That's fair.

"A little annoyance" was me understating the issue, because we don't currently have a good way to vendor binary dependencies. That doesn't mean that we can't figure it out or that anyone should go do the work of porting utf8proc, of course.

stevengj commented 8 months ago

Maybe there is a straightforward way to make JuliaSyntax optionally use the version of utf8proc that is linked into julia itself?

This would presumably help the "vendoring" (whatever that is) situation with vscode, and it would also be good for the version of JuliaSyntax that is bundled with julia (so that it doesn't load two identical versions of utf8proc into memory).

Does some Pkg expert (e.g. @vchuravy or @IanButterworth) know a good way to do this that would be friendly with the way JuliaSyntax will be bundled with Julia? Preferences.jl?

c42f commented 8 months ago

"A little annoyance" was me understating the issue

Yea I've have a feeling it could be quite a problem for you. It's troubling that porting (parts of?) utf8proc might actually be the simplest way out of this situation. It seems like taking extreme measures.

Oh, but if you're vendoring JuliaSyntax and other libs, you're already messing with their Project.toml's, yes? So in principle one can modify those to not depend on utf8proc_jll on older Julia VERSION? Or is that not true? Is the vendored stuff for the VSCode plugin shared between Julia versions?

this seems like a lot of effort (and an ongoing maintenance headache since utf8proc has to be updated every year for new Unicode versions)

Mm it seems like buying into integration, testing, bugfixing and ongoing maintenance which is a chunk of duplicated effort :( The initial porting is a lot less scary


For vendoring into Base I think we're in an ok situation - in that case JuilaSyntax isn't a stdlib, it's actually a submodule of Base. So we don't depend on the Project.toml at all. We could branch on something - at toplevel I guess - to detect that situation and use Julia's internal utf8proc in that case.

davidanthoff commented 8 months ago

if you're vendoring JuliaSyntax and other libs, you're already messing with their Project.toml's

No, we are not messing with anything inside the packages that we are using in the LS, we just ship them as part of the extension, and we have a different project/manifest combo for each Julia version for the entire project, but the packages themselves are left as-is. We keep some packages on older versions on older Julia versions, but that really only works if their public API doesn't change...

Also, just for background: this is clearly a major pain point... My hope is that at some point we will get support for statically compiling the LS in Julia. At that point we ship a binary version of the LS with the extension, and we could drop support for running the LS on older Julia versions. I would still want it to be able to analyze code that targets old Julia versions, but overall that is a much, much simpler problem. So my current plan is to continue to go through all these hoops to support running on old Julia versions and hope that the static compile story materializes at some point...

c42f commented 7 months ago

Just FYI, I've not forgotten this. I'm currently working on porting parts of libutf8proc. I've decided to start with porting the data_generation.rb script there from ruby to Julia. That will familiarize me with what's going on a bit in that library. If nothing else comes of this at least I hope the libutf8proc table generation will be cleaner and not depend on both Julia and ruby :)

KristofferC commented 7 months ago

One thing I don't understand.

c42f commented 7 months ago

So how is JuliaSyntax made independent of the julia version by this PR?

JuliaSyntax is vendored into the sysimage as part of Base, it's not a stdlib. In that context, it's certainly dependent on the version of utf8proc shipped with Julia (and it wouldn't use utf8proc_jll - it'd use the builtin library).

However, it's also separately installable as a normal package for use by tooling. In this use case it'd be independent of Julia's VERSION.

As a non-stdlib, the vendored copy's only public API is Meta.parse()

KristofferC commented 7 months ago

and it wouldn't use utf8proc_jll - it'd use the builtin library).

Okay, that answers it.

c42f commented 5 months ago

fyi I'm still plugging away at the upstream implications of a pure-Julia unicode library...

https://github.com/c42f/UnicodeNext.jl

I think this could be generally useful though it does feel a bit marginal in terms of payoff for the effort :woman_shrugging:

c42f commented 5 months ago

Ok, I've pushed an update here which makes this PR use my experimental UnicodeNext library.

Current issues:

Thoughts? This would