RazrFalcon / rustybuzz

A complete harfbuzz's shaping algorithm port to Rust
MIT License

Deprecation #74

Closed RazrFalcon closed 4 months ago

RazrFalcon commented 10 months ago

@laurmaedje Hi! I'm giving up on this project. I have no further plans to work on it. I know you're using it in typst, so it probably affects you. I plan to either archive it or pass it to someone else.

As for resvg, which is the reason this project exists, I haven't decided yet. But basically I have a "choice" of using this deprecated version for now, switching to harfbuzz bindings or trying out swash (I'm very skeptical).

Keavon commented 10 months ago

Sad to see this :( We use it in Graphite currently, and will depend on it even more when we add more typesetting and desktop publishing related features.

I know COSMIC Text is also built upon rustybuzz so it might be worth pinging @jackpot51 as well.

I noticed you called out Swash but didn't mention Allsorts. The latter is still maintained. I'm not very familiar with either of them in terms of their maturity and features, but I was wondering if there's a reason you didn't mention Allsorts or comment on its status by comparison to Swash.

notgull commented 10 months ago

I would be willing to take over maintenance for bug fixes (although I probably won't add any features or make any significant optimizations). As cosmic-text and theo also both depend on this crate, I would like to see this maintained, and I'm sure there are many others who would as well.

@RazrFalcon It seems this crate is in the same boat as some of your other crates, like tiny-skia: used extensively throughout the ecosystem yet you no longer have the time/energy to maintain it. How would you feel about creating a GH organization, transferring some of your repos there, and then adding some interested volunteers to maintain those crates? I would definitely be happy to participate given the option.

Keavon commented 10 months ago

This news might also be worth posting to Reddit. It's an invaluable part of the ecosystem and that could garner contributors. But I definitely see how...

Since v2.7.1, harfbuzz received 5813 commits.

...could be a pretty daunting proposition to keep up with.

RazrFalcon commented 10 months ago

@Keavon cosmic-text also supports swash, so they should be fine. Both shapers are basically dead anyway.

I do mention allsorts at the end of the readme. It should be pretty good, but it lacks some features (variable fonts for one) and focuses on subsetting.

In general, it's extremely hard to compare various shaping libraries. I guess running one against harfbuzz's test suite is the only option right now, because it's the gold standard.

This news might also be worth posting to Reddit.

@notgull I need someone who would be back-porting harfbuzz changes. Which no one probably will. Otherwise there is no point.

The problem with porting harfbuzz in the first place, with all due respect to the authors, is that it uses a very complex C++ subset. And despite the fact that I was a C++ developer myself for almost 10 years, I simply cannot read it. Templates with CRTP, custom iterators, custom std, macros, and so on make it extremely hard to follow. Sure, I'm a C++ hater, but I still cannot read C++14 and newer. I just physically can't. This is why I've eventually switched to Rust and Swift. You basically have to rewrite harfbuzz into a simple/sane C++ first, spending days with a debugger, and then rewrite it to Rust. This is what I did! I wasn't porting harfbuzz, but rather a harfbuzz fork. And don't get me started on ragel...

Just to illustrate, it took me 6-8 months to port harfbuzz and 2-3 to port Skia (tiny-skia). And the amount of code is roughly the same. Skia's codebase is the best C++ codebase I have ever seen, similar to Qt. It has its quirks, but it's still very manageable and intuitive.

It's an invaluable part of the ecosystem

Well, I consider it a failure. I knew it was a bad idea even while writing it. Fun fact, I technically gave up 2/3 of the way. If not for @laurmaedje it would never have been finished. The way harfbuzz and rustybuzz are written is highly different. It's not a C++ port to Rust, but rather a harfbuzz core algorithm rewrite in Rust. Which makes back-porting new changes extremely difficult.

RazrFalcon commented 10 months ago

@notgull

It seems this crate is in the same boat as some of your other crates, like tiny-skia

Well, all of my crates are sort of dead. And there are many reasons for that. One of which is a complete lack of time. But I can still accept patches, which no one really sends. So... I do have time for maintaining them, but not for developing them. Maybe after a couple of years I would have some free time.

The issue with rustybuzz in particular is that it must be updated/synced.

bluebear94 commented 10 months ago

Hi, I’m the developer of Caxton and a contributor to Typst. I’m also willing to help in any way that I can.

notgull commented 10 months ago

Hi, I’m the developer of Caxton and a contributor to Typst. I’m also willing to help in any way that I can.

I think the first action item would be to port this code to match later versions of Harfbuzz. The first step would probably be to sync up with Harfbuzz v2.7.4, aka #37. Here's a list of all of the commits between v2.7.1 (which is what the current version of rustybuzz is built to match) and v2.7.4: https://github.com/harfbuzz/harfbuzz/compare/2.7.1...2.7.4

Another important thing to do would be to make the current code more maintainable. There's a few instances where machine-generated parser code has been hand-translated to Rust, and I'd like to port that code to use code that more accurately matches the Ragel code that it was generated from. This way it's much easier to match what the Harfbuzz team is doing.

I would normally reach for something like nom in this use case, but @RazrFalcon has a policy to use as few external dependencies as possible in their projects. Therefore it would probably be best to write a mini-parser-library inside of rustybuzz and then rewrite all of the parsing code using that.

At the moment I find myself preoccupied with getting Smol v2.0 out the door. But in a week (hopefully) I'll have freed myself up to focus on this.

RazrFalcon commented 10 months ago

@bluebear94 You're welcome!

Note that rustybuzz already has a couple of fixes from later versions, sort of. For example, this fix is already present. So don't be surprised.

RazrFalcon commented 10 months ago

@notgull

I would normally reach for something like nom in this use case

Why would you need nom? Ragel is a state-machine generator, not parser-generator.
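To illustrate the distinction, here is a minimal hand-written sketch of the kind of state machine that Ragel generates from a regular-language description. The categories, the states, and the toy pattern `Consonant (Halant Consonant)* Vowel?` are made up for illustration; they are not taken from HarfBuzz's actual machines.

```rust
/// Hypothetical character categories; real shapers use many more.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Cat {
    Consonant,
    Halant,
    Vowel,
    Other,
}

#[derive(Clone, Copy)]
enum State {
    Start,
    AfterConsonant,
    AfterHalant,
    AfterVowel,
}

/// Returns the length of the syllable at the start of `cats`, matching the
/// toy pattern `Consonant (Halant Consonant)* Vowel?`. Returns 0 if there
/// is no match.
pub fn syllable_len(cats: &[Cat]) -> usize {
    let mut state = State::Start;
    let mut accepted = 0; // length of the longest accepted prefix so far
    for (i, &c) in cats.iter().enumerate() {
        state = match (state, c) {
            (State::Start, Cat::Consonant) => State::AfterConsonant,
            (State::AfterConsonant, Cat::Halant) => State::AfterHalant,
            (State::AfterHalant, Cat::Consonant) => State::AfterConsonant,
            (State::AfterConsonant, Cat::Vowel) => State::AfterVowel,
            _ => break, // no transition: stop at the last accepting position
        };
        // AfterConsonant and AfterVowel are the accepting states.
        if matches!(state, State::AfterConsonant | State::AfterVowel) {
            accepted = i + 1;
        }
    }
    accepted
}
```

There is no grammar, backtracking, or parse tree here, only a transition function over input categories, which is why a parser-combinator library is not an obvious fit.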

jackpot51 commented 10 months ago

Different folks are going to need different things. For cosmic-text I need a solid shaping solution with limited dependencies and the current version of rustybuzz is exactly that. I don't need it to stay in sync with harfbuzz, except when it changes what a user sees for the better. There are also numerous improvements that could be made to the API for my use case that would likely make backporting harfbuzz changes harder, but would improve performance of shaping and especially font fallback in cosmic-text. Due to having different needs than the other users, I am likely to fork rustybuzz into a component that is part of cosmic-text. The swash shaper would have to be as feature complete as rustybuzz for me to consider using it, a regression in cosmic-text capabilities is not acceptable.

RazrFalcon commented 10 months ago

@jackpot51 The problem with text shaping is that it cannot be "finished". It's a moving target. OpenType/AAT updates, Unicode updates. rustybuzz is a decent Unicode 12 shaper, not a Unicode 15 one. You would probably not see any issues in the near future, but they will crop up. Not to mention actual bugs in rb/hb. That's why rustybuzz must be in sync with harfbuzz. Otherwise there is no point. I don't care about version numbers and other superficial stuff. The newer harfbuzz is actually a better shaper.

As for the API, I know it's meh, but it's what harfbuzz provides as well. You're free to open issues and send patches. Some caching would help with performance as well, but it's hard to do in a safe way, unlike in harfbuzz.

jackpot51 commented 10 months ago

@RazrFalcon if it is the case that none of the independent pure Rust shaping projects will be able to keep up with harfbuzz, I will look into using harfbuzz directly.

RazrFalcon commented 10 months ago

@jackpot51 Honestly, making a harfbuzz wrapper is the best solution for now. Especially since cosmic-text is mainly a Linux-only library, to my understanding. But for something like resvg it's a nightmare, because it needs a C++ toolchain, it breaks wasm (afaik) and adds a lot of bloat, because it includes a lot of things you do not need, including C++ stuff (not a problem when dynamically linking on Linux).

There is really no ideal solution. And text shaping is far too complicated a task to "just write" a shaper.

jackpot51 commented 10 months ago

All of those problems still apply to cosmic-text as well, which supports all major OS platforms as well as no_std usage.

wez commented 10 months ago

FWIW, in wezterm, I vendor in and wrap harfbuzz directly. For my purposes this gives me a consistent version of harfbuzz on all platforms. I don't have any wasm or no_std platforms to support so the C++ toolchain is an acceptable dependency for me.

https://github.com/wez/wezterm/tree/main/deps/harfbuzz "-sys" crate equivalent https://github.com/wez/wezterm/blob/main/wezterm-font/src/hbwrap.rs - slightly higher level bindings

I do something similar with freetype, because there is some inter-dependence between these two libraries.

behdad commented 10 months ago

FWIW HarfBuzz doesn't link to or require libc++ / libstdc++. I don't know how it would break wasm. We definitely ship HB wasm in https://github.com/harfbuzz/harfbuzzjs

wez commented 10 months ago

Also: I agree that trying to track and port harfbuzz changes into a rustybuzz or some other project is a large ongoing undertaking. The harfbuzz folks are actively innovating and improving all the time.

Perhaps an alternate strategy to solve these problems from the perspective of the rust community would be to get a sense of whether there is interest/desire amongst the harfbuzz folks to see harfbuzz itself migrate to being implemented in rust and working with them to incrementally migrate from the inside out? That's also a huge undertaking, but it would be a bounded undertaking with lasting effects.

CryZe commented 10 months ago

Rust's wasm32-unknown-unknown target (currently) can't link to C/C++ because of ABI incompatibilities.

behdad commented 10 months ago

Perhaps an alternate strategy to solve these problems from the perspective of the rust community would be to get a sense of whether there is interest/desire amongst the harfbuzz folks to see harfbuzz itself migrate to being implemented in rust and working with them to incrementally migrate from the inside out? That's also a huge undertaking, but it would be a bounded undertaking with lasting effects.

We are definitely interested. And it's in the scope for the https://github.com/googlefonts/oxidize project. I just would hate to give up some of the conveniences and optimizations of the C++ implementation. Namely the zero-parsing model; giving that up is a nonstarter to me.

cc @rsheeter

wez commented 10 months ago

We are definitely interested. And it's in the scope for the https://github.com/googlefonts/oxidize project.

Great to hear! I have some experience in incremental migration to rust from my time in a FAANG, and also with using harfbuzz's API. I'm interested to see that work progress and perhaps even participate... if there's potential to get funded to do that, that will help as well :)

Namely the zero-parsing model

Can you clarify what you mean by that? Is that essentially memory mapping / casting buffers as the associated structs, or deferred/lazy parsing for certain parts of the data?

(happy to relocate this discussion to the oxidize project if it feels like we're getting too far off-topic from the fate of rustybuzz)

behdad commented 10 months ago

Namely the zero-parsing model

Can you clarify what you mean by that? Is that essentially memory mapping / casting buffers as the associated structs,

Yes. And relying on operator overloading to do byte-order swapping...

wez commented 10 months ago

I see no reason why that couldn't be made to work in Rust, could either be done via some newtype wrappers or in some cases could derive via a proc macro accessor functions that both could handle the byteswapping on deref/access and so on.
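As a rough sketch of the newtype idea (not code from HarfBuzz, rustybuzz, or any existing crate), a big-endian field can be stored as raw bytes and byte-swapped only when it is read. The `U16Be` and `VersionView` names and the `major`/`minor` layout are hypothetical.

```rust
/// A 16-bit big-endian integer stored exactly as it appears in the font data.
/// `[u8; 2]` has alignment 1, so it can sit at any byte offset.
#[derive(Clone, Copy)]
#[repr(transparent)]
pub struct U16Be([u8; 2]);

impl U16Be {
    /// Byte-swap on access rather than when the table is loaded.
    pub fn get(self) -> u16 {
        u16::from_be_bytes(self.0)
    }
}

/// Hypothetical view over a table that starts with two u16 version fields.
pub struct VersionView<'a> {
    data: &'a [u8],
}

impl<'a> VersionView<'a> {
    /// Borrows the data without copying or decoding anything.
    pub fn new(data: &'a [u8]) -> Option<Self> {
        // One length check up front, so the accessors below cannot panic.
        if data.len() >= 4 {
            Some(Self { data })
        } else {
            None
        }
    }

    fn field(&self, offset: usize) -> U16Be {
        U16Be([self.data[offset], self.data[offset + 1]])
    }

    pub fn major(&self) -> u16 {
        self.field(0).get()
    }

    pub fn minor(&self) -> u16 {
        self.field(2).get()
    }
}
```

This version stays in safe Rust by indexing into the slice; actually casting a buffer to a struct reference would need `unsafe` or a helper crate, and a proc macro could generate accessors like `major` and `minor` from a table description.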

rsheeter commented 10 months ago

For context, https://lib.rs/crates/read-fonts aims to provide HB-style reading. However, we won't get to shaping until we (optimistic hat firmly on) finish landing https://lib.rs/crates/skrifa.

RazrFalcon commented 10 months ago

@behdad

Namely the zero-parsing model; giving that up is a nonstarter to me.

Is it really that much faster? ttf-parser has zero unsafe and does basically the same. Maybe with a slightly higher overhead due to imperative code instead of C++ templates-based DSL. swash even claims to be faster than HB, while using the same idea. But maybe it pulls ahead in non-parsing stages, which is also strange, because it would be hard to beat state machines.

Last time I checked, rustybuzz wasn't that much slower (sure, 50% is a lot, but not at ms scale) and there are a lot of optimization opportunities left. While it is 100% memory safe and has 0 memory leaks. A very tempting trade-off.

HarfBuzz doesn't link to or require libc++ / libstdc++

I'm aware of that, and this is in itself a strange feature, but by a toolchain I meant the compiler as well. Right now you do not need a C++ compiler to build rustybuzz and resvg, which is the core idea and would not change.

RazrFalcon commented 10 months ago

@wez

happy to relocate this discussion to the oxidize project if it feels like we're getting too far off-topic from the fate of rustybuzz

All good. This is the right place for such a discussion. I think it's a good illustration of how complex text shaping is, when Rust has 3 independent implementations and all of them are sort of dead. Writing one is one thing. Keeping it up to date is completely different. At least allsorts is company-backed, which is the only way for such a library to survive, imho. Sadly, allsorts has different priorities.

behdad commented 10 months ago

Is it really that much faster? ttf-parser has zero unsafe and does basically the same. Maybe with a slightly higher overhead due to imperative code instead of C++ templates-based DSL.

It's not faster. But a lot more memory efficient. See for example:

https://docs.google.com/document/d/12jfNpQJzeVIAxoUSpk7KziyINAa1msbGliyXqguS86M/preview

I know for example Android cares a lot about that.

RazrFalcon commented 10 months ago

@rsheeter Yeah, I'm following fontations. I'm interested to see how well code generation would work. I've tried it myself, but it felt needlessly complicated and limited with little benefits. Sadly, most of TrueType tables are not just POD structures. CFF is a good example of that. Or even glyf.

Keavon commented 10 months ago

(Just to interject, for Graphite we need Wasm support; currently we use RustyBuzz but had planned to switch to Cosmic Text so our preference would be keeping that pure Rust if at all possible.)

jackpot51 commented 10 months ago

I'm planning to maintain no_std and wasm support in cosmic-text no matter what happens.

RazrFalcon commented 10 months ago

@behdad Yes, I saw this paper, but ttf-parser doesn't allocate either. rustybuzz does allocate some GSUB/GPOS metadata, but it's a temporary hack. Otherwise memory usage between rustybuzz and harfbuzz should be identical.

The only overhead Rust has over C++ in this case is mandatory bounds checking. Which swash and allsorts try to avoid with some unsafe. I'm not a good programmer, so I tend to avoid unsafe completely.

The main benefit of Rust's/ttf-parser approach is that there is no need for separate validation and parsing steps, like in harfbuzz, which makes the code much simpler. But you do have to pay the higher price when parsing the same data over and over again.
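A minimal sketch of that trade-off, assuming a made-up `LazyU16Array` type rather than ttf-parser's real API: nothing is validated when the view is created, and every access checks bounds and decodes again.

```rust
/// Hypothetical lazy view over an array of big-endian u16 values.
#[derive(Clone, Copy)]
pub struct LazyU16Array<'a> {
    data: &'a [u8],
}

impl<'a> LazyU16Array<'a> {
    /// Construction does no validation and no allocation.
    pub fn new(data: &'a [u8]) -> Self {
        Self { data }
    }

    /// Number of complete elements; a trailing odd byte is ignored.
    pub fn len(&self) -> usize {
        self.data.len() / 2
    }

    /// Validation and decoding happen together, on every call.
    pub fn get(&self, index: usize) -> Option<u16> {
        let start = index.checked_mul(2)?;
        let end = start.checked_add(2)?;
        // `get` on the slice returns None instead of panicking when the
        // range is out of bounds, so malformed data cannot crash the reader.
        let bytes = self.data.get(start..end)?;
        Some(u16::from_be_bytes([bytes[0], bytes[1]]))
    }
}
```

Reading the same index in a hot loop repeats the check and the byte swap each time, which is the "higher price" mentioned above; a separate validation pass would pay for the checks once instead.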

And honestly, I was so burnt out by this port that correctness was my only priority. I spent no time optimizing it.

laurmaedje commented 10 months ago

@RazrFalcon Thanks for the heads-up! I totally understand the decision as keeping up with HarfBuzz is indeed a daunting task.

For Typst, the primary reason for using rustybuzz was the ability to compile to WebAssembly. Sadly, Rust's story for linking with C while compiling to WebAssembly looks pretty much the same as it did in 2020 (when I initially looked for an alternative). The fact that being pure Rust also simplified builds in general was more of a side effect, but a very nice one that I'd hate to give up.

I think the thing that really needs fixing is Rust's linking story with WebAssembly. But if HarfBuzz itself were to move to Rust, that would of course be amazing. Both things will take time though. For the time being, I think keeping this project afloat until a good long-term solution emerges is worthwhile. I would be happy to help with maintaining the crate and assisting any potential contributors. I won't have time to port the changes myself though since I'm stretched for time too and Typst (as a company) still has very limited resources.

RazrFalcon commented 10 months ago

@laurmaedje I don't think better linking would help much. If you use harfbuzz just for shaping, linking statically, 60% of the code is just bloat. You do not need subsetting, Unicode tables, custom C++ std implementation or even TrueType parsing. So even from a binary size perspective a pure Rust implementation is still better. And hey, it's even 100% memory safe.

This is the reason why I would also not switch rustybuzz for swash or allsorts in resvg, because both of those crates have their own TrueType parsing code, instead of ttf-parser. And their own Unicode handling code. And it would blow up pretty quickly.

If only we could make text shaping simpler...

laurmaedje commented 10 months ago

For what it's worth, I actually do need subsetting (and my own subsetting crate is quite simplistic). I can't judge how much the overhead would really be in the end. But I agree that pure Rust is nice.

RazrFalcon commented 10 months ago

Implementing subsetting would double, if not triple, the complexity. So if you need the whole package - harfbuzz is the only choice. I don't even know any other libs that do it. Sure, allsorts does, but it's far more primitive. CoreText internally does support subsetting for PDF writing, but I'm not sure it has a public API. It's an extremely niche task.

I don't quite remember how subsetting works in harfbuzz, but I think it has a pretty low-level access to TrueType tables, that ttf-parser simply doesn't provide. It would probably require a significant rewrite. And this is definitely out of scope. I prefer single purpose libraries, which is obviously not possible in C/C++ due to a lack of package management.

khaledhosny commented 10 months ago

@laurmaedje I don't think better linking would help much. If you use harfbuzz just for shaping, linking statically, 60% of the code is just bloat. You do not need subsetting, Unicode tables, custom C++ std implementation or even TrueType parsing.

Almost anything can be compiled out in HarfBuzz, and there are many optional size optimizations that can be switched on if size is more important than memory consumption or speed.

RazrFalcon commented 10 months ago

@khaledhosny That's true. But I don't think you can disable TrueType parsing and Unicode tables. Heck, harfbuzz even has its own mutex.

behdad commented 10 months ago

@khaledhosny That's true. But I don't think you can disable TrueType parsing and Unicode tables.

You can definitely disable Unicode tables and provide your own. That's the whole point of hb_unicode_funcs_t.

Heck, harfbuzz even has its own mutex.

If you're bashing HarfBuzz for having an abstraction over pthreads, Windows, and c++11 mutexes, I don't know what to say. I'm sure Qt and Skia do the same.

khaledhosny commented 10 months ago

@khaledhosny That's true. But I don't think you can disable TrueType parsing

You can supply your own font funcs, but some tightly integrated tables are always parsed by HarfBuzz, this gives us control over shaping-critical performance and memory consumption, as well as the ability to innovate and expand OpenType beyond what is currently possible.

Code is written to help people, not to be a literary masterpiece, and HarfBuzz cares about correctness as well as performance. I don’t think there is any other text shaping engine that beats HarfBuzz in these two areas; they can be more performant but less correct, or correct but less performant (or neither correct nor performant). HarfBuzz is also as flexible as it gets, without sacrificing any of this. So maybe give HarfBuzz the benefit of the doubt next time you wonder why it does something in a certain (maybe unusual) way.

RazrFalcon commented 10 months ago

@khaledhosny I'm aware of hb_font_funcs_t, but they are obviously rather limited. The point was that mixing languages, when compiling statically, leads to bloat. That's inevitable.

I'm not saying that harfbuzz is bad or anything, I just find some design decisions weird. Like the fact that harfbuzz reinvents C++ std. I'm sure all of them have an explanation, but I still find them weird.

You're taking it too seriously. I did spend half a year reading harfbuzz source code. I'm very familiar with it. But I also did spend too much time deciphering "modern C++"...

alerque commented 10 months ago

I know by this point you've gotten several offers of help to carry on this project, including very capable ones. That being said, I'd like to offer myself as well to lend a hand here and there. I have some involvement in the Harfbuzz project upstream in contributing CI workflows, GNU autotools build tooling, and some release management bits and bobs. I don't have very much C/C++ experience so my code contributions there have been limited, but I do understand a lot about shaping and the issues involved. I am very interested in a pure Rust implementation both for my own use downstream and just to improve the ecosystem. Typst already uses it and I've had an eye on making it an optional shaper in SILE. Long term, if Harfbuzz gets a gradual rewrite in Rust I would be very interested in helping with that, but I'm also willing to contribute here in the meantime. I especially have a lot of experience with issue triage, PR review, CI workflows, and other aspects that are needed to keep FOSS projects rolling smoothly. I'm also an Arch Linux packager (as well as former PLD Linux dev and others) and have a pretty good handle on how shapers fit into the downstream ecosystem, and also on how to get along and play nicely with other devs. If you're interested in my lending a hand facilitating ongoing development, feel free to add me on and I'll pitch in where I can.

RazrFalcon commented 10 months ago

@alerque Thanks for the offer. First, we have to figure out who is updating what. I don't think there are any tasks that can be done in parallel by multiple people. Maybe people would be willing to switch for each hb version. I can open a discussion thread for communication.

As for C++ experience, I don't think it's that necessary. Most of the work is to check commits for relevant changes. This function in this file added a new check and we have to replicate it in Rust. It's not hard per se, just very tedious and time consuming. There are serious changes here and there, but they are relatively rare. You have to remember that rustybuzz is not a full harfbuzz port, but rather a port of harfbuzz's shaper. Most changes do not affect us.

Other than that I don't think I need much help. CI is as complete as it should be. Packaging is out of scope. But PR reviews are welcome.

behdad commented 10 months ago

If you ever want to move this repo to the harfbuzz org, we'd be more than happy to have it. And if volunteers update the code to the latest HB version, we can try to keep it up to date moving forward.

RazrFalcon commented 10 months ago

@behdad Thanks! It would be an honor. I will consider this when we finish syncing it, if it will eventually happen.

DemiMarie commented 9 months ago

@behdad would you ever be interested in replacing the C++ version of Harfbuzz?

alerque commented 9 months ago

@DemiMarie Perhaps read through the messages above...

DemiMarie commented 9 months ago

Perhaps an alternate strategy to solve these problems from the perspective of the rust community would be to get a sense of whether there is interest/desire amongst the harfbuzz folks to see harfbuzz itself migrate to being implemented in rust and working with them to incrementally migrate from the inside out? That's also a huge undertaking, but it would be a bounded undertaking with lasting effects.

We are definitely interested. And it's in the scope for the https://github.com/googlefonts/oxidize project. I just would hate to give up some of the conveniences and optimizations of the C++ implementation. Namely the zero-parsing model; giving that up is a nonstarter to me.

cc @rsheeter

That would be absolutely stunning! Zero-parsing is definitely doable in Rust, and you can use the bytemuck crate to derive the casting operations. The main caveat is that accessing memory mapped files is generally unsafe unless you take precautions against the files being mutated out from under you, since this will cause undefined behavior if you have references to the mapped region. Fuchsia has a shared_buffer crate for this purpose.

behdad commented 9 months ago

The main caveat is that accessing memory mapped files is generally unsafe unless you take precautions against the files being mutated out from under you

That's indeed the case with current HB implementation. We haven't received any bug report ever about it though.

DemiMarie commented 9 months ago

The main caveat is that accessing memory mapped files is generally unsafe unless you take precautions against the files being mutated out from under you

That's indeed the case with current HB implementation. We haven't received any bug report ever about it though.

That’s probably because it never makes sense to mutate a file (most likely a font file) while Harfbuzz is looking at it. Doing so invariably indicates either user error or a bug somewhere else, and would produce garbage even if the current implementation of HB was 100% safe in that situation.

The one time that this could cause problems is if an attacker can cause Harfbuzz to look at a file that the attacker is mutating, in a situation where Harfbuzz runs with privileges the attacker doesn’t have. For instance, it isn’t safe to have Harfbuzz use a font file that is writable by a user one doesn’t trust. It also isn’t safe to have Harfbuzz use a font file located on a network share if one doesn’t trust the server of that share and everyone with write access to that share. These situations are sufficiently obscure that I’m not surprised nobody has pointed them out before.

DemiMarie commented 9 months ago

I can think of several solutions to this problem:

  1. Ignore the problem, and declare that one should just Not Do That. This is a perfectly valid decision!
  2. Only use mmap() for files in trusted locations, such as the well-known system-wide and per-user font stores. Make a copy otherwise.
  3. Make HB safe in that situation, using something like the aforementioned shared_buffer crate.
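Option 2 could look roughly like the sketch below. It only covers the decision of whether to map or copy; the actual mapping is left out because the Rust standard library has no mmap API. The `LoadStrategy` and `choose_strategy` names are hypothetical, and the list of trusted directories is assumed to come from the caller.

```rust
use std::path::{Path, PathBuf};

/// How a font file should be brought into memory.
#[derive(Debug, PartialEq)]
pub enum LoadStrategy {
    /// The file lives in a trusted font store, so mapping it is acceptable.
    MapInPlace,
    /// Anything else is read into a private buffer first.
    CopyToMemory,
}

/// Picks a strategy from the font's location. Both `font` and `trusted_dirs`
/// are assumed to be absolute and already canonicalized by the caller, so
/// that `..` components and symlinks cannot defeat the prefix check.
pub fn choose_strategy(font: &Path, trusted_dirs: &[PathBuf]) -> LoadStrategy {
    // `Path::starts_with` compares whole components, so
    // `/usr/share/fonts-evil` does not match `/usr/share/fonts`.
    if trusted_dirs.iter().any(|dir| font.starts_with(dir)) {
        LoadStrategy::MapInPlace
    } else {
        LoadStrategy::CopyToMemory
    }
}
```

A caller would typically pass the result of `std::fs::canonicalize` for the font path and fall back to `CopyToMemory` if canonicalization fails.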

In any case, this should not block porting Harfbuzz to Rust!

behdad commented 9 months ago

I should study how the shared_buffer crate makes it safe. About the second option, HB itself most of the time doesn't open the file, so it's a user problem. As in: we have added API for opening the file, but that's secondary to the main API that just takes memory and ownership type and uses it.

FWIW, FreeType also opens files with mmap...