Sync to latest (v8.5.0)

LaurenzV commented 1 week ago

Luckily for us, those last 1000+ commits once again were mostly about subsetting/instancing/accelerators/boring-expansion-spec, the shaping logic itself remained mostly the same. A couple of questions/uncertainties, but nothing major:

Status	Commit message	HB
🟢	[use] Add FM categories to `POST_BASE_FLAGS64`	Link
🟢	[use] Allow multiple CMAbv glyphs on subjoined	Link
🟢	[ot-map] Minor micro-optimize	Link
🟢	[ot-map] Speed up for default shaper	Link
🟢	[ot-shape] Minor short-circuit	Link
🟢	[ot-shape] Minor short-circuit	Link
🟢	[ot-shape] Short-circuit spaces as well	Link
🟡	.	Link	which function does this exactly apply to in our case? there are so many applys, I kind of lost the overview haha... but since it's just an optional speedup we can ignore it for now. but I would still like to add a TODO in the appropriate places.
🟡	[gsubgpos] Vastly speed up ChainRuleSet / RuleSet matching	Link	same
🟢	Require numerator and denominator in auto fraction	Link
🟢	Revert "Require numerator and denominator in auto fraction"	Link
🟢	[buffer] Fix unsafe_to_concat()	Link
🟢	Revert "Revert "Require numerator and denominator in auto fraction""	Link
🟢	[shape] Unsafe-to-concat around fraction slash	Link
🟡	[ot-map] Speed up feature finding	Link	how do we port this? afaik we can't use hashmaps because of no_std, right?
🟢	[buffer] Minor micro-optimize	Link
🟢	[skippy-iter] Remove early stop	Link
🟢	Add test for previous commit	Link
🟢	[skippy-iter] Remove unused num_items	Link
🟢	[arabic/stch] Zero advances	Link
🟢	[arabic/stch] Also work in left-to-right direction	Link
🟢	[arabic] Add test for previous commit	Link
🟢	[arabic/stch] Center the stretched group over the digits	Link
🟢	[USE] Update the data files	Link
🟢	[Unicode 15.1] Update the Arabic table	Link
🟢	[Unicode 15.1] Update the USE table	Link
🟢	[Unicode 15.1] Update the Indic table	Link
🟢	[Unicode 15.1] Update the vowel constraint table	Link
🟡	[layout] Change order of feature collection	Link	relates to the map issue mentioend above
🟢	[ReverseChainSingleSubst] Minor optimization	Link
🟡	[cmap] Implement MacRoman encoding	Link	After this commit, some tests changed but they didn't change for us. But from what I can tell this is probably because it's not supported in ttf-parser, right?
🟢	[aat] Mark DELETED_GLYPH as IGNORABLE	Link
🟡	[buffer] Add API for random state	Link	add?
🟡	[gsubgpos] Reduce stack use in recursion	Link	I guess this means we should use a vector instead of array here, too?

I've also updated the README and added some of my thoughts on future work for rustybuzz. :)

P.S.: Thanks @behdad for keeping such a clean commit history, it really makes things 100x easier!

behdad commented 1 week ago

P.S.: Thanks @behdad for keeping such a clean commit history, it really makes things 100x easier!

Thank you! It's not everyday that my baby is not shat upon in this project. :)))

Thank you for your impressive work. As you have noticed, shaping hasn't been changing much in recent years. So I'm hopeful that after your work, we can keep rustybuzz up to date moving forward.

LaurenzV commented 1 week ago

I've never ever meant to throw shade at harfbuzz, sorry if I ever said something that suggested that! :(

It's obviously very impressive, the fact that the code is so difficult to "translate" to Rust is obviously not something you can control. 😄 That's just the unavoidable complexity of text shaping.

behdad commented 1 week ago

I've never ever meant to throw shade at harfbuzz, sorry if I ever said something that suggested that! :(

Oh you never did!

It's obviously very impressive, the fact that the code is so difficult to "translate" to Rust is obviously not something you can control. 😄 That's just the unavoidable complexity of text shaping.

I want to eg. address @RazrFalcon's repeated mention of heavy C++ template use in the codebase. What are we supposed to do? Eg. the AAT state-machines are used, in slightly different ways, in mort, kern, morx, and kerx tables. The C++ solution to that problem is templates. We only use those to the extent necessary. I don't know what the Rust solution to the same problem is. And we don't even use much template meta-programming; again, not more than necessary.

RazrFalcon commented 1 week ago

Jesus! You're insane! I thought there is still work for months. Well, likely for us rustybuzz is up to date once again. 🎉

I'm incredibly happy that so many people helped with this project. I wouldn't be able to do it myself. From all of the projects I wrote, this one has the most amount of contributed code. So I want to thank you again, as well as @laurmaedje, @bluebear94, @Pi-Cla and everyone else who helped with backporting.

RazrFalcon commented 1 week ago

. Link which function does this exactly apply to in our case? there are so many applys, I kind of lost the overview haha... but since it's just an optional speedup we can ignore it for now. but I would still like to add a TODO in the appropriate places.

This one? And probably this one.

[ot-map] Speed up feature finding Link how do we port this? afaik we can't use hashmaps because of no_std, right?

Yes, let's ignore. rustybuzz performance is a whole another story.

[cmap] Implement MacRoman encoding Link After this commit, some tests changed but they didn't change for us. But from what I can tell this is probably because it's not supported in ttf-parser, right?

Hm... yes, sort of. ttf-parser maps characters to glyphs only for Unicode subtables. But it also provides a low-level access. Will take a look. Should be doable.

[buffer] Add API for random state Link add?

No need to.

[gsubgpos] Reduce stack use in recursion Link I guess this means we should use a vector instead of array here, too?

Yes, we should. Moreover we already use smallvec just for that. So you can use it instead of Vec.

LaurenzV commented 1 week ago

Done!

RazrFalcon commented 1 week ago

@behdad

I want to eg. address @RazrFalcon's repeated mention of heavy C++ template use in the codebase. What are we supposed to do?

The problem is not the usage of templates themselves, but the way they are used. You use CRTP quite a lot (I hope that's the right term), aka hb_dispatch, which breaks go-to definition in any IDE and I had to use debugger to figure out what code would actually be called. I get it, dynamic dispatch is "slow", but there should be a balance.

And we don't even use much template meta-programming; again, not more than necessary.

hb_dispatch? hb_iter? template<> appears 1043 times. It's everywhere.

You have your own ranges, arrays, hash maps, atomics, mutex implementation, which doesn't help with understanding the code, since people have to learn a new API. Is this purely because you target C++11 or because you have some beef with std? I get it, C++ std is a hot garbage, but implementing your own is not a solution. Like a lot of code could be heavily simplified by using std. Heck, you even have your own string-to-number parser... Yes, in a text shaper.

And I'm not saying this as a random C++ hater. Which I am. I did worked as a C++ developer for nearly a decade. I seen bad and good codebases. You can write human-readable C++ code. Look at Qt or Skia. Yes, it's rare, but possible.

"Performance optimizations" in general make HB codebase hard to read. Do we really need unlikely? How much it improves the performance? 0.1%? Just look at rustybuzz. Not only it is 100% memory safe, unlike HB, but it also barely optimized. And still we're just 2x slower at the worse case scenario. Something to consider.

The other problem with HB is that it's basically C with classes with all the classic C cryptic nonsense. And I wish it to be more of "modern C++".

Like what is this code?!

bool use_cache = font->num_coords;

why not:

bool use_cache = font->has_non_default_variation_coords();

Why it has to be so cryptic?

What about this?

if (var_table.get_length ()) {}

oh, it's actually suppose to be:

if (!var_table.is_empty()) {}

or even better:

std::optional<hb_blob_ptr_t<V>> var_table_opt;
if (var_table_opt.has_value()) {}
// or even
if (auto var_table = var_table_opt) {}

Why we have the ugly:

_hb_glyph_info_is_default_ignorable (&info[i])

instead of human-readable:

info[i].is_default_ignorable()

Why we have

hb_ot_position_default (const hb_ot_shape_context_t *c) {}

instead of

hb_ot_position_default (const hb_ot_shape_context_t &c) {}

in a C++ codebase? Why pointers everywhere?!

Why:

if (!(buffer->scratch_flags & HB_BUFFER_SCRATCH_FLAG_HAS_NON_ASCII)) {}

instead of

if (!buffer->has_non_ascii())

This alone would improve the code readability immensely.

How about some FP? This:

  unsigned int count = buffer->len;
  hb_glyph_info_t *info = buffer->info;
  for (unsigned int i = 0; i < count; i++)
    set_khmer_properties (info[i]);

is effectively:

buffer.info.for_each(set_khmer_properties);

And I'm not even asking the impossible, like having a decent code auto-formatting. Like using spaces instead of tabs-with-spaces. Like always using {} for conditions and loops.

And I can go forever... HB codebase can be improved (imho) so much by a simple refactoring.

It's not everyday that my baby is not shat upon in this project.

I just think that some code is objectively bad. And yes, this includes most of my code as well.

RazrFalcon commented 1 week ago

@LaurenzV I just read your readme update and I completely agree. Test coverage improvement and performance optimizations are the main goals from now.

RazrFalcon commented 1 week ago

@behdad Looks like there is a bug in unicode_to_macroman. It returns 0 for 0x0131, but 0x0131 is in the table. Is it expected or is it because you use hb_bsearch on an unsorted array?

behdad commented 5 days ago

@behdad Looks like there is a bug in unicode_to_macroman. It returns 0 for 0x0131, but 0x0131 is in the table. Is it expected or is it because you use hb_bsearch on an unsorted array?

Looks like it. Filed a bug for that. Thanks.

behdad commented 5 days ago

@behdad

I want to eg. address @RazrFalcon's repeated mention of heavy C++ template use in the codebase. What are we supposed to do?

The problem is not the usage of templates themselves, but the way they are used. You use CRTP quite a lot (I hope that's the right term), aka hb_dispatch, which breaks go-to definition in any IDE and I had to use debugger to figure out what code would actually be called. I get it, dynamic dispatch is "slow", but there should be a balance.

And we don't even use much template meta-programming; again, not more than necessary.

hb_dispatch? hb_iter? template<> appears 1043 times. It's everywhere.

You have your own ranges, arrays, hash maps, atomics, mutex implementation, which doesn't help with understanding the code, since people have to learn a new API. Is this purely because you target C++11 or because you have some beef with std? I get it, C++ std is a hot garbage, but implementing your own is not a solution. Like a lot of code could be heavily simplified by using std. Heck, you even have your own string-to-number parser... Yes, in a text shaper.

I know it's hard to believe, but there's a good reason for most of those...

String-to-number: IIRC there's no portable way to convert numbers in a locale-independent way. glib does the same. Cairo does the same.

I want C++ ranges and concepts, but they are prohibitively new. We have mutex & atomics, mostly from before C++11 standardized them. Now they are mostly a wrapper around std, except when that's not feasible.

Re std use, HB started from the GNOME project where linking to std was not desirable. That might have changed, but last time I tried to pull in std hash-table I hit other issues, like having to link to libstd which is something we still have not decided to do.

Re pointers around, again, it comes from HarfBuzz's roots as a C library. Yes, a lot can be cleaned up if we had more resources. We're no Skia team.

Speaking of Skia, I always hated becoming like Skia: having my own implementation of everything, but HB ended up like Skia as you observe. It's not the first library I've seen to do that. There are valid reasons that you cannot use all the various goodies in std.

std also has no good memory-efficient equivalent of hb_set_t.

hb_dispatch I don't know how I could do better in a more IDE-friendly way.

And I'm not saying this as a random C++ hater. Which I am. I did worked as a C++ developer for nearly a decade. I seen bad and good codebases. You can write human-readable C++ code. Look at Qt or Skia. Yes, it's rare, but possible.

"Performance optimizations" in general make HB codebase hard to read. Do we really need unlikely? How much it improves the performance? 0.1%? Just look at rustybuzz. Not only it is 100% memory safe, unlike HB, but it also barely optimized. And still we're just 2x slower at the worse case scenario. Something to consider.

The other problem with HB is that it's basically C with classes with all the classic C cryptic nonsense. And I wish it to be more of "modern C++".

That's where it came from, yes. And it still exposes a fully C API, which puts certain limitations on the core types. There's also the limitation from Chromium that we cannot have ctor-initializers.

Like what is this code?!
bool use_cache = font->num_coords;
why not:
bool use_cache = font->has_non_default_variation_coords();
Why it has to be so cryptic?

What about this?
if (var_table.get_length ()) {}
oh, it's actually suppose to be:
if (!var_table.is_empty()) {}

That sounds like something that can be improved, yes.

or even better:

std::optional<hb_blob_ptr_t<V>> var_table_opt;
if (var_table_opt.has_value()) {}
// or even
if (auto var_table = var_table_opt) {}

I find that ugly. It's your rust sense of beauty that sees it as an improvement.

Why we have the ugly:
_hb_glyph_info_is_default_ignorable (&info[i])
instead of human-readable:
info[i].is_default_ignorable()

Because hb_glyph_info_t is a public C struct, which can't have methods.

Why we have
hb_ot_position_default (const hb_ot_shape_context_t *c) {}
instead of
hb_ot_position_default (const hb_ot_shape_context_t &c) {}
in a C++ codebase? Why pointers everywhere?!

I learned C++ while developing HarfBuzz. There's a lot of Cism left, as you pointed out.

Why:

if (!(buffer->scratch_flags & HB_BUFFER_SCRATCH_FLAG_HAS_NON_ASCII)) {}

instead of

if (!buffer->has_non_ascii())

You're right according to most people. I find it ugly to have to add two methods for each enum value though. The former reads perfectly fine to me coming from C background of course.

This alone would improve the code readability immensely.

How about some FP? This:
  unsigned int count = buffer->len;
  hb_glyph_info_t *info = buffer->info;
  for (unsigned int i = 0; i < count; i++)
    set_khmer_properties (info[i]);
is effectively:
buffer.info.for_each(set_khmer_properties);
And I'm not even asking the impossible, like having a decent code auto-formatting. Like using spaces instead of tabs-with-spaces. Like always using {} for conditions and loops.

And I can go forever... HB codebase can be improved (imho) so much by a simple refactoring.

It's not everyday that my baby is not shat upon in this project.

I just think that some code is objectively bad. And yes, this includes most of my code as well.

Interesting that you have said you find Qt and Skia to be excellent codebases. But they also reinvent everything, and reading their code my head hurts with all the unnecessary abstractions and indirections.

My goal always was to use a selected subset of C++ for HarfBuzz. That subset evolved over time, but still a subset.

Anyway. I agree the biggest problem with this style is the barrier to entry.

RazrFalcon commented 5 days ago

Speaking of Skia, I always hated becoming like Skia: having my own implementation of everything, but HB ended up like Skia as you observe. It's not the first library I've seen to do that.

That's just C/C++ curse due to a lack of proper dependency management. But I still think that using std will help a lot. Or maybe even abseil if needed. Or some boost headers here and there. Anything will be better than a custom implementation.

I know it's hard to believe, but there's a good reason for most of those...

rustybuzz has nearly zero templates. Except for AAT, which is just a copy of HB logic. Yes, Rust traits help a lot here, since they essentially implement CRTP at the language level. On the other hand, I haven't tried reducing templates in HB. Maybe I will be stuck as well.

First thing I did during porting is got rid of most of macros. I hate them even more then templates. I'm one of few people who doesn't use them even in Rust. It took me a while to wrap my head around hb-shaper.hh. Eventually I simply flattened it using gcc -E or something along those lines. I'm sure there are good reasons for having it, but it's unreadable and undebugable code.

I find that ugly. It's your rust sense of beauty that sees it as an improvement.

Beauty is subjective, but compile time guarantees are not. Yes, std::optional nowhere near Option because it's not checked at compile time. But at least you can immediately see that the value is optional. Code is for humans after all.

Also, I find Rust to be super ugly, but that's not the point.

And no, "std::optional has an overhead" is not an argument I will accept. After all, rustybuzz is 100% memory safe. That's what is important to me and everyone else.

Because hb_glyph_info_t is a public C struct, which can't have methods.

Can't we have an internal struct with the same size and simply cast it when needed? This change alone will improve code quality drastically. In both harfbuzz and rustybuzz. Especially rustybuzz, since that C-style code is very unidiomatic.

It would be great to have C API completely separated from the implementation.

I learned C++ while developing HarfBuzz. There's a lot of Cism left, as you pointed out.

Which is hurting porting process a lot. Well, C++ templates magic do not help much either...

The former reads perfectly fine to me coming from C background of course.

Most people have no idea what if (flags & FLAG) means. But everyone knows what if (flags.hasFlag()) does. C is on of the worst languages in the world. I deeply despise it. That's why I have such strong reaction to code like this. It's just super confusing to everyone.

Interesting that you have said you find Qt and Skia to be excellent codebases. But they also reinvent everything, and reading their code my head hurts with all the unnecessary abstractions and indirections.

I was talking about code style specifically. Yes, "Java of C++". I know they implement everything, but that's only because they are super old and do not want to break everything. harfbuzz do not have such a burden because most of the stuff is private anyway. And Qt in particular us moving to C++ std more and more.

All I wan't from harfbuzz is too look more like modern C++/Java and not like fossil C.

My goal always was to use a selected subset of C++ for HarfBuzz.

And every C++ project has its own subset one have to learn... I'm not saying you should go full "modern C++" insanity, but simply replacing pointers with references everywhere and switching _hb_glyph_info_is_default_ignorable (&info[i]) to info[i].is_default_ignorable() will already simplify rustybuzz support immensely.

I know it all sounds like a hate thread or a holy war, but I want to point out that I've spend like half a year reading harfbuzz code. I have a right to be angry about it.

RazrFalcon / rustybuzz

Sync to latest (v8.5.0) #115