RazrFalcon / rustybuzz

A complete harfbuzz's shaping algorithm port to Rust
MIT License
498 stars 34 forks source link

Replace hand-translated machines with nom #76

Closed notgull closed 10 months ago

notgull commented 10 months ago

The _machine.rs files are unmaintainable, as they are a 1:1 handwritten mirror of the machine-generated state machine files for Harfbuzz. My goal is to make this crate more maintainable by writing these files in such a way that it corresponds with the .rl files. This way, changes to the .rl files can be easily reflected in rustybuzz.

My original attempt to replace the indic_machine.rs file, however, has not passed tests. I am not familiar with ragel itself and I am mystified by its semantics. As there isn't a ragel chat room I can ask for help, as far as I know, I figure that this is the best place to ask.

I've focused on one specific test, indic_old_spec_003. The input to the parser looks like this:

categories: [C, H, CM, H, X]

The parser for the consonant_syllable rule looks like this:

c = (C | Ra);           # is_consonant
n = ((ZWNJ?.RS)? (N.N?)?);  # is_consonant_modifier
z = ZWJ|ZWNJ;           # is_joiner
reph = (Ra H | Repha);      # possible reph

cn = c.ZWJ?.n?;
symbol = Symbol.N?;
matra_group = z*.(M | SM? MPst).N?.H?;
syllable_tail = (z?.SM.SM?.ZWNJ?)? (A | VD)*;
halant_group = (z?.H.(ZWJ.N?)?);
final_halant_group = halant_group | H.ZWNJ;
medial_group = CM?;
halant_or_matra_group = (final_halant_group | matra_group*);

complex_syllable_tail = (halant_group.cn)* medial_group halant_or_matra_group syllable_tail;
consonant_syllable =    (Repha|CS)? cn complex_syllable_tail;
broken_cluster =    reph? n? complex_syllable_tail;
other =         any;

Source

For this rule, the first (Repha|CS)? block evaluates to no input, as the first item is C which is neither Repha nor CS. The next item is cn, which matches the C tag and consumed it. The next tag is H, which doesn't match ZWJ or n, so the cn rule completes and we move on to complex_syllable_tail.

The complex_syllable_tail rule starts with (halant_group.cn)*, which would originally match the H, but there is no C at the end, so this rule evaluates to no input. the next medial_group rule is CM?. As the current input is H, CM doesn't match, so this rule evaluates to no input as well. halant_or_matra_group goes to final_halant_group which goes to halant_group which matches H and nothing else, consuming it. Finally, syllable_tail matches the lack of input at the end. Therefore the range from 0..2 is classified as a constant syllable.

Then, the next item on the chopping block is CM. Out of all the rules, this matches the complex_syllable_tail part of broken_cluster, along with the H. Therefore 2..4 is a broken cluster. Finally, the X at the end becomes a non-indic character.

However, this fails the test. After wiring some telemetry to the current rustybuzz master, I've found that it classifies the range from 0..4 as a consonant syllable and the 4..5 range as non-indic. I'm not sure how it does this; it feels like what would happen is that the H is somehow consumed by either the cn or the halant_group.cn before the CM is consumed by the medial_group. Unless ragel's semantics are wildly different from what I understand, it is unclear to me how this would happen.

Is there anyone who knows ragel well enough to help my understanding of it here?

cc #74, @bluebear94

RazrFalcon commented 10 months ago

I'm not sure I would call the nom variant more maintainable. It's a complete gibberish for me. And performance is a big question as well.

If you have time, I would suggest compiling master ragel (which is an autotools hell and you need colm as well) and try its Rust output generator. I've tried it a couple of weeks ago and it was failing with a cryptic error. But I think that modifying hb ragel files to output Rust code directly via ragel is far better from maintainability standpoint.

notgull commented 10 months ago

Superseded by #77