Investigate on implementing segmenter by using ICU's break rules data

aethanyc commented 3 years ago

After discussing with ICU4X teams and experts from ICU, Markus suggested we should investigate a bit more on implementing the rule-based break iterator by using the approach in ICU4C. Quote from his suggestion:

I recommend taking the output from the rule builder -- the code point trie and state table that it puts into the .brk files -- serialize those into a convenient format, and only write runtime code for the FSM

More details in 2021-02-26 Note section, the bullet point "Discuss pros and cons of: porting Rust crates for UAX 14, UAX 29 vs. writing code to read the ICU data for RBBI".

If we could gather pros and cons on this approach, we'll have more information to weight it against #1 (i.e. integrating uax14_rs into ICU4X).

aethanyc commented 3 years ago

Quote Andy's comment in the doc:

There have been a fair number of requests over the years. Mostly not done because it's hard, and because tailorings require a lot of data at run time. ICU can't do them as deltas from a base - each tailoring needs a full, independent FSM.

https://docs.google.com/document/d/1ojrOdIchyIHYbg2G9APX8j2p0XtmVLj0f9jPIbFYVUE/edit?disco=AAAALlSiRm8

aethanyc commented 3 years ago

@makotokato Do you have any insights on whether we should build icu4x's segmenter by using ICU's approach or not?

makotokato commented 3 years ago

I think that ICU approach isn't better for performance. ICU will load binary format (brk) for rules, then use it. But uax14_rs (and xi-unicode) uses generated native code instead of data binary. As web browser, line-break property is additional rules set of UAX14, so we can compact code size to implement it. ICU has other rule set binary each line-break property value.

If using brk file, I guess that we should use generated code like uax14_rs/xi-unicode from brk/txt rule file.

Also, ICU binary file format isn't compatible with each version. When I integrated ICU into Firefox, each ICU data file is version-depended. So I cannot reduce binary size on Android build. So we need build binary format each version.

But I guess that implementing rule based breaker (sentence segmenter and etc if necessary for future version of Intl.Segmenter) like ICU may be worth to compare code size and performance.

jfkthame commented 3 years ago

The only way to be sure would be to implement & benchmark, but I'd be surprised if there was a significant performance difference. As I understand it, the ICU breakIterator compiles the rules into a finite state machine, which in principle should have similar performance whether it's stored as data in a .brk file or encoded directly in the source code. It's possible that hand-crafted state machines in Rust source could be optimized to do slightly better for a specific set of break rules, but the basic algorithm of running a state machine over the text will be comparable.

If the actual .brk binary format isn't efficient for us to use from Rust code at runtime, we might want to transform it into a different structure/representation; I haven't looked into the format to understand exactly how it's stored and whether it would be suitable to use directly, but it seems like a good place to start.

Eventually, I think we'll want to have a build-time step that takes ICU's .txt rule files as input and generates the efficient state machine that we want to use at run-time; we can get started by letting the ICU build step do this and just taking the .brk file from there, but later (assuming that works well) it would be good to remove the dependency on the ICU build, and have a rule compiler in our tree.

sffc commented 3 years ago

CC @markusicu: can you comment on the above?

About the file format: I don't think we should load actual .brk files. Markus said the .brk files contain a CodePointTrie (which we will be able to consume, pending https://github.com/unicode-org/icu4x/issues/508) plus a state machine table. We can get both of those things into ArrayBuffers served by the DataProvider, so we don't rely on the .brk file format itself.

aethanyc commented 3 years ago

If we'll be able to use CodePointTrie to query the break property from a codepoint, then the remaining question is the file format of the state machine table (rules defined in uax14 or uax29 spec) and how to generate the table.

I feel it makes sense for the rules to served by the DataProvider. Currently, this integration branch generates the rules tables in generate_properties.py in a separate step, and we could potentially rewrite it into a DataProvider. (@makotokato please correct me if I misunderstand this). However, I don't have a clear picture on how to generate the rules tables from ICU's brkitr rules. It may be possible to add some code in ICU4C's RBBIRuleBuilder to dump data for Rust to consume though.

sffc commented 3 years ago

However, I don't have a clear picture on how to generate the rules tables from ICU's brkitr rules. It may be possible to add some code in ICU4C's RBBIRuleBuilder to dump data for Rust to consume though.

It is fully in scope to have such a tool in ICU4C (dumping data for Rust to consume).

aethanyc commented 3 years ago

It seems the issue to create the ICU tool to dump data for Rust is in https://github.com/unicode-org/icu4x/issues/509.

aethanyc commented 2 years ago

In the end, we don't use ICU break rule data in ICU4X, but implement the spec rules from scratch.

aethanyc / icu4x

Investigate on implementing segmenter by using ICU's break rules data #2