apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

New section type that doesn't minimise #96

Open ftyers opened 4 years ago

ftyers commented 4 years ago

At the moment we add regexes in sections. Minimising regexes takes a long time, so perhaps we could have a special type="regex" section that is not minimised. That would speed up compilation of regex-heavy dictionaries.

This will likely break binary compatibility.

mr-martian commented 4 years ago

Or it could be unioned with some other section after that section has been minimized, to avoid having to create a new section in the binary.

ftyers commented 4 years ago

@mr-martian that sounds a bit more complicated. Also, it would be cool to be able to give weights to sections, but I'll open another issue for that.

mr-martian commented 3 years ago

Upon poking around a bit, I've determined that this would not break the binary format, since section types are just encoded as strings and lt-proc already handles multiple sections of the same type. Having lt-comp relabel type="regex" to type="standard" would result in complete backwards compatibility; alternatively, lt-proc could just recognize section names ending in @regex and treat them like @standard.

Either way, this should probably be accompanied by a way to mark <pardef>s as non-minimizing for the same reason. regex="yes", perhaps.
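To make the two compatibility options above concrete, here is a minimal sketch in Python rather than the actual lttoolbox C++. It assumes section names are stored as "id@type" strings; the list of existing suffixes and the function names `relabel` and `effective_type` are illustrative assumptions, not lttoolbox API.

```python
# Suffixes assumed to be recognised already (an assumption for this sketch),
# plus the proposed "@regex" suffix mapped onto "@standard".
KNOWN_SUFFIXES = ("@standard", "@inconditional", "@postblank", "@preblank")
PROPOSED_ALIASES = {"@regex": "@standard"}


def relabel(section_name: str) -> str:
    """Option 1 (compile time): rewrite 'id@regex' to 'id@standard'.

    The binary then only ever contains section types that older
    processors already understand.
    """
    for alias, target in PROPOSED_ALIASES.items():
        if section_name.endswith(alias):
            return section_name[: -len(alias)] + target
    return section_name


def effective_type(section_name: str) -> str:
    """Option 2 (run time): keep 'id@regex' in the binary, but have the
    processor treat it exactly like a standard section."""
    for alias, target in PROPOSED_ALIASES.items():
        if section_name.endswith(alias):
            return target[1:]  # drop the leading "@"
    for suffix in KNOWN_SUFFIXES:
        if section_name.endswith(suffix):
            return suffix[1:]
    raise ValueError(f"unknown section type in {section_name!r}")
```

With option 1 an old lt-proc keeps working unchanged; with option 2 only a new lt-proc would understand the section, but the binary still records which sections came from regexes.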

TinoDidriksen commented 3 years ago

This should be optional. For development it should be fast to compile and test, but for distribution it should heavily optimize to the smallest/fastest output binary.
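A rough sketch of what "optional" could mean, in Python for illustration only: a hypothetical `release_build` switch decides whether regex sections are skipped during minimisation. The function name, the switch, and the (name, type) representation of sections are all assumptions made for this example, not existing lt-comp options.

```python
def sections_to_minimise(sections, release_build):
    """Return the names of the sections that would be minimised.

    sections: list of (name, type) pairs, e.g. ("main", "standard").
    release_build: hypothetical switch; True means optimise everything
    for distribution, False means skip regex sections for fast rebuilds.
    """
    return [
        name
        for name, kind in sections
        if release_build or kind != "regex"
    ]


example = [("main", "standard"), ("numbers", "regex"), ("final", "inconditional")]

# Development: regex sections are left unminimised.
dev_plan = sections_to_minimise(example, release_build=False)

# Distribution: every section is minimised.
release_plan = sections_to_minimise(example, release_build=True)
```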

mr-martian commented 3 years ago

Also, it occurs to me that this is tricky because lt-comp minimizes each pardef separately in addition to each section.

unhammer commented 3 years ago

But this is about speed – is minimising each pardef on its own slow? (Last time I checked, the section minimisation at the end was the slow step.)

mr-martian commented 2 years ago

Another alternative: 0493630 added the ability to compile dictionaries in several pieces, which should alleviate the burden of frequently recompiling the regex sections.

mr-martian commented 2 years ago

In fact, we could have globally shared regex sections, as proposed in https://github.com/apertium/apertium/pull/161

unhammer commented 2 years ago

Minimisation has gotten quite a bit faster lately, but there's a related PR at https://github.com/apertium/lttoolbox/pull/165