apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

allow post-generation to work without wake-up-mark (<a/>, ~) #42

Closed (unhammer closed this issue 2 years ago)

unhammer commented 5 years ago

Post-generation should be able to just run on everything LRLM and only apply the changes where it matches (as if it were a version of sed that respects deformatting).

Say that for all words in your dictionary, you want to apply the rule …inh t… → …is…. It's just noisy to have to add an <a/> (or an explicit ~ in hfst/lexc) to the RL form-side of every place in your dictionary where that happens, and it's especially noisy if the parts of the form inh are generated by different pardefs.

If postgen didn't need a wake-up-mark, but stayed awake constantly, you could just put <l>inh<b/>t</l> <r>is</r> in post.dix and not make any changes to the generator at all.
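To make the difference concrete, here is a toy Python sketch of the two behaviours on a plain string. It is not lttoolbox's implementation: the function names are hypothetical, the rule table is just the inh t / is pair from above, and it ignores superblanks, word blanks and everything else that real deformatting-aware processing has to handle.

```python
# Toy model only: plain strings, no superblanks or word blanks.
# Function names are hypothetical, not part of lttoolbox.
RULES = {"inh t": "is"}  # left side of post.dix entry -> right side


def postgen_wakeup(text, rules, mark="~"):
    """Try rules only immediately after a wake-up mark; drop the mark."""
    lefts = sorted(rules, key=len, reverse=True)  # longest match first
    out = []
    i = 0
    while i < len(text):
        if text[i] == mark:
            i += 1  # consume the mark itself
            for left in lefts:
                if text.startswith(left, i):
                    out.append(rules[left])
                    i += len(left)
                    break
            continue
        out.append(text[i])
        i += 1
    return "".join(out)


def postgen_everywhere(text, rules):
    """Stay awake: try rules at every position, left to right, longest match."""
    lefts = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for left in lefts:
            if text.startswith(left, i):
                out.append(rules[left])
                i += len(left)
                break
        else:
            # No rule matched here: copy one character and move on.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With the wake-up variant, the generator has to emit `ab~inh tcd` for the rule to fire; the always-awake variant rewrites `abinh tcd` directly, at the cost of attempting a match at every position.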

This might have to be a new option (lt-proc -P, --post-generation-everywhere or something).

(via https://sourceforge.net/p/apertium/mailman/message/36600451/ )

ftyers commented 5 years ago

The -t option is related, but is currently broken I think, see #8.

unhammer commented 3 years ago

https://github.com/apertium/lttoolbox/commit/89c2a0600ba2a739b8ab8ed7120f9cf0d9a5301a can probably be simplified (the -t code looked simpler, but doesn't seem to have support for word blanks), but it seems to DTRT. It runs in 0.7s on something that regular analysis uses 3.2s on, while wake-up-mark pgen takes 0.2s, which seems acceptable. Still have to check @khannatanmai's extensive pgen test suite.

mr-martian commented 3 years ago

@unhammer I've made the relevant modifications to the tests in abc337d. It currently fails the first wblank test and I haven't made enough sense of the wblank logic to track down the issue.

mr-martian commented 2 years ago

> Say for all words in your dictionary, you want to apply the rule …inh t… → …is…. It's just noisy to have to add a <a/> (or explicit ~ in hfst/lexc) to the RL form-side of every place in your dictionary where that happens, and it's especially noisy if the parts of the form inh are generated by different pardefs.

It occurs to me that this could also be fixed by composing

"postgen"
0:%~ <=> _ i n h .#. ;

with the generator (though making postgen able to handle this directly is probably still a good idea).
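As a rough illustration of what that rule would do to each generated form, assuming it is read as "insert ~ exactly before a word-final inh", here is a small Python stand-in. The function name is hypothetical, and real composition happens on the transducer rather than on strings.

```python
# String-level stand-in for the twolc rule; the name is hypothetical.
def add_wakeup_mark(form, suffix="inh", mark="~"):
    """Insert the wake-up mark before a word-final suffix, if present."""
    if form.endswith(suffix):
        cut = len(form) - len(suffix)
        return form[:cut] + mark + form[cut:]
    # Forms that do not end in the suffix are left untouched.
    return form
```

The point of composing is that the mark gets added once, at compile time, instead of by hand in every pardef that can produce the inh ending.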

mr-martian commented 2 years ago

fix reverted in 957bc093afcb8def28fe583946ada3b8ac57f85d due to #123