apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Ignoring secondary tags during generation #83

Closed: khannatanmai closed this 4 years ago

khannatanmai commented 4 years ago

Secondary tags need to be ignored during generation. Input:

^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$ ^of<pr><sf:del>$ ^the<det><def><sg><sf:del>$ ^boy<n><sg><sf:chico>$ ^run<vblex><pres><sf:corren>$ ^fast<adj><sint><sf:rápido>$^.<sent><sf:.>$^.<sent><sf:.>$[][
]

Earlier Output:

 #The #dog #of #the #boy #run #fast#.#.[][
]

New Output:

The dogs of the boy run fast..[][
]
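
As a rough illustration of what "ignoring secondary tags" means for the input above, the sketch below strips every tag of the form <prefix:value> from a stream, leaving only the primary tags. It assumes that any tag containing a colon is a secondary tag, which is inferred from the examples in this thread. `strip_secondary_tags` is a hypothetical helper, not part of lttoolbox, and this is not how lt-proc itself implements the change.

```python
import re

# Assumption: a secondary tag is any <...> group that contains a colon,
# for example <sf:Los>. Primary tags such as <det> contain no colon.
SECONDARY_TAG = re.compile(r"<[^<>]*:[^<>]*>")


def strip_secondary_tags(stream: str) -> str:
    """Return the stream with every <prefix:value> tag removed."""
    return SECONDARY_TAG.sub("", stream)


if __name__ == "__main__":
    print(strip_secondary_tags("^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$"))
```

With the secondary tags gone, each lexical unit carries only the lemma and primary tags, which is what the generator needs to match.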

mr-martian commented 4 years ago

I don't think this will work on ^arco<n><m><sg><sf:rainbow># iris$

There's also an issue if a secondary tag value contains an escaped $.

khannatanmai commented 4 years ago

> I don't think this will work on ^arco<n><m><sg><sf:rainbow># iris$
>
> There's also an issue if a secondary tag value contains an escaped $.

True. Will fix.

mr-martian commented 4 years ago

I think this is correct unless unescaped # is allowed in secondary tags. That is, <loc:12#7> (location: sentence 12, word 7, or some such) is problematic unless we require it to be <loc:12\#7>. Dealing with this may require explicitly tracking tag boundaries within the ignore loop.

TinoDidriksen commented 4 years ago

Unescaped ^ $ and # should be allowed inside <>, I'd say. I don't think they currently are, but I see no reason they can't be.

khannatanmai commented 4 years ago

@TinoDidriksen @mr-martian Almost all parsers treat '$' as the signal to process an LU, so not allowing unescaped special characters is simply consistent with the current state of the tools.

> Dealing with this may require explicitly tracking tag boundaries within the ignore loop.

Yeah. If we decide we really don't want to require escaping (only for # and $), then I can implement it.

khannatanmai commented 4 years ago

@mr-martian @TinoDidriksen This now works with any unescaped characters inside secondary tags.

Tests:

Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:4\#sabasa><id:2\#:># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:sabasa><id:2># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:4#sabasa><id:2#:># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:$$4#saba$sa><id:2#:$># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius

EDIT: Prefixes can have unescaped special characters as well:

echo "^Stroke<n><sg><$$s#^f:$$4#saba$sa><i#$$#^d:2#:$># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius

Works with compounds:

Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs>+not<adv># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
being not sorry
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs><sf:xyz>+not<adv><sf:abc># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
being not sorry
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs><sf:xyz><id:++$$#>+not<adv><s$f:$+$a##bc># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
being not sorry
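
One way to get the behaviour exercised by these tests is to track tag boundaries explicitly while reading a lexical unit, so that #, $, + and ^ only act as delimiters outside <...>. The following is a minimal sketch of that idea, not code taken from lttoolbox. It assumes that a secondary tag is any tag containing an unescaped colon and that the first # outside a tag starts the trailing part of a multiword. `read_lexical_unit` is a hypothetical name.

```python
def read_lexical_unit(text: str) -> tuple[str, str]:
    """Read one lexical unit of the form ^...$ from the start of `text`.

    Returns (head, queue): head is the lemma plus primary tags, queue is
    whatever follows the first '#' found outside a tag. Tags containing an
    unescaped colon are treated as secondary and dropped (an assumption).
    """
    if not text.startswith("^"):
        raise ValueError("lexical unit must start with '^'")

    head: list[str] = []
    queue: list[str] = []
    target = head            # where ordinary characters are currently written
    in_tag = False
    tag: list[str] = []
    tag_is_secondary = False

    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep an escaped character verbatim; it never acts as a delimiter.
            (tag if in_tag else target).append(text[i:i + 2])
            i += 2
            continue
        if in_tag:
            # Inside a tag, only '>' is special; '#', '$', '+', '^' are plain text.
            tag.append(ch)
            if ch == ":":
                tag_is_secondary = True
            elif ch == ">":
                if not tag_is_secondary:
                    target.append("".join(tag))
                in_tag = False
        elif ch == "<":
            in_tag, tag, tag_is_secondary = True, ["<"], False
        elif ch == "#" and target is head:
            target = queue   # switch to collecting the multiword queue
        elif ch == "$":
            return "".join(head), "".join(queue)
        else:
            target.append(ch)
        i += 1
    raise ValueError("unterminated lexical unit")


if __name__ == "__main__":
    print(read_lexical_unit("^arco<n><m><sg><sf:rainbow># iris$"))
```

Because the scanner only reacts to # and $ while it is outside a tag, a value such as <loc:12#7> needs no escaping, and compounds joined with + pass through with their primary tags intact.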