OpenNMT / OpenNMT

Open Source Neural Machine Translation in Torch (deprecated)
https://opennmt.net/
MIT License
2.39k stars, 466 forks

-segment_numbers option #510

Open qutie75 opened 6 years ago

qutie75 commented 6 years ago

Hello!

I want to ask about the -segment_numbers option.

If I set this option when tokenizing, should I see its effect in the output file?

This is my command:

th tools/tokenize.lua -case_feature true -segment_case true -segment_numbers true -joiner_annotate true < input_test_en.txt > test.tok

and the output looks like this:

the│C convention│L in│L 1912│N led│L to│L a│L split│L republican│C party│C ■.│N

I expected 1912 to be segmented as 1 9 1 2, but there is no change…

Please help me. Thank you.

jsenellart commented 6 years ago

Hi @qutie75, yes, this is a known issue. -segment_numbers only works with -mode aggressive, so you can use that for the moment. We will fix that, or block use of the option in non-aggressive mode, because it is more in the spirit of "aggressive" than "conservative" tokenization.
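
For readers who want to see what digit segmentation is expected to produce, here is a minimal Python sketch. It is not OpenNMT's implementation: the helper names `segment_numbers` and `desegment` are hypothetical, and attaching the joiner marker ■ to the end of each split piece is an assumption made for illustration. It also ignores the case feature (the │C, │L, │N suffixes in the output above). The workaround from the reply remains adding -mode aggressive to the tokenize.lua command.

```python
import re

JOINER = "■"  # joiner marker as it appears in the tokenized output above


def segment_numbers(token, joiner=JOINER):
    """Split each digit of `token` into its own piece (hypothetical helper).

    Every piece except the last gets the joiner appended, so the split
    can be undone later. Joiner placement here is an assumption.
    """
    # Single digits become separate pieces; runs of non-digits stay whole.
    pieces = re.findall(r"[0-9]|[^0-9]+", token)
    return [p + joiner for p in pieces[:-1]] + pieces[-1:]


def desegment(tokens, joiner=JOINER):
    """Undo segment_numbers by removing each joiner and the space after it."""
    return " ".join(tokens).replace(joiner + " ", "")


if __name__ == "__main__":
    parts = segment_numbers("1912")
    print(" ".join(parts))   # 1■ 9■ 1■ 2
    print(desegment(parts))  # 1912
```

A token without digits, such as led, comes back unchanged as a single piece, so the helper can be applied to every token of a line.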