Leon0824 / rimeime

Automatically exported from code.google.com/p/rimeime

cangjie5 wrong ordering of candidates #604

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
To reproduce (I did this on Linux, an up-to-date Debian sid):

1. Build librime normally, so there's rime_console in build/bin
2. Select cangjie5 manually, by
  % echo "var:" >user.yaml
  % echo "  previously_selected_schema: cangjie5" >>user.yaml
3. Enter vmomr into it
  % echo vmomr | ./rime_console -i
initializing...ready.
input  : [vmomr]
comp.  : [組合]
page_no: 0, index: 0
cand. 1: [組合]  quality=6.748e-05
cand. 2: [約合]  quality=7.72e-06
cand. 3: [經合]  quality=5.87e-06
cand. 4: [約但河]  quality=3.48e-06
cand. 5: [给]  quality=1e-08
input  : [vmomr]
comp.  : [組合]
page_no: 0, index: 0
cand. 1: [組合]  quality=6.748e-05
cand. 2: [約合]  quality=7.72e-06
cand. 3: [經合]  quality=5.87e-06
cand. 4: [約但河]  quality=3.48e-06
cand. 5: [给]  quality=1e-08

Notice that the fifth candidate, [给], has the cangjie5 code "vmomr". I would
expect it to come first, ahead of the partially matching phrase [組合]
(whose complete code is "vfbm omr").

This bug basically renders cangjie5 unusable, because many other common
characters likewise fail to appear first in the candidate list.

Thank you,

Yixuan

Original issue reported on code.google.com by culu....@gmail.com on 14 May 2014 at 4:39

GoogleCodeExporter commented 9 years ago
The version of librime in the report is 1.1. (Sorry for forgetting it.)

Original comment by culu....@gmail.com on 14 May 2014 at 4:42

GoogleCodeExporter commented 9 years ago
Before step 2, the current directory should be changed:
% cd build/bin
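
The reproduction steps, including the directory change, could be scripted roughly as follows. This is only a sketch in Python: the helper names user_yaml_text and reproduce are made up for illustration, and it assumes a Linux system where rime_console has already been built in build/bin as in step 1.

```python
import subprocess
from pathlib import Path


def user_yaml_text(schema: str = "cangjie5") -> str:
    """Return the user.yaml content from step 2 (preselects a schema)."""
    return f"var:\n  previously_selected_schema: {schema}\n"


def reproduce(bin_dir: str = "build/bin", keys: str = "vmomr") -> str:
    """Write user.yaml into bin_dir, pipe `keys` to `rime_console -i`,
    and return whatever the console prints.

    Hypothetical helper: assumes ./rime_console exists in bin_dir.
    """
    workdir = Path(bin_dir)
    (workdir / "user.yaml").write_text(user_yaml_text(), encoding="utf-8")
    result = subprocess.run(
        ["./rime_console", "-i"],
        input=keys + "\n",      # same as: echo vmomr | ./rime_console -i
        cwd=workdir,            # same as: cd build/bin
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Calling reproduce() from the librime source directory should then print the candidate list shown in the report.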

Original comment by culu....@gmail.com on 14 May 2014 at 5:39

GoogleCodeExporter commented 9 years ago
I do not agree that 给 is a useful word, since Cangjie is generally conceived
as an IME for traditional Chinese. On the other hand, 組合 is not a partial
match, but a full matching phrase in the form of AABBB, as you wouldn't say:
十十人一弓 for 輸 is a partial match, the complete code being
十田十人一月中弓.
You should have noticed that not only word frequency, but also the phrases are
in traditional Chinese. You have to create a different dictionary for
simplified Chinese, which is not provided by the package.

Original comment by chen....@gmail.com on 14 May 2014 at 7:28

GoogleCodeExporter commented 9 years ago
> I do not agree that 给 is a useful word, since Cangjie is generally
> conceived as an IME for traditional Chinese. On the other hand, 組合
> is not a partial match, but a full matching phrase in the form of
> AABBB, as you wouldn't say: 十十人一弓 for 輸 is a partial match,
> the complete code being 十田十人一月中弓.

Thank you for this explanation. However, I don't find it quite
convincing. Although it started out as a traditional Chinese IME,
Cangjie5 later incorporated simplified characters, and can be used
as a simplified Chinese IME. [1][1.zh] (Your notion that Cangjie5
is more likely to be associated with traditional Chinese is reasonable.)

> You should have noticed that not only word frequency, but
> also the phrases are in traditional Chinese. You have to
> create a different dictionary for simplified Chinese, which
> is not provided by the package.

Now I understand that this is not a bug in librime but one
in brise, specifically in preset/cangjie5.dict.yaml.

preset/cangjie5.dict.yaml:
   19 ---
   20 name: "cangjie5"
   21 version: "0.18"
   22 sort: by_weight
   23 use_preset_vocabulary: true
   24 max_phrase_length: 7
   25 min_phrase_weight: 100
   26 columns:
   27   - text
   28   - code
   29   - stem
   30 encoder:
   31   exclude_patterns:
   32     - '^x.*$'
   33     - '^z.*$'
   34   rules:
   35     - length_equal: 2
   36       formula: "AaAzBaBbBz"
   37     - length_equal: 3
   38       formula: "AaAzBaBzCz"
   39     - length_in_range: [4, 10]
   40       formula: "AaBzCaYzZz"

I've tried but failed to find any evidence that could
support the AABBB phrase rule. I've searched through the
tutorial on chinesecj.com[2], the cited Cangjie5 manual[3],
and an updated version of this manual[4][4.web], cited by
Wikipedia[1.zh]. In addition, a friend from Hong Kong told
me that she would normally type the word "組合" as two separate
characters, "vfbm" and then "omr". Also, typing
"vmomr" on her computer gives "给".

Therefore, I suspect this is a bug in brise:preset/cangjie5.dict.yaml,
in the encoder/rules part, lines 34 to 40.

[1] http://en.wikipedia.org/wiki/Cangjie_input_method
[1.zh] http://zh.wikipedia.org/wiki/%E5%80%89%E9%A0%A1%E8%BC%B8%E5%85%A5%E6%B3%95
[2] http://chinesecj.com/newlearncj/
[3] http://www.cbflabs.com/down/show.php?id=28
[4] http://www.cbflabs.com/down/show.php?id=299
[4.web] http://www.cbflabs.com/book/ocj5/ocj5/index.html

Original comment by culu....@gmail.com on 14 May 2014 at 5:32

GoogleCodeExporter commented 9 years ago
It's a non-standard feature, not a bug.
See also:

http://www.chinesecj.com/forum/forum.php?mod=viewthread&tid=634

http://tieba.baidu.com/p/1028390846

You can disable phrases by removing the 'encoder' part.

Rime is designed to be highly configurable.
The preset schemata will definitely not satisfy everyone, but they illustrate
well most of the features the framework provides. Feel free to create your own
schema.

Original comment by chen....@gmail.com on 15 May 2014 at 2:24

GoogleCodeExporter commented 9 years ago
candidates

illustrate most features the framework provides. Feel free to create your
own schema.

Thank you. I created a modified version of cangjie5.dict.yaml in my user
directory, under a different filename, and added a translator/dictionary
entry in cangjie5.custom.yaml. It works. What I don't understand is why simply
changing translator/enable_encoder to false doesn't work. (Perhaps it's a
different feature under a similar name.)
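
For anyone trying the same workaround, the custom file can be generated with a few lines of Python. This is a sketch only: it assumes the usual top-level 'patch:' layout of Rime's *.custom.yaml files, the function name is invented, and "cangjie5_nophrase" is a made-up dictionary name standing in for whatever the modified copy is called.

```python
def custom_patch_text(dictionary: str) -> str:
    """Return the text of a schema .custom.yaml that points the
    translator at a different dictionary.

    Assumption: overrides live under a top-level 'patch:' key.
    """
    return f"patch:\n  translator/dictionary: {dictionary}\n"


# "cangjie5_nophrase" is a hypothetical name for the modified dictionary.
print(custom_patch_text("cangjie5_nophrase"))
```

The printed text would go into cangjie5.custom.yaml in the user directory.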

Also, I don't see the reason for using a non-standard extension as default,
although it demonstrates the strength of the "encoder" feature quite well.
Is it that high configurability renders default values no longer important?

Original comment by culu....@gmail.com on 16 May 2014 at 2:59

GoogleCodeExporter commented 9 years ago
(Sorry, the first three lines weren't quoted properly through email.)

Original comment by culu....@gmail.com on 16 May 2014 at 3:02

GoogleCodeExporter commented 9 years ago
[deleted comment]

GoogleCodeExporter commented 9 years ago
'translator/enable_encoder: true' enables the table_translator to make new
phrases dynamically based on user input. The pre-installed phrases are
introduced into the dictionary by setting 'use_preset_vocabulary: true' in
cangjie5.dict.yaml.

In the context of traditional Chinese, phrases rank lower than frequently used
characters, so they hardly break anything, and those who use phrases benefit
from less typing and greater speed.

Original comment by chen....@gmail.com on 16 May 2014 at 8:14