bksubhuti closed this issue 3 years ago
Maybe we can go even lower than 15. For example, there are ~140k entries whose Myanmar definition contains <= 5 words. Let me know what you think. And yes, one way to do it is to automatically mark untranslated words with > 6 definition words as difficult, but without using the CL; the CL is a separate thing.
words_count | entries_count |
---|---|
1 | 27137 |
2 | 44083 |
3 | 34548 |
4 | 19869 |
5 | 11110 |
6 | 6389 |
7 | 4273 |
8 | 2870 |
9 | 2126 |
10 | 1539 |
11 | 1286 |
12 | 1005 |
13 | 862 |
14 | 689 |
15 | 585 |
16 | 531 |
17 | 452 |
18 | 408 |
19 | 352 |
20 | 312 |
21 | 244 |
22 | 240 |
23 | 212 |
24 | 196 |
25 | 162 |
26 | 170 |
27 | 129 |
28 | 138 |
29 | 126 |
30 | 124 |
31 | 87 |
32 | 92 |
33 | 78 |
34 | 71 |
35 | 68 |
36 | 56 |
37 | 65 |
38 | 57 |
39 | 44 |
40 | 45 |
41 | 48 |
42 | 33 |
43 | 33 |
44 | 41 |
45 | 42 |
46 | 31 |
47 | 34 |
48 | 28 |
49 | 36 |
50 | 21 |
51 | 15 |
52 | 17 |
53 | 25 |
54 | 16 |
55 | 20 |
56 | 17 |
57 | 12 |
58 | 15 |
59 | 17 |
60 | 14 |
61 | 13 |
62 | 13 |
63 | 11 |
64 | 9 |
65 | 9 |
66 | 5 |
67 | 8 |
68 | 8 |
69 | 5 |
70 | 10 |
71 | 16 |
72 | 9 |
73 | 6 |
74 | 6 |
75 | 6 |
76 | 6 |
77 | 7 |
78 | 1 |
79 | 6 |
80 | 2 |
81 | 12 |
83 | 1 |
84 | 5 |
85 | 3 |
86 | 4 |
87 | 3 |
88 | 2 |
89 | 3 |
90 | 1 |
91 | 1 |
92 | 5 |
93 | 7 |
94 | 2 |
95 | 4 |
96 | 4 |
97 | 3 |
98 | 1 |
99 | 3 |
100 | 1 |
102 | 1 |
103 | 2 |
104 | 1 |
105 | 3 |
106 | 2 |
107 | 1 |
108 | 1 |
110 | 1 |
111 | 4 |
112 | 3 |
114 | 1 |
115 | 1 |
116 | 1 |
117 | 1 |
118 | 3 |
119 | 1 |
120 | 2 |
121 | 3 |
122 | 1 |
123 | 1 |
126 | 1 |
129 | 1 |
131 | 1 |
134 | 1 |
135 | 1 |
136 | 4 |
137 | 1 |
140 | 1 |
146 | 2 |
148 | 1 |
149 | 2 |
150 | 1 |
153 | 1 |
154 | 1 |
155 | 1 |
156 | 1 |
163 | 1 |
173 | 1 |
176 | 1 |
179 | 1 |
182 | 1 |
184 | 1 |
188 | 1 |
189 | 1 |
195 | 2 |
196 | 1 |
197 | 1 |
199 | 1 |
200 | 1 |
204 | 2 |
211 | 1 |
212 | 1 |
219 | 1 |
224 | 1 |
231 | 1 |
232 | 1 |
234 | 2 |
237 | 1 |
241 | 1 |
242 | 1 |
249 | 1 |
252 | 2 |
253 | 1 |
261 | 1 |
298 | 1 |
311 | 1 |
330 | 1 |
333 | 1 |
353 | 1 |
451 | 1 |
Well, I've just checked some of these "one word definitions", here's an example:
(က)မဖွဲ့ချည်မူ၍။(ခ)မချည်နှောင်-မပိတ်ဖုံး-မူ၍။(ဂ)မဖွဲ့-မမံ-မကျံ-မူ၍။
This doesn't seem like one word, so my way of counting the words in a definition is bad: I'm only splitting on single whitespace. Maybe I should split on `(` too. Please check with someone who speaks Myanmar.
Splitting on `(` as well produced this for the definitions with up to 5 words. There are still quite a few entries with <= 5 words:
words_count | entries_count |
---|---|
1 | 23018 |
2 | 39110 |
3 | 33580 |
4 | 20859 |
5 | 12342 |
But then there are entries like this:
ကြိမ်ဖန်များစွာမပြု-မလေ့လာ-မပွါးများ-အပ်သည်၏အဖြစ်။
so maybe I should split on `-` too?
Splitting on `-` as well produced this result: fewer entries, but still about 100k whose definitions contain <= 5 words:
words_count | entries_count |
---|---|
1 | 10635 |
2 | 22738 |
3 | 25448 |
4 | 22612 |
5 | 18854 |
6 | 14176 |
7 | 10200 |
8 | 7254 |
9 | 5272 |
10 | 4016 |
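For reference, here is a rough sketch of the kind of counting described above: split each definition on whitespace, optionally also on `(` and `-`, then tally how many entries fall into each word count. The function names `count_words` and `words_histogram` are made up for illustration and are not the script that produced the tables.

```python
import re
from collections import Counter


def count_words(definition: str, extra_delims: str = "") -> int:
    """Count tokens after splitting on whitespace plus any extra delimiter characters."""
    # Build a character class such as [\s\(\-]+ ; re.escape keeps "(" and "-" literal.
    pattern = r"[\s" + re.escape(extra_delims) + r"]+"
    # Drop empty strings produced by leading or trailing delimiters.
    return len([token for token in re.split(pattern, definition) if token])


def words_histogram(definitions, extra_delims: str = "") -> Counter:
    """Map words_count -> entries_count for an iterable of definition strings."""
    return Counter(count_words(d, extra_delims) for d in definitions)


if __name__ == "__main__":
    # Placeholder ASCII strings standing in for real Myanmar definitions.
    sample = ["(a)foo(b)bar-baz qux", "foo bar", "foo"]
    for delims in ("", "(", "(-"):
        hist = words_histogram(sample, delims)
        print(repr(delims), sorted(hist.items()))
```

Each extra delimiter only moves entries toward higher counts. Myanmar text does not put a space between every word, so all of these counts are rough proxies rather than true word counts.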
So the longest Myanmar definition in that list of ~100k entries with <= 5 words is:
အဖြစ်ပြောင်းရွှေ့ခြင်းသို့ရောက်ခြင်း၊ ဖြစ်စဉ်ပြောင်းရွှေ့ခြင်းသို့ ရောက်ခြင်း၊ သဘောအထူးပြောင်းခြင်းသို့ ရောက်ခြင်း။
which is this one:
https://pm12e.pali.tools/word/31662
whereas the longest (in terms of characters) 25-word definition is this one:
https://pm12e.pali.tools/word/152623
အာဏာစက်၊(က)မည်သည့်လူတစ်ဦးတစ်ယောက်မျှ ဆီးတားကန့်ကွက်နိုင်ခြင်း မရှိ မိမိဖြစ်လိုရာဖြစ်စေနိုင်သော(အဆီးအတားမရှိ ချာချာလည်ပတ်နေသာ စက်ဝန်းနှင့်တူသော)ဘုရင့်အမိန့်အာဏာ။(ခ)မည်သည့် လူနတ်ဗြဟ္မာ တစ်ဦးတစ်ယောက်မျှ မကန့်ကွက် မပယ်ဖျက်နိုင်သော(အဆီးအတား မရှိ ချာချာလည်ပတ်နေသော စက်ဝန်းနှင့်တူသော)မြတ်စွာဘုရား အမိန့်အာဏာတော်၊ ပဌမပါရာဇိကစသော သိက္ခာပုဒ် ဥပဒေတော်များနှင့် ရှောင်ကြဉ်ရန် ကျင့်သုံးရန် မိန့်မြွက်တော်မူအပ်သော ဒေသနာတော်များ။
I will close for now. We could go with < 8 for now, but I'm happy.
Only easy words for editors.
Basically, anything that has fewer than 15 words in the Myanmar definition is easy. We might reduce that.
The Google-translated English has garbage in it, so we cannot judge by that.
How to do it easily: we could mark all entries over 25 words as a "long" category in an update query. It looks like Myanmar words are very long and there are only a few of them per definition, so we would need to go by character length instead, say 100 characters (similar to the length of the line above this one).
Do we have a category for this, or do we use CL codes (like difficult and middle)? We can just add a CL code for that if needed (are negative numbers allowed?).
Then assign the long group to the experts.
Odds are they cannot translate the long ones. If they can, it does not matter; we give those to the experts anyway. Later, if we finish the simple words, we can recycle the long words with the lay people and they can try them.
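A minimal sketch of the update query idea, assuming an SQLite database with a hypothetical `entries` table that has `definition_my` and `category` columns. The real schema, and whether this should be a category or a CL code, is still open above, so every name here is a placeholder.

```python
import sqlite3

# Threshold suggested in the comment above; adjust as needed.
MAX_EASY_CHARS = 100


def mark_long_definitions(conn: sqlite3.Connection, max_chars: int = MAX_EASY_CHARS) -> int:
    """Tag entries whose Myanmar definition exceeds max_chars characters as 'long'.

    Table and column names (entries, definition_my, category) are hypothetical.
    Returns the number of rows updated.
    """
    cur = conn.execute(
        "UPDATE entries SET category = 'long' WHERE length(definition_my) > ?",
        (max_chars,),
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    # In-memory demo with placeholder data instead of the real dictionary.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (id INTEGER PRIMARY KEY, definition_my TEXT, category TEXT)"
    )
    conn.executemany(
        "INSERT INTO entries (definition_my) VALUES (?)",
        [("x" * 40,), ("y" * 250,)],
    )
    print(mark_long_definitions(conn), "entries marked as long")
```

SQLite's `length()` counts characters for text values, and Myanmar script uses many combining marks, so 100 characters will look shorter on screen than 100 Latin letters. The threshold probably needs tuning against real entries.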