dhammacakka / pm12e

easy words for editors #38

Closed bksubhuti closed 3 years ago

bksubhuti commented 3 years ago

Give the editors only easy words.
Basically, any entry with fewer than 15 words in its Myanmar definition counts as easy. We might lower that threshold.

The English from Google has Google garbage in it, so we cannot judge by that.

How to do this easily: we could mark every entry with more than 25 words as a "long" category in an update query. But it looks like Myanmar words are very long and a definition contains only a few of them, so we would need to go by character length instead, say 100 characters (about the length of the line above this one).

Do we have a category for this, or do we use CL codes (like difficult and middle)? We can just add a CL code for it if needed. (Are negative numbers allowed?)

Then assign the long group to the experts.

Odds are the editors cannot translate the long ones. If they can, it does not matter; we give those to the experts anyway. Later, if we finish the simple words, we can recycle the long words to the lay people and let them try.
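
The update-query idea above could look roughly like the following. This is only a sketch using Python's built-in `sqlite3`; the table name `entries`, the columns `myanmar_def` and `category`, and the in-memory database are all assumptions for illustration, not the real pm12e schema. SQLite's `length()` counts characters rather than bytes for text values.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions, not pm12e's own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, myanmar_def TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO entries (id, myanmar_def) VALUES (?, ?)",
    [
        (1, "x" * 40),   # short definition
        (2, "x" * 150),  # long definition
        (3, "x" * 100),  # exactly at the threshold, so not marked
    ],
)

MAX_EASY_CHARS = 100  # the 100-character cutoff floated above


def mark_long(conn, max_chars=MAX_EASY_CHARS):
    """Tag entries whose definition is longer than max_chars as 'long'."""
    cur = conn.execute(
        "UPDATE entries SET category = 'long' WHERE length(myanmar_def) > ?",
        (max_chars,),
    )
    conn.commit()
    return cur.rowcount  # number of rows updated


marked = mark_long(conn)
long_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM entries WHERE category = 'long' ORDER BY id"
    )
]
```

Keeping the cutoff as a parameter makes it cheap to rerun with a different threshold once someone who reads Myanmar has looked at a sample.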

sumbodhi commented 3 years ago

Maybe we can go even lower than 15; for example, there are ~140k entries whose Myanmar definition contains <= 5 words. Let me know what you think. And yes, one way to do it is to automatically mark untranslated words with > 6 definition words as difficult, but without using the CL; the CL is a different thing.

words_count entries_count
1 27137
2 44083
3 34548
4 19869
5 11110
6 6389
7 4273
8 2870
9 2126
10 1539
11 1286
12 1005
13 862
14 689
15 585
16 531
17 452
18 408
19 352
20 312
21 244
22 240
23 212
24 196
25 162
26 170
27 129
28 138
29 126
30 124
31 87
32 92
33 78
34 71
35 68
36 56
37 65
38 57
39 44
40 45
41 48
42 33
43 33
44 41
45 42
46 31
47 34
48 28
49 36
50 21
51 15
52 17
53 25
54 16
55 20
56 17
57 12
58 15
59 17
60 14
61 13
62 13
63 11
64 9
65 9
66 5
67 8
68 8
69 5
70 10
71 16
72 9
73 6
74 6
75 6
76 6
77 7
78 1
79 6
80 2
81 12
83 1
84 5
85 3
86 4
87 3
88 2
89 3
90 1
91 1
92 5
93 7
94 2
95 4
96 4
97 3
98 1
99 3
100 1
102 1
103 2
104 1
105 3
106 2
107 1
108 1
110 1
111 4
112 3
114 1
115 1
116 1
117 1
118 3
119 1
120 2
121 3
122 1
123 1
126 1
129 1
131 1
134 1
135 1
136 4
137 1
140 1
146 2
148 1
149 2
150 1
153 1
154 1
155 1
156 1
163 1
173 1
176 1
179 1
182 1
184 1
188 1
189 1
195 2
196 1
197 1
199 1
200 1
204 2
211 1
212 1
219 1
224 1
231 1
232 1
234 2
237 1
241 1
242 1
249 1
252 2
253 1
261 1
298 1
311 1
330 1
333 1
353 1
451 1
sumbodhi commented 3 years ago

Well, I've just checked some of these "one-word definitions". Here's an example: (က)မဖွဲ့ချည်မူ၍။(ခ)မချည်နှောင်-မပိတ်ဖုံး-မူ၍။(ဂ)မဖွဲ့-မမံ-မကျံ-မူ၍။ This doesn't look like one word, so my way of counting the words in a definition is flawed: I'm only splitting on single whitespace. Maybe I should split on ( too; please check with someone who speaks Myanmar.
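
To make the counting problem concrete, here is a minimal sketch of a word counter with configurable delimiters. The function name `count_words` and its `extra_delimiters` parameter are made up for illustration; this is not the code that produced the tables in this thread.

```python
import re


def count_words(definition, extra_delimiters=""):
    """Count tokens after splitting on whitespace plus any extra delimiter characters."""
    # Build a character class such as [\s\(] from the extra delimiters.
    pattern = "[\\s" + re.escape(extra_delimiters) + "]+"
    # Drop empty strings produced by leading or trailing delimiters.
    return len([tok for tok in re.split(pattern, definition) if tok])


sample = "(က)မဖွဲ့ချည်မူ၍။(ခ)မချည်နှောင်-မပိတ်ဖုံး-မူ၍။(ဂ)မဖွဲ့-မမံ-မကျံ-မူ၍။"

whitespace_only = count_words(sample)   # 1: the string contains no whitespace
with_paren = count_words(sample, "(")   # 3: one token per (က), (ခ), (ဂ) sense
```

Whether these tokens correspond to real Myanmar words is a separate question that still needs a Myanmar speaker.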

sumbodhi commented 3 years ago

Splitting on ( produced this for the definitions with up to 5 words; there are still quite a few entries whose definitions have <= 5 words:

words_count entries_count
1 23018
2 39110
3 33580
4 20859
5 12342
sumbodhi commented 3 years ago

But then there are entries like this: ကြိမ်ဖန်များစွာမပြု-မလေ့လာ-မပွါးများ-အပ်သည်၏အဖြစ်။ so maybe I should split by - too?

sumbodhi commented 3 years ago

Splitting on - as well produced this result: fewer entries at the low end, but still about 100k entries whose definitions contain <= 5 words:

words_count entries_count
1 10635
2 22738
3 25448
4 22612
5 18854
6 14176
7 10200
8 7254
9 5272
10 4016
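
As a quick check of how many entries fall under a given cutoff, the counts in the table above can be summed. This is a small sketch; the name `entries_up_to` is made up for illustration, and the numbers are copied from the table.

```python
# words_count -> entries_count, copied from the table above.
histogram = {
    1: 10635, 2: 22738, 3: 25448, 4: 22612, 5: 18854,
    6: 14176, 7: 10200, 8: 7254, 9: 5272, 10: 4016,
}


def entries_up_to(histogram, max_words):
    """Total number of entries whose definition has at most max_words words."""
    return sum(count for words, count in histogram.items() if words <= max_words)


print(entries_up_to(histogram, 5))  # 100287
print(entries_up_to(histogram, 7))  # 124663
```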
sumbodhi commented 3 years ago

So the longest Myanmar definition among those ~100k entries whose definitions have <= 5 words is: အဖြစ်ပြောင်းရွှေ့ခြင်းသို့ရောက်ခြင်း၊ ဖြစ်စဉ်ပြောင်းရွှေ့ခြင်းသို့ ရောက်ခြင်း၊ သဘောအထူးပြောင်းခြင်းသို့ ရောက်ခြင်း။ which is this one: https://pm12e.pali.tools/word/31662

sumbodhi commented 3 years ago

whereas the longest (in terms of characters) 25 words def is this one: https://pm12e.pali.tools/word/152623 အာဏာစက်၊(က)မည်သည့်လူတစ်ဦးတစ်ယောက်မျှ ဆီးတားကန့်ကွက်နိုင်ခြင်း မရှိ မိမိဖြစ်လိုရာဖြစ်စေနိုင်သော(အဆီးအတားမရှိ ချာချာလည်ပတ်နေသာ စက်ဝန်းနှင့်တူသော)ဘုရင့်အမိန့်အာဏာ။(ခ)မည်သည့် လူနတ်ဗြဟ္မာ တစ်ဦးတစ်ယောက်မျှ မကန့်ကွက် မပယ်ဖျက်နိုင်သော(အဆီးအတား မရှိ ချာချာလည်ပတ်နေသော စက်ဝန်းနှင့်တူသော)မြတ်စွာဘုရား အမိန့်အာဏာတော်၊ ပဌမပါရာဇိကစသော သိက္ခာပုဒ် ဥပဒေတော်များနှင့် ရှောင်ကြဉ်ရန် ကျင့်သုံးရန် မိန့်မြွက်တော်မူအပ်သော ဒေသနာတော်များ။
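
For reference, finding the longest definition within a word-count bucket could be sketched as below. Everything here is an assumption for illustration: the `(entry_id, definition)` tuple layout, the toy ASCII data, and the whitespace-only `count_words` helper. Python's `len()` counts Unicode code points, and Myanmar script uses combining marks, so the number is larger than what a reader would perceive as characters.

```python
def count_words(definition):
    """Whitespace-only token count (an assumption; other delimiters are ignored)."""
    return len(definition.split())


def longest_with_word_count(entries, word_count):
    """Return the (entry_id, definition) pair with the longest definition
    among entries having exactly word_count tokens, or None if there is none."""
    bucket = [(eid, text) for eid, text in entries if count_words(text) == word_count]
    # Length is measured in Unicode code points via len().
    return max(bucket, key=lambda item: len(item[1]), default=None)


# Toy data, not real pm12e entries.
entries = [(1, "aa bb"), (2, "aaaa bbbb"), (3, "a b c")]
longest_two_word = longest_with_word_count(entries, 2)
```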

bksubhuti commented 3 years ago

I will close this for now. We could go with < 8 for now, but I'm happy.