UAlbertaALTLab / crk-db

Managing the Plains Cree dictionary database
https://itwewina.altlab.app/
GNU General Public License v3.0
0 stars 3 forks source link

Standardize function words and capitalization in MD and AECD to match conventions followed in CW #121

Open aarppe opened 3 days ago

aarppe commented 3 days ago

(From #99) We would want to implement some form of standardization of the ALTLab versions of the definitions from MD and AECD. For instance, removing the initial article, i.e. a, an, the, and lower-casing initial pronouns (e.g. S/he -> s/he in AECD , perhaps even generalizing the masculine He in MD to s/he as in CW. We might do an initial pass of this programmatically, and always keeping the original definition for reference, but then having a standardized version for public consumption in itwêwina.

aarppe commented 3 days ago

It turns out that there are 490 initial word types in the English definitions, which in reverse frequency order are enumerated below. This suggests that we can standardize a large part of the English definitions automatically, and the possible proper names that need to remain capitalized can be reviewed manually.

cat crk/dicts/AECD_crk2eng_ALTLab.tsv| gawk -F"\t" 'NR>=2 { def=$7; split(def, w, " "); n[w[1]]++; ntot++; } END { PROCINFO["sorted_in"]="@val_num_desc"; for(i in n) { cumul+=n[i]; printf "%4.1f\t%4.1f\t%5i\t%s\n", cumul*100/ntot, n[i]*100/ntot,  n[i], i; } }'       

54.8    54.8     5738   S/he
68.1    13.4     1399   A
78.7    10.5     1104   The
85.8     7.2      752   It
87.7     1.9      196   An
89.3     1.6      170   Being
90.1     0.8       82   Her/his
90.7     0.6       59   They
91.2     0.5       52   There
91.7     0.5       52   My
92.2     0.5       51   Something
92.6     0.4       45   Having
92.9     0.3       32   In
93.2     0.3       29   One
93.4     0.2       22   He
93.5     0.1       15   She
93.7     0.1       14   Someone
93.8     0.1       12   All
93.9     0.1       11   What
94.0     0.1        9   On
94.0     0.1        7   Your
94.1     0.1        7   Two
94.1     0.1        6   Three
94.2     0.1        6   Last
94.3     0.1        6   Four
94.3     0.0        5   Along
94.3     0.0        4   Too
94.4     0.0        4   To
94.4     0.0        4   That
94.5     0.0        4   Right
94.5     0.0        4   Our
94.5     0.0        4   December;
94.6     0.0        4   Close
94.6     0.0        4   As
94.7     0.0        4   
94.7     0.0        3   Used
94.7     0.0        3   This
94.7     0.0        3   Their
94.8     0.0        3   Sweet
94.8     0.0        3   Second
94.8     0.0        3   People
94.9     0.0        3   Once
94.9     0.0        3   Oak
94.9     0.0        3   Numerous
94.9     0.0        3   Not
95.0     0.0        3   Never.
95.0     0.0        3   Medicine
95.0     0.0        3   Just
95.1     0.0        3   June;
95.1     0.0        3   Ice
95.1     0.0        3   His/her
95.1     0.0        3   Coffee.
95.2     0.0        3   Birch
95.2     0.0        3   Before
95.2     0.0        3   At
95.3     0.0        3   Across
95.3     0.0        2   it
95.3     0.0        2   Wrapping
95.3     0.0        2   With
95.3     0.0        2   When
95.4     0.0        2   Usually.
95.4     0.0        2   Tree
95.4     0.0        2   Those
95.4     0.0        2   Thinking
95.4     0.0        2   Ten
95.4     0.0        2   Taking
95.5     0.0        2   Syllabic
95.5     0.0        2   Sugar.
95.5     0.0        2   Somebody
95.5     0.0        2   Six
95.5     0.0        2   Sacred
95.6     0.0        2   Referring
95.6     0.0        2   Question
95.6     0.0        2   Pubic
95.6     0.0        2   Persuasion.
95.6     0.0        2   Pepper.
95.7     0.0        2   Peanut
95.7     0.0        2   Over
95.7     0.0        2   Out
95.7     0.0        2   Okay
95.7     0.0        2   Nothing
95.8     0.0        2   Next
95.8     0.0        2   More;
95.8     0.0        2   May;
95.8     0.0        2   Material
95.8     0.0        2   Many
95.8     0.0        2   Making
95.9     0.0        2   Just.
95.9     0.0        2   Its
95.9     0.0        2   If
95.9     0.0        2   Honey.
95.9     0.0        2   Her
96.0     0.0        2   From
96.0     0.0        2   Freshly
96.0     0.0        2   Foolish
96.0     0.0        2   Following
96.0     0.0        2   Five
96.1     0.0        2   Fish
96.1     0.0        2   Everything
96.1     0.0        2   Eleven
96.1     0.0        2   Eight
96.1     0.0        2   Delight;
96.2     0.0        2   Daylight
96.2     0.0        2   Dampness;
96.2     0.0        2   Concrete;
96.2     0.0        2   Cheese;
96.2     0.0        2   Carefully;
96.2     0.0        2   Canned
96.3     0.0        2   Brothers
96.3     0.0        2   Both
96.3     0.0        2   Bone
96.3     0.0        2   Behind;
96.3     0.0        2   August;
96.4     0.0        2   Ashcan.
96.4     0.0        2   April;
96.4     0.0        2   And
96.4     0.0        2   Always;
96.4     0.0        2   Already.
96.5     0.0        2   Acting
96.5     0.0        1   s/he
96.5     0.0        1   an
96.5     0.0        1   a
96.5     0.0        1   Yesterday.
96.5     0.0        1   Yellow
96.5     0.0        1   Wrong;
96.5     0.0        1   Wood
96.5     0.0        1   Women
96.5     0.0        1   With.
96.6     0.0        1   Wine;
96.6     0.0        1   Wild
96.6     0.0        1   Who.
96.6     0.0        1   White
96.6     0.0        1   Whiskers;
96.6     0.0        1   Wheat
96.6     0.0        1   West
96.6     0.0        1   We,
96.6     0.0        1   Way
96.6     0.0        1   Water;
96.6     0.0        1   Water
96.7     0.0        1   Very;
96.7     0.0        1   Venerable,
96.7     0.0        1   Urine;
96.7     0.0        1   Upland.
96.7     0.0        1   Up
96.7     0.0        1   Uninhabited
96.7     0.0        1   Unfortunate
96.7     0.0        1   Under
96.7     0.0        1   Two.
96.7     0.0        1   Twice,
96.8     0.0        1   Twenty.
96.8     0.0        1   Twenty
96.8     0.0        1   Twelve.
96.8     0.0        1   Twelve
96.8     0.0        1   Trust.
96.8     0.0        1   Trapping.
96.8     0.0        1   Translucency.
96.8     0.0        1   Towards
96.8     0.0        1   Tomorrow.
96.8     0.0        1   Today.
96.9     0.0        1   Tobacco
96.9     0.0        1   Time
96.9     0.0        1   Three.
96.9     0.0        1   Thousand;
96.9     0.0        1   Those.
96.9     0.0        1   Thirty.
96.9     0.0        1   Thirty
96.9     0.0        1   Thirteen.
96.9     0.0        1   Thirteen
96.9     0.0        1   Things
96.9     0.0        1   Thigh
97.0     0.0        1   These
97.0     0.0        1   Then;
97.0     0.0        1   Them.
97.0     0.0        1   Thanksgiving
97.0     0.0        1   Ten.
97.0     0.0        1   Tea.
97.0     0.0        1   Sweetened
97.0     0.0        1   Suspended
97.0     0.0        1   Suppertime.
97.0     0.0        1   Summertime.
97.1     0.0        1   Suddenly;
97.1     0.0        1   Strong
97.1     0.0        1   Strings
97.1     0.0        1   Strength.
97.1     0.0        1   Strength
97.1     0.0        1   Sticky
97.1     0.0        1   State
97.1     0.0        1   Stagnant
97.1     0.0        1   Spruce
97.1     0.0        1   Spongy
97.1     0.0        1   Spoilage;
97.2     0.0        1   Spittle;
97.2     0.0        1   South;
97.2     0.0        1   Soothingly
97.2     0.0        1   Soon.
97.2     0.0        1   Somewhere
97.2     0.0        1   Sometime
97.2     0.0        1   Someone.
97.2     0.0        1   Some
97.2     0.0        1   So.
97.2     0.0        1   Slow
97.3     0.0        1   Slimy
97.3     0.0        1   Skeptical.
97.3     0.0        1   Sixty.
97.3     0.0        1   Sixty
97.3     0.0        1   Sixteen.
97.3     0.0        1   Sixteen
97.3     0.0        1   Six.
97.3     0.0        1   Sisters
97.3     0.0        1   Sister-in-law.
97.3     0.0        1   Sheep
97.3     0.0        1   She/he,
97.4     0.0        1   Several.
97.4     0.0        1   Several
97.4     0.0        1   September;
97.4     0.0        1   Semen.
97.4     0.0        1   Self-forgiveness,
97.4     0.0        1   Self-abandonment.
97.4     0.0        1   Seaweed.
97.4     0.0        1   Savannah
97.4     0.0        1   Sandhill.
97.4     0.0        1   Sand.
97.5     0.0        1   Salty
97.5     0.0        1   Salt.
97.5     0.0        1   Rough
97.5     0.0        1   Rotten
97.5     0.0        1   Rites
97.5     0.0        1   Respected;
97.5     0.0        1   Rendered
97.5     0.0        1   Remission
97.5     0.0        1   Red
97.5     0.0        1   Rebellion.
97.5     0.0        1   Reading
97.6     0.0        1   Rainwater.
97.6     0.0        1   Ragweed;
97.6     0.0        1   Protective
97.6     0.0        1   Produce
97.6     0.0        1   Procedures.
97.6     0.0        1   Pretty
97.6     0.0        1   Prairie;
97.6     0.0        1   Pine
97.6     0.0        1   Pickings.
97.6     0.0        1   Periodic;
97.7     0.0        1   Perhaps,
97.7     0.0        1   Pemmican
97.7     0.0        1   Parkinson's
97.7     0.0        1   Paper.
97.7     0.0        1   Paper
97.7     0.0        1   Paint.
97.7     0.0        1   Overeating
97.7     0.0        1   Overdoing.
97.7     0.0        1   Outside.
97.7     0.0        1   Or;
97.7     0.0        1   Or
97.8     0.0        1   Open
97.8     0.0        1   Onto,
97.8     0.0        1   Only
97.8     0.0        1   One_hundred
97.8     0.0        1   One.
97.8     0.0        1   Once,
97.8     0.0        1   Ointment.
97.8     0.0        1   Oftentimes.
97.8     0.0        1   Off
97.8     0.0        1   Numerous;
97.9     0.0        1   Numerous.
97.9     0.0        1   No.
97.9     0.0        1   No
97.9     0.0        1   Nine.
97.9     0.0        1   Nil;
97.9     0.0        1   Nightlighting.
97.9     0.0        1   Newly;
97.9     0.0        1   New
97.9     0.0        1   Near.
97.9     0.0        1   Nakedness.
97.9     0.0        1   Muskeg.
98.0     0.0        1   Muskeg
98.0     0.0        1   Mucus.
98.0     0.0        1   Much;
98.0     0.0        1   Mouse
98.0     0.0        1   Moose
98.0     0.0        1   Million;
98.0     0.0        1   Melancholic,
98.0     0.0        1   Meat.
98.0     0.0        1   Measles.
98.0     0.0        1   Me;
98.1     0.0        1   Me
98.1     0.0        1   Maybe;
98.1     0.0        1   March;
98.1     0.0        1   Maple
98.1     0.0        1   Malediction
98.1     0.0        1   Mainly;
98.1     0.0        1   Lye.
98.1     0.0        1   Lunch.
98.1     0.0        1   Long
98.1     0.0        1   Liturgical.
98.1     0.0        1   Lightning;
98.2     0.0        1   Let
98.2     0.0        1   Leprosy;
98.2     0.0        1   Left,
98.2     0.0        1   Leaning
98.2     0.0        1   Leading
98.2     0.0        1   Lately.
98.2     0.0        1   Kind
98.2     0.0        1   July;
98.2     0.0        1   Jealousy.
98.2     0.0        1   Jackpine
98.3     0.0        1   Isolated
98.3     0.0        1   Iron
98.3     0.0        1   Intuition;
98.3     0.0        1   Interior
98.3     0.0        1   Intercourse;
98.3     0.0        1   Inspiration
98.3     0.0        1   Inside
98.3     0.0        1   Ink.
98.3     0.0        1   Imitation
98.3     0.0        1   Illumination;
98.3     0.0        1   Ice;
98.4     0.0        1   I'm
98.4     0.0        1   Hyperactive.
98.4     0.0        1   Hunting
98.4     0.0        1   Humanity.
98.4     0.0        1   Horseflesh.
98.4     0.0        1   Hoping
98.4     0.0        1   Hope;
98.4     0.0        1   Honey
98.4     0.0        1   Homemade
98.4     0.0        1   Holy
98.5     0.0        1   His
98.5     0.0        1   Hills
98.5     0.0        1   Here.
98.5     0.0        1   Herbs.
98.5     0.0        1   Heaven.
98.5     0.0        1   Hay.
98.5     0.0        1   Hardened
98.5     0.0        1   Hard
98.5     0.0        1   Halloween.
98.5     0.0        1   Half.
98.5     0.0        1   Half
98.6     0.0        1   Gunpowder.
98.6     0.0        1   Guesswork.
98.6     0.0        1   Great-grandchild;
98.6     0.0        1   Grasshead.
98.6     0.0        1   Gradually.
98.6     0.0        1   Gold
98.6     0.0        1   Goes
98.6     0.0        1   God's
98.6     0.0        1   God
98.6     0.0        1   Glass;
98.7     0.0        1   Furthermore;
98.7     0.0        1   Fresh
98.7     0.0        1   Frequently.
98.7     0.0        1   Fourteen.
98.7     0.0        1   Fourteen
98.7     0.0        1   Four.
98.7     0.0        1   Forty.
98.7     0.0        1   Forty
98.7     0.0        1   Forgiveness.
98.7     0.0        1   Food.
98.7     0.0        1   Flour.
98.8     0.0        1   Five.
98.8     0.0        1   First;
98.8     0.0        1   Fighting
98.8     0.0        1   Fifty.
98.8     0.0        1   Fifty
98.8     0.0        1   Fifteen.
98.8     0.0        1   Fifteen
98.8     0.0        1   Feverish,
98.8     0.0        1   February;
98.8     0.0        1   Fat.
98.9     0.0        1   Farthermost.
98.9     0.0        1   Far,
98.9     0.0        1   Far
98.9     0.0        1   False
98.9     0.0        1   Exploding
98.9     0.0        1   Expired
98.9     0.0        1   Excrement.
98.9     0.0        1   Exclamation
98.9     0.0        1   Excessiveness;
98.9     0.0        1   Exaltation;
99.0     0.0        1   Evildoing.
99.0     0.0        1   Everywhere.
99.0     0.0        1   Everyone
99.0     0.0        1   Even
99.0     0.0        1   Especially.
99.0     0.0        1   Eradication
99.0     0.0        1   Equipment;
99.0     0.0        1   Epsom
99.0     0.0        1   Enlarged
99.0     0.0        1   Enjoyment
99.0     0.0        1   Encircling
99.1     0.0        1   Eleven.
99.1     0.0        1   Eighty.
99.1     0.0        1   Eighty
99.1     0.0        1   Eighteen.
99.1     0.0        1   Eighteen
99.1     0.0        1   Eight.
99.1     0.0        1   Easter.
99.1     0.0        1   East,
99.1     0.0        1   Earwax.
99.1     0.0        1   Early.
99.2     0.0        1   Each
99.2     0.0        1   During
99.2     0.0        1   Dung.
99.2     0.0        1   Duck
99.2     0.0        1   Dried
99.2     0.0        1   Downstream.
99.2     0.0        1   Down
99.2     0.0        1   Don't.
99.2     0.0        1   Disobedience.
99.2     0.0        1   Discarded
99.2     0.0        1   Diarrhea.
99.3     0.0        1   Despite,
99.3     0.0        1   Denying;
99.3     0.0        1   Delusion;
99.3     0.0        1   Dehydration.
99.3     0.0        1   Dead
99.3     0.0        1   Currently.
99.3     0.0        1   Crusty
99.3     0.0        1   Cream;
99.3     0.0        1   Craziness;
99.3     0.0        1   Countenance;
99.4     0.0        1   Contents;
99.4     0.0        1   Considered
99.4     0.0        1   Conscious
99.4     0.0        1   Congratulation;
99.4     0.0        1   Concentration.
99.4     0.0        1   Compassion,
99.4     0.0        1   Communion
99.4     0.0        1   Cloth;
99.4     0.0        1   Cloth
99.4     0.0        1   Closer
99.4     0.0        1   Clearly,
99.5     0.0        1   Clear
99.5     0.0        1   Christmas
99.5     0.0        1   Christian
99.5     0.0        1   Cement.
99.5     0.0        1   Cedar.
99.5     0.0        1   Catholic
99.5     0.0        1   Calico,
99.5     0.0        1   Cache.
99.5     0.0        1   But.
99.5     0.0        1   Burned
99.6     0.0        1   Built
99.6     0.0        1   Brome
99.6     0.0        1   Boastfulness;
99.6     0.0        1   Blue
99.6     0.0        1   Blinders.
99.6     0.0        1   Below
99.6     0.0        1   Before;
99.6     0.0        1   Before,
99.6     0.0        1   Beef.
99.6     0.0        1   Bee.
99.6     0.0        1   Bedclothes.
99.7     0.0        1   Beaver.
99.7     0.0        1   Beaver
99.7     0.0        1   Barren
99.7     0.0        1   Barded
99.7     0.0        1   Bannock.
99.7     0.0        1   Balsam
99.7     0.0        1   Baking
99.7     0.0        1   Bad
99.7     0.0        1   Back
99.7     0.0        1   Assets;
99.8     0.0        1   Ashes.
99.8     0.0        1   Ascension
99.8     0.0        1   Artificial
99.8     0.0        1   Appropriation,
99.8     0.0        1   Any
99.8     0.0        1   Antler;
99.8     0.0        1   Anterior;
99.8     0.0        1   Annexation;
99.8     0.0        1   Anger
99.8     0.0        1   Among
99.8     0.0        1   Altogether
99.9     0.0        1   Although.
99.9     0.0        1   Alternately,
99.9     0.0        1   Also;
99.9     0.0        1   Almost;
99.9     0.0        1   All;
99.9     0.0        1   Alabaster.
99.9     0.0        1   Ahead
99.9     0.0        1   Again,
99.9     0.0        1   After;
99.9     0.0        1   Adjacency.
100.0    0.0        1   Act
100.0    0.0        1   Achievement.
100.0    0.0        1   About
100.0    0.0        1   Aboriginal
100.0    0.0        1   Abode.
100.0    0.0        1   Abase.
aarppe commented 3 days ago

The following script gives the initial words also when splitting definitions into their constituent senses:

cat crk/dicts/AECD_crk2eng_ALTLab.tsv| gawk -F"\t" 'NR>=2 { def=$7; n=split(def, sense, ";[ ]*"); for(i=1; i<=n; i++) { split(sense[i], w, " "); N[w[1]]++; ntot++; } } END { PROCINFO["sorted_in"]="@val_num_desc"; for(i in N) { cumul+=N[i]; printf "%4.1f\t%4.1f\t%5i\t%s\n", cumul*100/ntot, N[i]*100/ntot, N[i], i; } }' | less