dhowe / RiTaV1

RiTa: the generative language toolkit
http://rednoise.org/rita
GNU General Public License v3.0

Updates to RiTa dictionary #366

Closed dhowe closed 6 years ago

dhowe commented 7 years ago
  1. Plurals
    • Remove all 'nns' tags from words that also have 'nn'
    • If word has 'nns' and not 'nn', then convert it to 'nn' (change word/phonemes/etc).
  2. Verb Base Forms
    • If verb exists as vb* without 'vb', then add entry for 'vb' (new word/phonemes/etc).
    • If a verb is tagged only with vb*, manually check for an adjective or other mistake
    • When finished, every verb should have a 'vb'
  3. Discuss/consider deleting all vb*
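
The plural rules in step 1 can be sketched roughly as follows. This is a minimal illustration, assuming the dictionary is held as a Map from word to an array of tags; `cleanPluralTags` and the sample entries are hypothetical and are not part of RiTa.

```javascript
// Assumed shape: a Map from word to an array of POS tags.
// This is a stand-in for illustration, not RiTa's actual dictionary format.
function cleanPluralTags(dict) {
  const pluralOnly = [];
  for (const [word, tags] of dict) {
    if (!tags.includes('nns')) continue;
    if (tags.includes('nn')) {
      // The word is already tagged 'nn', so drop the redundant 'nns' tag.
      dict.set(word, tags.filter(tag => tag !== 'nns'));
    } else {
      // Plural-only entry: converting it means a new word form and new
      // phonemes, so it is only collected here for manual follow-up.
      pluralOnly.push(word);
    }
  }
  return pluralOnly;
}

const pluralDemo = new Map([
  ['darts', ['nn', 'nns']],
  ['looks', ['vbz', 'nns', 'nn']],
  ['teeth', ['nns']],
  ['run', ['vb', 'nn']],
]);
const pluralOnlyWords = cleanPluralTags(pluralDemo);
console.log(pluralDemo.get('looks'), pluralOnlyWords);
```

Words returned in `pluralOnlyWords` are the ones that would still need a singular form and phonemes supplied by hand.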

cqx931 commented 7 years ago

1.1 All the entries that have both 'nns' & 'nn'. Total: 145

52 of them end with 's' and have a corresponding 'nn' in the dict. I think we should delete these. If they contain other tags, those look like mistakes to me.

darts nn nns, looks vbz nns nn, pants nns nn, narcotics nns nn, teens nns nn, gains nns nn vbz, books nns nn pos, fares nns nn vbz, fiberglass nns nn, pickings nns nn, likes vbz nns nn, premises nns nn, links nns vbz nn, closes vbz nns nn, means vbz nns nn, shares nns nn vbz, scours nns nn, audits nn nns, dislikes vbz nn nns, commissions nns vbz nn, meters nns nn, stands vbz nns nn, savings nns nn, bureaus nn nns, receivables nn nns, dilemmas nn nns, communications nns nn, barracks nn nns, fines nns nn, cuts nns vbz nn, drachmas nns nn, calls vbz nns nn, ounces nns nn, draughts nn nns, oats nns nn, outfits nn nns, offers vbz nns nn, logos nns nn, contracts nns nn vbz, inches nns nn vbz, deliveries nns nn


Nouns always in plural form -> delete 'nns':

annals nns nn, suds nns nn, doldrums nns nn, series nn nns, physics nn nns, whereabouts nn nns, innards nns nn, riches nns nn, gallows nn nns, avionics nns nn, waterworks nn nns, mathematics nns nn, headquarters nn nns, miniseries nns nn, species nn nns, sunglasses nn nns, metaphysics nns nn, crossroads nns nn, chassis nn nns, personnel nns jj nn, feet nns nn, buffalo nn nns, french jj nn nns, trivia nns nn, data nns nn, chili nn nns, aircraft nn nns, elite nn nns jj, people nns nn, fish nn vb nns, cannon nn nns, deer nn nns, clergy nn nns, salami nns nn, salespeople nn nns, many jj dt nn rb vb nns pdt, dice nns nn, palazzi nns nn, folk nn nns, rabbi nn nns, magnolia nn nns, piranha nn nns, flora nns nn, offspring nn nns, shellfish nn nns, fauna nns nn, salmon nn nns, macaroni nns nn, antelope nn nns, sheep nn nns, police nn vb nns, rich jj nns nn, alkali nns nn, paraphernalia nns nn, agenda nn nns, cattle nns nn, elk nns nn, trout nn nns, spaghetti nns nn, intelligentsia nn nns, caribou nn nns, wart nn nns, hog nn nns, tsunami nn nns, grapefruit nn nns, won vbd nn nns vbn, news nn nns, piles nns nn vbz, billiards nn nns, earnings nns nn

yen nns nn, yuan nn nns, kronor nns nn, lira nn nns

microelectronics nns nn, electronics nns nn, aerobics nn nns, ethics nns nn, economics nns nn, politics nns nn, graphics nns nn, hydraulics nns nn, telecommunications nns jj nn


These should be singularized and tagged 'nn', since no corresponding 'nn' entry can be found:

sweepstakes nn nns, schoolchildren nn vb nns


Only-NNS ones: just delete them and add them to the irregular plural list? Or handle them some other way?

media nns nn, millennia nn nns, consortia nns nn, pence nn nns


Delete these:

c nn ls nnp nns, i nn nnp nns, l nn nns


Not sure how to handle these:

jitters nns nn, hoss nn nns, do vbp nn vb nns vbz, clientele nn nns, rand nn nns

young jj nn nns, muni nn jj nns, few jj nn rb nns, blind jj nns nn vb, invalid jj nn nns, due jj in nn rb nns, ill jj nns nn rb

dhowe commented 7 years ago

jitters nn vbz, hoss nn, do vb, clientele nn, rand: delete

young jj, muni: delete, few jj, blind jj vb, invalid jj nn, due jj nn, ill jj rb

cqx931 commented 7 years ago

And these irregular plurals? Delete them and only keep the 'nn'? Or keep them?

media nns nn, millennia nn nns, consortia nns nn, concerti nns, consortia nns nn, teeth nns, minutiae nns, septa nns, millennia nn nns, swine nns, termini nns, cognoscenti nns, larvae nns, vertebrae nns, memorabilia nns, media nns nn, pepperoni nns, alveoli nns, timpani nns, lice nns, multimedia nns, sera nns, maria nns

dhowe commented 7 years ago

Good question -- of course we need to be able to correctly tag these ... See #368

cqx931 commented 7 years ago

Here is the list of words that need to be added to the dictionary; please check. Total: 633

abduct abet abridge accost accredit accustom acidify addle adjoin adjudge admonish adulterate aggrandize aggrieve ail allude amalgamate amputate analyse annotate annualize annul appalle apprise arraign assort astonish attune backdate backpedal backslap backtrack badger bandy baptize bask bawl bead befit befriend befuddle beguile behead belabor beleaguer bely berate bespeak bewilder bewitch bicker billow blacken blare blindside boggle botch braid brainwash brandish brutalize bungle burgeon burnish burp calcify calibrate calve capitulate captivate catalyze chasten chastise cheapen chime chirp choreograph chortle clamber clank clasp clench cloy coagulate cobble codify collateralize collide colonize commandeer commingle condescend congest consort constrict convolute corrugate countersue countervail creak croon crucify crumple cuddle customize daub daunt dawdle deactivate deafen debase debilitate decaffeinate decamp decant decease decertify decimate declaim decommission defame deform dehumanize dehydrate demean dement demote denationalize denominate denuclearize denude depreciate derange desecrate destine dethrone dilapidate disassemble disavow disburse disclaim discolor disconcert disenchant disenfranchise disfigure disgruntle dishearten disillusion disincline disjoint disorganize disorient disown displease dispossess dissatisfy dissect distill dither divvy doff domineer doze dredge drool drub dub dupe elate electrify elucidate elude emaciate emanate embattle embed embitter emblazon embolden emboss embroil encamp encase enchant enclose encode encrust endear engross enmesh enrapture ensconce enshrine enslave ensnarl entangle enthrall entomb entwine enumerate enunciate err estrange evince exasperate excommunicate excoriate exhale exhilarate expound expunge extenuate exterminate extoll extrude exult feign fertilize festoon fetter fictionalize fidget filch fillet finalize firebomb fireproof fissure fixate fizzle flabbergast flagellate fledge fluster foist frazzle fritter 
fume garble gentrify glisten glower gnash goof gravitate grieve grimace gussy halogenate handcraft handpick harangue harrow haw headquarter hearten heckle hew hijack homogenize hoodwink horrify hospitalize huff humiliate hurtle hydrolyze hyphenate hypnotize idealize illumine imbed imbue immerse immigrate immobilize immortalize immunize impale impeach impel impend impoverish impute inactivate inaugurate incapacitate incriminate incubate individualize indoctrinate induct inflect inhale inscribe institutionalize inter interdict interlace interlink interlock intermix interpolate interrelate intersperse intertwine interweave intone introvert inundate inure invert ionize irradiate jag jiggle jilt jimmy jive joust juxtapose knead lactate laminate laud leaven liken lionize localize loiter loll lubricate madden magnetize malign maltreat mangle manhandle manicure marginalize marinate meander mechanize memorialize mesmerize mete miniaturize misallocate misapply misappropriate misbehave miscalculate mischarge misconstrue misguide mishandle misinform misjudge mislead mismanage misperceive misprice misquote misspell misstate moisten molt moor moralize motorize mottle muffle multitask mummify mutilate mystify nag narrate naturalize nauseate nestle nettle nick nix obligate obsess obtrude ogle opine oppress orient oscillate ostracize outdate outdistance outgain outmode outscore outstretch overarch overbill overcook overcrowd overdo overextend overfund overplay overpower overprice overrate overregulate oversimplify overspend overstep oversubscribe overtax overvalue overwork pander panelize parch parlay parse pasteurize peeve pend perplex personify perturb petrify philander pilfer pillory plagiarize plop plunk poach pockmark postmark prance prearrange precook predate predicate preen prefabricate prejudge preordain prepackage prerecord pressurize presuppose profiteer promulgate proofread prophesy propound prorate proselytize protract pulsate punctuate puncture purloin purr quiver 
radicalize rankle rant rarefy raze reactivate readmit reappoint reawaken rebalance recalculate recant reclassify recline recondition reconvene recuse redecorate redistrict redline reelect reemerge refit reformulate refrigerate reincarnate reincorporate reinstitute reintegrate rejigger relegate remand reminisce remit remodel reoffer repaint replant reprimand repulse requisition reroute restage restyle resurface resurge retool retrench rev revalue revere revile rework ricochet rivet rubberize ruminate rustle sadden sanitize satirize scald scandalize scavenge schmooze scorch scowl sculpt seclude sensitize serialize shoehorn shoo shortchange sicken situate slay sled sleepwalk slither sneeze snicker snooze snore spatter splatter sprain squirm stagnate stomp stonewall stow straggle strangle stratify stun stylize suffuse sulk superimpose swindle swish synchronize tantalize tatter teem teeter televise temporize tether thieve throb tickle ting tingle tinkle titillate totter transfuse transpire transpose traumatize trounce truncate tweak twig twirl unbalance underfund underprice underrate underreport undersell underuse underutilize undervalue unfurl unhinge unionize unnerve unsettle untie upholster upstage urbanize vandalize vaunt vend venerate ventilate victimize vilify waggle wangle warble wheeze whet whirr whisk whizz womanize wow wrangle writhe zero zigzag

dhowe commented 7 years ago

these are new 'vb' entries?

cqx931 commented 7 years ago

yes

cqx931 commented 7 years ago

I have just checked the latest CMU dictionary (cmudict-0.7b) and found 600 of the words in it. The remaining 33 words I can check manually. As for the POS tag, these words should all have 'vb' as a tag, but in some cases the word is also an 'nn'. Would it be enough to just add them all as 'vb' for now?

dhowe commented 7 years ago

Do you mean that they are listed in CMU as nn or nns as well? And they don't exist at all in our dict?

cqx931 commented 7 years ago

600 words in this list already exist in cmudict-0.7b. If I check the CMU dict we have in RiTaJS (cmudict-0.6), there are 10 fewer, which is not a big difference. The CMU dict only contains the pronunciation, right? In the DictFromCMU file I can see it combining cmuPhones and ritaPos to generate a new entry.

The 'nn' issue should only apply to a few cases; most of the words are just 'vb'.
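
The lookup-and-combine step described above might look roughly like this. It is a sketch under assumptions: `parseCmu` and `buildEntries` are hypothetical helpers (not the DictFromCMU code), the entry shape `{ phones, pos }` is invented for illustration, the sample lines only imitate the cmudict layout, and phonemes are kept as-is with no conversion to RiTa's own notation.

```javascript
// Parse cmudict-style lines ("WORD  PH1 PH2 ...") into a Map of word -> phonemes.
function parseCmu(text) {
  const phones = new Map();
  for (const line of text.split('\n')) {
    if (!line.trim() || line.startsWith(';;;')) continue; // blanks and comments
    const [head, ...rest] = line.trim().split(/\s+/);
    const word = head.toLowerCase();
    if (/\(\d+\)$/.test(word)) continue; // skip alternates such as "WORD(1)"
    phones.set(word, rest.join(' '));
  }
  return phones;
}

// Split a word list into ready-made entries and words needing a manual check.
function buildEntries(words, cmuPhones, extraPos = new Map()) {
  const entries = new Map();
  const missing = [];
  for (const word of words) {
    if (!cmuPhones.has(word)) {
      missing.push(word);
      continue;
    }
    // Every word gets 'vb'; extraPos carries the few extra tags such as 'nn'.
    const tags = ['vb', ...(extraPos.get(word) || [])];
    entries.set(word, { phones: cmuPhones.get(word), pos: tags.join(' ') });
  }
  return { entries, missing };
}

// Illustrative lines in cmudict layout, not an excerpt to rely on.
const cmuSample = [
  ';;; comment line',
  'ABDUCT  AE0 B D AH1 K T',
  'BRAID  B R EY1 D',
].join('\n');

const cmuResult = buildEntries(
  ['abduct', 'braid', 'rejigger'],
  parseCmu(cmuSample),
  new Map([['braid', ['nn']]])
);
console.log(cmuResult.entries, cmuResult.missing);
```

Words that end up in `missing` correspond to the ones that would be checked by hand.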

dhowe commented 7 years ago

Ok, so can we make sure the 'nn' tags end up in our dict as well?

cqx931 commented 7 years ago

I can do that as the next step, when I check the 'nns' entries that were not deleted because they don't have a corresponding 'nn' in the dictionary.

dhowe commented 7 years ago

👍

cqx931 commented 7 years ago

So at this point, every vb* should have a 'vb' in the dictionary. A few thoughts about deleting vb*, for discussion:

1. getVerbBaseForm(): this will be used later in the tagger as well as in the deletion process.
-> get the stem from the Porter stemmer
-> for a vb that ends with "ent", "ion", "er", "cate", "ize", just get rid of the vb* ending (other cases will be ignored and left in the dictionary)

2. Delete Entry
-> if it is a vb*-only entry
-> if we can find its getVerbBaseForm() in the dictionary with the tag "vb"
-> delete the entry

3. Delete Tag
-> if an entry contains vb* as well as other tags
-> if we can find its getVerbBaseForm() in the dictionary with the tag "vb"
-> delete the vb* tag

4. Implement the tagger: if a word can't be found in the dictionary && getVerbBaseForm(word) is in the dictionary, add the corresponding tag to choices[] according to the ending.

Question: for a word base that is both vb and nn, are we going to tag base+"s" as "nns" and include "vbz" in the choices?

5. Fix other bugs in tests after deletion...
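
One possible shape for the base-form lookup in point 1 is sketched below. Note that it swaps the Porter stemmer for a few plain suffix rewrites that are then validated against the dictionary, so it is a different technique from the one proposed; `verbBaseForm` and the Map-based dictionary are hypothetical.

```javascript
// Hypothetical stand-in for getVerbBaseForm(): generate candidate base forms
// by undoing common inflections, then accept the first one tagged 'vb'.
function verbBaseForm(word, dict) {
  const isBase = w => (dict.get(w) || []).includes('vb');
  const candidates = [];
  if (word.endsWith('ies') || word.endsWith('ied')) {
    candidates.push(word.slice(0, -3) + 'y'); // carries, carried -> carry
  }
  if (word.endsWith('es')) candidates.push(word.slice(0, -2)); // botches -> botch
  if (word.endsWith('s')) candidates.push(word.slice(0, -1)); // chirps -> chirp
  if (word.endsWith('ed')) {
    candidates.push(word.slice(0, -2)); // chirped -> chirp
    candidates.push(word.slice(0, -1)); // baptized -> baptize
  }
  if (word.endsWith('ing')) {
    candidates.push(word.slice(0, -3)); // chirping -> chirp
    candidates.push(word.slice(0, -3) + 'e'); // baptizing -> baptize
  }
  // Doubled final consonant: stunned, stunning -> stun
  const doubled = word.match(/^(.*?)([a-z])\2(ed|ing)$/);
  if (doubled) candidates.push(doubled[1] + doubled[2]);
  return candidates.find(isBase) || null; // null: leave the word in the dictionary
}

const baseDemo = new Map([
  ['chirp', ['vb']],
  ['baptize', ['vb']],
  ['stun', ['vb']],
  ['carry', ['vb']],
  ['close', ['vb', 'nn']],
]);
for (const w of ['chirping', 'baptized', 'stunned', 'carries', 'closes', 'table']) {
  console.log(w, '->', verbBaseForm(w, baseDemo));
}
```

Returning `null` for anything that cannot be resolved matches the idea that unhandled cases are simply left in the dictionary.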

dhowe commented 7 years ago

Good summary. The first question, before we go down this road, is how many words can be deleted. Do you have an exact number? We should only undertake this work if the number is high enough (>20%). Comments follow...

> 1. getVerbBaseForm(): this will be used later in the tagger as well as in the deletion process. -> get the stem from the Porter stemmer -> for a vb that ends with "ent", "ion", "er", "cate", "ize", just get rid of the vb* ending (other cases will be ignored and left in the dictionary)

If it doesn't exist already, we need a function somewhere called getVerbBaseForm(verbStr), and then a full set of tests for it.

> 2. Delete Entry -> if it is a vb*-only entry -> if we can find its getVerbBaseForm() in the dictionary with the tag "vb" -> delete the entry

good

> 3. Delete Tag -> if an entry contains vb* as well as other tags -> if we can find its getVerbBaseForm() in the dictionary with the tag "vb" -> delete the vb* tag

good

> 4. Implement the tagger: if a word can't be found in the dictionary && getVerbBaseForm(word) is in the dictionary, add the corresponding tag to choices[] according to the ending. Question: for a word base that is both vb and nn, are we going to tag base+"s" as "nns" and include "vbz" in the choices?

sounds right
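
The tagger fallback from point 4 could be sketched as below. Everything here is an assumption made for illustration: `guessTags`, the Map-based dictionary, the hard-coded base-form lookup standing in for getVerbBaseForm(), and the mapping from endings to the Penn-style tags vbg, vbd/vbn and vbz. The order of the returned choices carries no meaning.

```javascript
// Suggest tags for a word that is missing from the dictionary, based on the
// tags of its base form and on the word's ending.
function guessTags(word, dict, baseFormOf) {
  if (dict.has(word)) return dict.get(word); // known word: use its listed tags
  const base = baseFormOf(word);
  const baseTags = base ? dict.get(base) : undefined;
  if (!baseTags || !baseTags.includes('vb')) return [];
  const choices = [];
  if (word.endsWith('ing')) {
    choices.push('vbg');
  } else if (word.endsWith('ed')) {
    choices.push('vbd', 'vbn');
  } else if (word.endsWith('s')) {
    choices.push('vbz');
    // Base is also a noun, so base + "s" may be a plural noun as well.
    if (baseTags.includes('nn')) choices.push('nns');
  }
  return choices;
}

const tagDemo = new Map([
  ['offer', ['vb', 'nn']],
  ['chirp', ['vb']],
]);
// Hypothetical lookup standing in for getVerbBaseForm().
const tagBases = new Map([
  ['offers', 'offer'],
  ['chirped', 'chirp'],
  ['chirping', 'chirp'],
]);
const tagBaseFormOf = w => tagBases.get(w) || null;

for (const w of ['offers', 'chirped', 'chirping', 'offer', 'unknownword']) {
  console.log(w, guessTags(w, tagDemo, tagBaseFormOf));
}
```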

cqx931 commented 7 years ago

With the current getVerbBaseForm() I have, I can delete 5012 words, roughly 18%. After that we should be able to get the file size under 1MB.

2523 words could have their vb* tag deleted.
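
Counts like these could come from a dry run along the following lines. This is only a sketch of the "Delete Entry" and "Delete Tag" passes: `pruneInflectedVerbs`, the Map-based dictionary and the hard-coded base-form lookup are hypothetical, and the sample data is far too small to say anything about the real percentages.

```javascript
// Apply both deletion passes in place and report how many words were affected.
function pruneInflectedVerbs(dict, baseFormOf) {
  const isInflected = tag => tag.startsWith('vb') && tag !== 'vb';
  const total = dict.size;
  let entriesDeleted = 0;
  let tagsDeleted = 0;
  for (const [word, tags] of [...dict]) { // iterate over a snapshot
    if (!tags.some(isInflected)) continue;
    const base = baseFormOf(word);
    const baseTags = base ? dict.get(base) : undefined;
    // Keep the word if its base form cannot be found with a 'vb' tag.
    if (!baseTags || !baseTags.includes('vb')) continue;
    const rest = tags.filter(tag => !isInflected(tag));
    if (rest.length === 0) {
      dict.delete(word); // vb*-only entry: delete the whole entry
      entriesDeleted++;
    } else {
      dict.set(word, rest); // mixed entry: delete only the vb* tags
      tagsDeleted++;
    }
  }
  return { entriesDeleted, tagsDeleted, percentDeleted: (100 * entriesDeleted) / total };
}

const pruneDemo = new Map([
  ['chirp', ['vb']],
  ['chirped', ['vbd', 'vbn']],
  ['cut', ['vb', 'nn']],
  ['cuts', ['vbz', 'nn']],
  ['gone', ['vbn']],
]);
// Hypothetical lookup standing in for getVerbBaseForm().
const pruneBases = new Map([['chirped', 'chirp'], ['cuts', 'cut']]);
const pruneStats = pruneInflectedVerbs(pruneDemo, w => pruneBases.get(w) || null);
console.log(pruneStats, [...pruneDemo.keys()]);
```

Here `tagsDeleted` counts words that lost their vb* tags, not individual tags, to match the way the figures above are reported.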

dhowe commented 7 years ago

Ok, let's consider this while switching focus, for the moment, to AdLiPo and ChinaEye

dhowe commented 7 years ago

Note that by using msgpack, we can reduce the size (before removing any verbs) to 930k. We should do some testing on this...

dhowe commented 7 years ago

@cqx931 status on this ?

cqx931 commented 7 years ago

The dictionary was at the stage where we were ready to delete the words that have only vb* tags and are not 'vb' themselves. But the original intention was to reduce the size of the dictionary to less than 1MB, so if we can already achieve that with msgpack, deleting the vb* entries might not be needed any more.

Shall I do some msgpack testing first?

dhowe commented 7 years ago

Let's put msgpack on hold for now, and instead run our tests on a tmp version of the dictionary with these verbs deleted, and see a) what file size we get, and b) what breaks
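
A rough way to get a first answer to (a) and (b) before running the real test suite is sketched below for Node.js. The assumptions are significant: the dictionary is a Map of word to tags, size is measured on a JSON serialization (not RiTa's actual file format), and the filter drops every vb*-only entry without checking that a base form exists. All function names are hypothetical.

```javascript
const isInflectedVerbTag = tag => tag.startsWith('vb') && tag !== 'vb';

// Temporary copy of the dictionary without vb*-only entries.
function withoutInflectedOnly(dict) {
  return new Map([...dict].filter(([, tags]) => !tags.every(isInflectedVerbTag)));
}

// Size in bytes of the dictionary serialized as JSON (assumed format,
// useful only for a before/after comparison).
function jsonBytes(dict) {
  return Buffer.byteLength(JSON.stringify(Object.fromEntries(dict)), 'utf8');
}

// Words from a probe list that can no longer be looked up.
function brokenLookups(dict, probeWords) {
  return probeWords.filter(word => !dict.has(word));
}

const fullDict = new Map([
  ['chirp', ['vb']],
  ['chirped', ['vbd', 'vbn']],
  ['cuts', ['vbz', 'nn']],
]);
const tmpDict = withoutInflectedOnly(fullDict);
console.log(jsonBytes(fullDict), jsonBytes(tmpDict));
console.log(brokenLookups(tmpDict, ['chirp', 'chirped', 'cuts']));
```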

dhowe commented 7 years ago

dict-novb version on hold -- let's reconsider for v2.0