dhowe / rita

Website, documentation and examples for RiTa
https://rednoise.org/rita

Incorrect similar words for words with 'vbz' pos #177

Open dhowe opened 2 years ago

dhowe commented 2 years ago

For example, many incorrect verb forms in list for 'spreads':

    let word = 'spreads', pos = 'vbz';

    let rhymes = RiTa.rhymes(word, { pos });
    let sounds = RiTa.soundsLike(word, { pos });
    let spells = RiTa.spellsLike(word, { pos });

KarlieZhao commented 2 years ago

I think this problem appears because some words in the dict have incorrect pos tags, for example:

"computerized":["k-ah-m p-y-uw1 t-er ay-z-d","jj nn vb vbn"],
"discriminated":["d-ih s-k-r-ih1 m-ah n-ey t-ah-d","vbd jj nn vb"],
"expected":["ih-k s-p-eh1-k t-ah-d","vbn vbd jj vb"]

Words like 'computerized' are treated as base-form verbs because their pos tags include 'vb'. So in this case, where the target pos is 'vbz', the conjugator directly returns 'computerizeds'. The easiest fix might be to just correct these words' pos tags in the dict?
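
The failure mode can be reproduced in miniature. The sketch below is not RiTa's conjugator: `hasTag` and `naiveThirdPerson` are hypothetical names, and the inflection rule is deliberately reduced to appending 's'. It only illustrates how a stray 'vb' tag lets an already-inflected form pass a base-form check, using the 'computerized' entry quoted above and a corrected variant without 'vb' and 'nn'.

```javascript
// Entry as quoted in this thread (wrongly carries 'vb').
const buggyDict = {
  computerized: ["k-ah-m p-y-uw1 t-er ay-z-d", "jj nn vb vbn"],
};

// Same entry with the stray 'vb' and 'nn' tags removed.
const fixedDict = {
  computerized: ["k-ah-m p-y-uw1 t-er ay-z-d", "jj vbn"],
};

// True if `word` is in `dict` and its pos string contains `tag`.
function hasTag(dict, word, tag) {
  const entry = dict[word];
  return Boolean(entry) && entry[1].split(" ").includes(tag);
}

// Simplified stand-in for a conjugator (an assumption, not RiTa code):
// anything tagged 'vb' is treated as a base form and gets an 's' appended;
// anything else is rejected with null.
function naiveThirdPerson(dict, word) {
  return hasTag(dict, word, "vb") ? word + "s" : null;
}

console.log(naiveThirdPerson(buggyDict, "computerized")); // "computerizeds"
console.log(naiveThirdPerson(fixedDict, "computerized")); // null
```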

KarlieZhao commented 2 years ago

https://github.com/dhowe/rita/issues/179 might be for the same reason

dhowe commented 2 years ago

Good catch -- I wonder if we might be able to remove all the 'vbn' from the dictionary, since we can compute them from the base form.

dhowe commented 2 years ago

So we have done this before with verb tenses (see earlier tickets from @cqx931 below). Once we find a pos that we want to remove from the dict, we need to find all the places in the code that would need updating to deal with that pos (soundsLike, spellsLike, search, pos, conjugate, hasWord, tag, etc.), then add tests (which will fail), then add the code to handle these cases, then remove the words with a script... then re-run the tests and adjust until they pass...

See: https://github.com/dhowe/RiTaV1/issues/536 https://github.com/dhowe/RiTaV1/issues/366 https://github.com/dhowe/RiTaV1/issues/357 https://github.com/dhowe/RiTaV1/issues/365

https://github.com/dhowe/RiTaJSv1/pull/37 https://github.com/dhowe/rita/issues/80
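
The "remove the words with a script" step might look roughly like the sketch below. This is an assumption about the shape of such a script, not the one used in the earlier tickets: `stripTag` is a hypothetical helper, and the sample holds two entries quoted in this thread plus one made-up placeholder to show an entry being dropped.

```javascript
// Remove `tag` from every entry's pos string.
// Entries left with no tags at all are dropped from the result.
// Returns a new dictionary plus the lists of changed and dropped words.
function stripTag(dict, tag) {
  const result = {};
  const changed = [];
  const dropped = [];
  for (const [word, [phones, posStr]] of Object.entries(dict)) {
    const tags = posStr.split(" ");
    const kept = tags.filter((t) => t !== tag);
    if (kept.length === tags.length) {
      result[word] = [phones, posStr]; // tag not present: keep as-is
    } else if (kept.length === 0) {
      dropped.push(word); // tag was the only one: remove the word
    } else {
      result[word] = [phones, kept.join(" ")];
      changed.push(word);
    }
  }
  return { result, changed, dropped };
}

const sampleDict = {
  enter: ["eh1-n t-er", "vb vbn vbp"],
  pay: ["p-ey1", "vb vbd vbp nn"],
  madeupword: ["", "vbn"], // placeholder entry, not from the real dict
};

const { result, changed, dropped } = stripTag(sampleDict, "vbn");
console.log(result.enter[1]); // "vb vbp"
console.log(changed); // [ 'enter' ]
console.log(dropped); // [ 'madeupword' ]
```

A real pass would presumably need to exempt irregular forms that cannot be computed from the base form, and the tests described above would be what catches any entry removed by mistake.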

KarlieZhao commented 2 years ago

So here's a list of verbs with incorrect pos in the current dict; the pos tags I think need to be removed (-) or added (+) are in the comment on each line:

"beat": ["b-iy1-t", "vb jj nn vbd vbn vbp"], //-vbn
"become": ["b-ih k-ah1-m", "vb vbd vbn vbp"], //-vbd
"bit": ["b-ih1-t", "nn vbd vbn jj rb vb"], //-vb, -vbn
"bore": ["b-ao1-r", "vbd vbp jj nn vb"], //-vbd
"broke": ["b-r-ow1-k", "vbd vbn jj rb vb"], //-vb, -vbn
"build": ["b-ih1-l-d", "vb vbn vbp nn"], //-vbn
"called": ["k-ao1-l-d", "vbn vbd vb"], //-vb
"come": ["k-ah1-m", "vb vbd vbn vbp vbz jj"], //-vbd, -vbz
"committed": ["k-ah m-ih1 t-ah-d", "vbn jj vb vbd"], //-vb
"computerized": ["k-ah-m p-y-uw1 t-er ay-z-d", "jj nn vb vbn"], //-vb, -nn
"concerned": ["k-ah-n s-er1-n-d", "vbn jj vb vbd"], //-vb
"discriminated": ["d-ih s-k-r-ih1 m-ah n-ey t-ah-d", "vbd jj nn vb"], //-vb, -nn
"ended": ["eh1-n d-ah-d", "vbd jj vb vbn"], //-vb
"enter": ["eh1-n t-er", "vb vbn vbp"], //-vbn
"expected": ["ih-k s-p-eh1-k t-ah-d", "vbn vbd jj vb"], //-vb
"finished": ["f-ih1 n-ih-sh-t", "vbd jj vb vbn"], //-vb
"gained": ["g-ey1-n-d", "vbd vbn vb"], //-vb
"got": ["g-aa1-t", "vbd vbn vbp vb"], //-vb, -vbn
"have": ["hh-ae1-v", "vbp jj nn vb vbn"], //-vbn
"include": ["ih-n k-l-uw1-d", "vbp vbn vb"], //-vbn
"increased": ["ih-n k-r-iy1-s-t", "vbn jj vb vbd"], //-vb
"involved": ["ih-n v-aa1-l-v-d", "vbn vbd jj vb"], //-vb
"knit": ["n-ih1-t", "vbn jj nn vb"], //+vbd
"launched": ["l-ao1-n-ch-t", "vbn vbd vb"], //-vb
"lead": ["l-eh1-d", "vb vbn vbp jj nn"], //-vbn
"led": ["l-eh1-d", "vbn vbd vb"], //-vb
"lived": ["l-ay1-v-d", "vbd vbn vb"], //-vb
"outpaced": ["aw1-t p-ey-s-t", "vbd nn vb vbn vbp"], //-vb
"oversaw": ["ow1 v-er s-ao", "vbd vb"], //-vb
"oversold": ["ow1 v-er s-ow1-l-d", "vbn jj vb"], //-vb
"own": ["ow1-n", "jj vbn vbp vb"], //-vbn
"paled": ["p-ey1-l-d", "vbd vb vbn"], //-vb
"pay": ["p-ey1", "vb vbd vbp nn"], //-vbd
"plan": ["p-l-ae1-n", "nn vb vbn vbp"], //-vbn
"post": ["p-ow1-s-t", "nn in jj vb vbd vbp"], //-vbd
"prepaid": ["p-r-iy p-ey1-d", "jj vbn vb"], //-vb
"pressured": ["p-r-eh1 sh-er-d", "vbn jj nn vb vbd"], //-vb
"proliferated": ["p-r-ah l-ih1 f-er ey t-ih-d", "vbn vb vbd"], //-vb
"remade": ["r-iy m-ey1-d", "vbn nn vb"], //-vb, +vbd
"rent": ["r-eh1-n-t", "nn vb vbn vbp"], //-vbn
"reopened": ["r-iy ow1 p-ah-n-d", "vbd vbn vb"], //-vb
"reported": ["r-iy p-ao1-r t-ah-d", "vbd jj vb vbn vbp"], //-vb
"repurchase": ["r-iy p-er1 ch-ah-s", "nn vbd vbn jj vb"], //-vbd, -vbn
"resold": ["r-iy s-ow1-l-d", "vbn vbd vbp vb"], //-vb
"roast": ["r-ow1-s-t", "nn vb vbn"], //-vbn
"settled": ["s-eh1 t-ah-l-d", "vbd vbn jj vb"], //-vb
"spit": ["s-p-ih1-t", "vb nn vbd"], //+vbn
"started": ["s-t-aa1-r t-ah-d", "vbd jj vbn vb"], //-vb
"sublet": ["s-ah1 b-l-eh-t", "vb vbn"], //+vbd
"trouble": ["t-r-ah1 b-ah-l", "nn vbd vbp jj vb"], //-vbd
"wed": ["w-eh1-d", "vbn vb"], //+vbd
"were": ["w-er", "vbd vb"], //-vb
"weren't": ["w-er-ah-n-t", "vbd vb"], //-vb
"wet": ["w-eh1-t", "jj nn vbd vb vbp"], //+vbn

I suggest that the first step is to remove the 'vb' tags from words that are not in base form, which should fix the problem in this ticket. Then we can consider removing those verbs with only vb* tags and no other tags, as suggested in https://github.com/dhowe/RiTaV1/issues/357
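
The per-word corrections in the list above could be applied mechanically. A minimal sketch, assuming the `-tag` / `+tag` comment notation is parsed as written; `applyTagEdits` is a hypothetical helper, not part of RiTa:

```javascript
// Apply a comma-separated list of edits such as "-vb, +vbd" to a pos string.
// "-tag" removes the tag if present; "+tag" appends it if absent.
function applyTagEdits(posStr, edits) {
  const tags = posStr.split(" ");
  const ops = edits.split(",").map((e) => e.trim()).filter(Boolean);
  for (const edit of ops) {
    const op = edit[0];
    const tag = edit.slice(1);
    if (op === "-") {
      const i = tags.indexOf(tag);
      if (i !== -1) tags.splice(i, 1);
    } else if (op === "+") {
      if (!tags.includes(tag)) tags.push(tag);
    } else {
      throw new Error(`Unrecognized edit: ${edit}`);
    }
  }
  return tags.join(" ");
}

// Entries and edits taken from the list above.
console.log(applyTagEdits("jj nn vb vbn", "-vb, -nn")); // "jj vbn"       (computerized)
console.log(applyTagEdits("vbn nn vb", "-vb, +vbd"));   // "vbn nn vbd"   (remade)
console.log(applyTagEdits("vbn vb", "+vbd"));           // "vbn vb vbd"   (wed)
```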

For step 1, below are the corresponding tests to be added, taking 'concern' ('concerned') as an example:

//hasWord
expect(RiTa.hasWord("concerned")).to.be.true;
expect(RiTa.hasWord("concerneds")).to.be.false;
expect(RiTa.hasWord("concerneded")).to.be.false;

//pos
eql(RiTa.pos("concerned"), ["vbd"]);
eql(RiTa.pos("concerned", { simple: 1 }), ["v"]);

//search
expect(RiTa.search({ pos: "vb", limit: -1 }).includes("concerned")).to.be.false;
expect(RiTa.search({ pos: "vbn", limit: -1 }).includes("concerned")).to.be.true;
expect(RiTa.search('concern', { pos: "vbd", limit: -1 })).eql(['concerned']);
expect(RiTa.search('concern', { pos: "vbn", limit: -1 })).eql(['concerned']);

//conjugate
let opt = {
        number: RiTa.SINGULAR,
        person: RiTa.FIRST,
        tense: RiTa.PAST
};
expect(RiTa.conjugate("concern", opt)).eq("concerned");

//unconjugate
expect(RiTa.conjugator.unconjugate("concerned")).eq("concern");

//allTags
expect(RiTa.tagger.allTags("concerned")).eql(['vbd','jj','vbn']);

//tag
eq(RiTa.tagger.tag(["I", "am", "concerned", "about","this", "."], { inline: true }), "I/prp am/vbp concerned/jj about/in this/dt .");

//soundsLike
expect(RiTa.soundsLike("concern", { pos: 'vb' }).includes("concerned")).to.be.false;

//spellsLike
expect(RiTa.spellsLike("concern", { pos: 'vb' }).includes("concerned")).to.be.false;

Please let me know if any part of the list or the tests has problems.

dhowe commented 2 years ago

This looks really good -- I think the ultimate goal is to only have 'vb' for each of the regular verbs (plus all needed forms for irregular verbs) and compute all the other forms when needed... But this is a great first step -- do you want to do a PR in ritajs to start?

KarlieZhao commented 2 years ago

Yes, I'll make the tests pass and create a PR.

dhowe commented 2 years ago

great -- also needs to handle:

RiTa.analyze('concerned')
RiTa.analyze('concerns')

dhowe commented 2 years ago

@KarlieZhao status ?

KarlieZhao commented 2 years ago

> @KarlieZhao status ?

The issue in this ticket should now be fixed. However, I think we can go ahead and try to remove the words with only vb* tags from the lexicon...
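
As a starting point for that plan, the candidate words could be listed with something like the sketch below. `verbOnlyWords` is a hypothetical helper, and treating every tag that starts with 'vb' as a verb tag is an assumption; the sample entries are copied from the list earlier in this thread.

```javascript
// Return the words whose pos tags are all verb tags (vb, vbd, vbn, vbp, ...).
function verbOnlyWords(dict) {
  return Object.keys(dict).filter((word) =>
    dict[word][1].split(" ").every((tag) => tag.startsWith("vb"))
  );
}

const sample = {
  called: ["k-ao1-l-d", "vbn vbd vb"],
  beat: ["b-iy1-t", "vb jj nn vbd vbn vbp"],
  gained: ["g-ey1-n-d", "vbd vbn vb"],
  enter: ["eh1-n t-er", "vb vbn vbp"],
};

console.log(verbOnlyWords(sample)); // [ 'called', 'gained', 'enter' ]
```

Note that this also matches base forms such as 'enter', so the plan would still need to decide which of the matched words are actually removable.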

dhowe commented 2 years ago

good - this will take some thought, so first come up with a plan... then we can discuss