dhowe / RiTaV1

RiTa: the generative language toolkit
http://rednoise.org/rita
GNU General Public License v3.0
354 stars, 78 forks

Question about pos tag 'vbp'-non-3rd person singular present #365

Closed cqx931 closed 8 years ago

cqx931 commented 8 years ago

For the POS tag 'vbp', in most cases (more than 1300 words) it comes together with the tag 'vb'. But there are 200 words with only 'vbp' and no 'vb'. Questions:

1. "we" and "you" are tagged as 'vbp' in the dict... can we delete the 'vbp' tag from these two?
2. In my opinion the 'vbp' tag is only meaningful for the cases "am", "are", "is"... In other cases it is just 'vb'. Are there other cases where it is useful?

So I think there could be 3 choices to deal with 'vbp':

1. Add the 'vb' tag to these 200 words, except for the useful cases.
2. Get rid of the 'vbp' tag for most of the cases and just keep the few useful cases; convert the 'vbp' tag of these words in the list to 'vb'.

The word list:

flit vbp
posture nn vbp
embody vbp
starts vbz nns vbp
bale nn vbp
anguish nn vbp
flirt vbp
hamstrung jj vbn vbp
originated vbd jj vbn vbp
surf nn vbp
ah uh vbp
am vbp rb
is vbz rb nns vbp
chomp nn vbp
we prp vbp
squelch vbp
purport vbp
gleam nn vbp
commute vbp nn
been vbn vbp
masquerade nn vbp
worth jj in nn rb vbn vbp
helps vbz vbp nns
fatigue nn vbp
understate vbp
pervade vbp
spout nn vbp
spook vbp
troop nn vbp
doan vbp
sport nn jj vbp
cluster nn vbp
dost vbp
smolder vbp
outnumber vbp
pertained vbp
marked vbn jj vbd vbp
wrack nn vbp
persecute vbp
jockey nn vbp
spurt nn vbp
scoff vbp nn
bowl nn vbp
litter nn vbp
parade nn vbp
flunk vbp
authenticate vbp
enliven vbp
intermingle vbp
pantomime nn vbp
liken vbp
darken vbp
option nn vbp
phantom jj vbp nn
that in dt nn rb rp uh wp vbp wdt
brand nn vbp jj rb
slight jj vbp nn
interject vbp
taint nn vbp
splinter nn vbp jj
trundle nn vbp
tint vbp nn
reek vbp nn
captain nn vbp
tunnel nn vbp
predominate vbp
coddle vbp
pale jj vbp nn
retrench vbp
grouse vbp nn
surfeit nn vbp
treasure nn vbp
trespass nn vbp
damped vbn vbd vbp
preoccupy vbp
age nn vbp
and cc vbp jj rb nnp
are vbp nnp
art nn vbp
anger nn vbp
bug nn vbp
bum nn vbp jj
romp nn vbp
pine nn vbp
rose vbd vbp jj nn
pity nn vbp
dog nn vbp
streak nn vbp
separated vbn jj vbd vbp
evidence nn vbp
nest nn vbp
trumpet nn vbp
grip nn vbp
titter nn vbp
obtained vbn vbd vbp
limits nns vbp vbz
croak nn vbp
exude vbp
lumber nn vbp
overeat vbp
chug vbp
shroud vbp
flurry nn vbp
abstract jj nn vbp
espouse vbp
jet nn vbp
totaled vbd vbn vbp
lap nn vbp
clam nn vbp
recoil nn vbp
underwrote vbd nn vbp
decreases nns vbp vbz
nap nn vbp
cable nn vbp jj
droop vbp nn
cruise nn vbp
tipple vbp
augur vbp
idolize vbp
proposed vbn vbd vbp jj
your prp$ prp vbp
wing nn vbp
absolve vbp
prize nn jj vbp
whoosh vbp nn
freak nn vbp
major jj nn vbp
rehash nn vbp
handcuff vbp
agitate vbp
outspend vbp
pride nn vbp
rap nn vbp
helped vbd vbn vbp
rim nn vbp
vacation nn vbp
recollect vbp
paraphrase nn vbp
demur vbp
empower vbp
squint vbp
proliferate vbp
crisscross vbp
wheel nn vbp
tee nn vbp
hark vbp
hast vbp
clamor vbp nn
stalk vbp
long jj vbp rb
vex vbp
vow nn vbp
wag nn vbp
prowl nn vbp jj
flower nn vbp
arch nn vbp
proscribe vbp
crowds nns vbp vbz
refresh vbp
chain nn vbp
lurch nn vbp
you prp vbp rp
crackle nn vbp
bespeak vbp
transcend vbp
fake jj nn vbp
blurt nn vbp
clock nn vbp
solace nn vbp
frown vbp
canoe nn vbp
plans nns vbp vbz
lump nn vbp
sigh nn vbp
plays vbz nns vbp
squeegee vbp
cloak nn vbp
whittle vbp
ripen vbp
stink nn vbp
segment nn vbp
germinate vbp
garage nn vbp
atrophy nn vbp
dart nn vbp
hoot nn vbp
honk vbp
vomit vbp
inhabit vbp
spade nn jj vbp
rumble nn vbp
spew vbp
scamper vbp
overlay nn vbp
assault nn vbp
deride vbp
beautify vbp
slaughter nn vbp
recheck vbp
throng nn vbp
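
For anyone who wants to reproduce this kind of list, here is a minimal Python sketch of the check and of option 1. It assumes the dictionary can be loaded as a plain mapping from word to a list of POS tags; the real RiTa dictionary file may be structured differently. `SAMPLE_DICT`, `vbp_without_vb`, and `add_vb` are hypothetical names, and the "run" entry is an illustrative assumption, not taken from the list above.

```python
# Hypothetical miniature of the dictionary: word -> list of POS tags.
# The real RiTa dictionary format may differ.
SAMPLE_DICT = {
    "flit": ["vbp"],
    "posture": ["nn", "vbp"],
    "starts": ["vbz", "nns", "vbp"],
    "we": ["prp", "vbp"],
    "run": ["vb", "vbp", "nn"],  # illustrative entry that already has 'vb'
}


def vbp_without_vb(lexicon):
    """Return the words tagged 'vbp' that lack a 'vb' tag, sorted."""
    return sorted(
        word
        for word, tags in lexicon.items()
        if "vbp" in tags and "vb" not in tags
    )


def add_vb(lexicon, words):
    """Option 1: return a copy in which each listed word gains a 'vb'
    tag, placed directly before its 'vbp' tag."""
    updated = {word: list(tags) for word, tags in lexicon.items()}
    for word in words:
        tags = updated[word]
        if "vbp" in tags and "vb" not in tags:
            tags.insert(tags.index("vbp"), "vb")
    return updated


if __name__ == "__main__":
    print(vbp_without_vb(SAMPLE_DICT))
    print(add_vb(SAMPLE_DICT, ["flit", "posture"]))
```

Note that a purely mechanical pass like this would also touch entries such as "we" and "starts", so the output still needs a manual review before anything is written back.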

dhowe commented 8 years ago

Questions: 1. "we" and "you" are tagged as 'vbp' in the dict... can we delete the 'vbp' tag from these two?

Yes, those appear to be mistakes.

2. In my opinion the 'vbp' tag is only meaningful for the cases "am", "are", "is"... In other cases it is just 'vb'. So can we actually change all these 'vbp' tags other than the 3 meaningful cases to 'vb'? What's your opinion about this?

Good question. So you are suggesting that we keep vbp (Verb, non-3rd person singular present) only for irregular verbs? I think this is a good idea; however, it may break some of our tests. For example, what is the correct POS tag for the 2nd word in 'You flit from place to place'? I think it should be 'vbp'...

We will have to check what we get with the changes you suggest. Perhaps we can add a rule to the PosTagger not to use 'vb' unless it is a single word.
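
One possible reading of such a rule, as a rough Python sketch: look the word up in the dictionary, and if the best tag is 'vb' and the word directly follows a non-3rd-person pronoun, report 'vbp' instead. This is not the RiTa PosTagger implementation. `NON_3SG_PRONOUNS`, `LEXICON`, and `tag` are hypothetical names, the lexicon entries are assumed for illustration, and the pronoun test is only one way to decide when 'vb' should become 'vbp'.

```python
# Assumption: pronouns that take a non-3rd-person-singular present verb.
NON_3SG_PRONOUNS = {"i", "you", "we", "they"}

# Hypothetical lexicon after the proposed change: regular verbs keep only 'vb'.
LEXICON = {
    "you": ["prp"],
    "flit": ["vb"],
    "from": ["in"],
    "place": ["nn"],
    "to": ["to"],
}


def tag(words, lexicon=LEXICON):
    """Tag each word with its first dictionary tag, then refine 'vb' by context."""
    tags = []
    for i, word in enumerate(words):
        # Unknown words default to 'nn' in this sketch.
        choices = lexicon.get(word.lower(), ["nn"])
        best = choices[0]
        # Contextual rule: a base-form verb directly after a non-3sg pronoun
        # is read as 'vbp'. A single-word input never triggers this branch.
        if best == "vb" and i > 0 and words[i - 1].lower() in NON_3SG_PRONOUNS:
            best = "vbp"
        tags.append(best)
    return tags


if __name__ == "__main__":
    print(tag("You flit from place to place".split()))
    print(tag(["flit"]))
```

With this rule the single word "flit" stays 'vb', while "flit" in 'You flit from place to place' comes out as 'vbp', which is the behavior the existing tests would presumably expect.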

dhowe commented 8 years ago

Also, remove completely:

ah uh vbp
doan vbp
dost vbp
pertained vbp
liken vbp
retrench vbp
bespeak vbp

Remove 'vbp' from list for:

starts vbz nns vbp
anguish nn vbp
is vbz rb nns vbp
been vbn vbp
worth jj in nn rb vbn vbp
helps vbz vbp nns
sport nn jj vbp
cluster nn vbp
marked vbn jj vbd vbp
option nn vbp
phantom jj vbp nn
that in dt nn rb rp uh wp vbp wdt
grouse vbp nn
surfeit nn vbp
damped vbn vbd vbp
and cc vbp jj rb nnp
art nn vbp
rose vbd vbp jj nn
dog nn vbp
separated vbn jj vbd vbp
evidence nn vbp
trumpet nn vbp
obtained vbn vbd vbp
limits nns vbp vbz
shroud vbp
flurry nn vbp
totaled vbd vbn vbp
lap nn vbp
clam nn vbp
underwrote vbd nn vbp
decreases nns vbp vbz (also remove nns)
proposed vbn vbd vbp jj
your prp$ prp vbp
wing nn vbp
helped vbd vbn vbp
rim nn vbp
vacation nn vbp
tee nn vbp
hark vbp
hast vbp
crowds nns vbp vbz
you prp vbp rp
solace nn vbp
canoe nn vbp
plans nns vbp vbz
plays vbz nns vbp
garage nn vbp
spade nn jj vbp (also remove jj)
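
The two lists above could be applied mechanically along these lines. This is a sketch only, assuming a word-to-tags mapping rather than the actual dictionary file format; `apply_cleanup`, `REMOVE_WORDS`, and `STRIP_TAGS` are hypothetical names, and only a handful of the listed words are included.

```python
# Words to delete outright (a few examples from the first list).
REMOVE_WORDS = {"doan", "dost", "liken"}

# Tags to strip per word (examples from the second list). The notes
# "also remove nns" / "also remove jj" become extra tags in the set.
STRIP_TAGS = {
    "starts": {"vbp"},
    "you": {"vbp"},
    "decreases": {"vbp", "nns"},
    "spade": {"vbp", "jj"},
}


def apply_cleanup(lexicon, remove_words, strip_tags):
    """Return a cleaned copy of the lexicon; the input is left untouched."""
    cleaned = {}
    for word, tags in lexicon.items():
        if word in remove_words:
            continue  # entry is dropped completely
        unwanted = strip_tags.get(word, set())
        cleaned[word] = [t for t in tags if t not in unwanted]
    return cleaned


if __name__ == "__main__":
    sample = {
        "doan": ["vbp"],
        "starts": ["vbz", "nns", "vbp"],
        "you": ["prp", "vbp", "rp"],
        "decreases": ["nns", "vbp", "vbz"],
        "spade": ["nn", "jj", "vbp"],
        "flit": ["vbp"],
    }
    print(apply_cleanup(sample, REMOVE_WORDS, STRIP_TAGS))
```

Building a new mapping instead of editing in place makes it easy to diff the result against the original before committing the change.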

cqx931 commented 8 years ago

I see... that is what vbp is for. But it also won't be helpful for the POS-tagger, as the word comes with two options anyway (vb, vbp), right? Is the POS-tagger checking the dict as well? I thought the dict files were just for RiLexicon... In general I think it doesn't help much to keep 'vbp' in the dictionary if it just repeats itself after 'vb'.

dhowe commented 8 years ago

The main reason for having POS in the dictionary is for the POS-tagger.

But if you want to give it a try, you can remove those regular 'vbp' tags, and we can run the tests and see...

Good checking in any case

cqx931 commented 8 years ago

hark vbp hast vbp shroud vbp

These three words in your list only have 'vbp' as POS tags. Remove these three words completely?

dhowe commented 8 years ago

shroud should be nn, the other two can be removed
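
This edge case (an entry left with no tags once 'vbp' is stripped) can be caught automatically. The sketch below assumes a word-to-tags mapping; `strip_tag` and its `fallback` parameter are hypothetical names for illustration, not part of RiTa.

```python
def strip_tag(lexicon, words, tag="vbp", fallback=None):
    """Remove `tag` from the listed words and return a cleaned copy.

    An entry left with no tags gets its replacement tag list from
    `fallback` if one is given there; otherwise the entry is dropped.
    """
    fallback = fallback or {}
    cleaned = {word: list(tags) for word, tags in lexicon.items()}
    for word in words:
        remaining = [t for t in cleaned[word] if t != tag]
        if remaining:
            cleaned[word] = remaining
        elif word in fallback:
            cleaned[word] = list(fallback[word])
        else:
            del cleaned[word]
    return cleaned


if __name__ == "__main__":
    sample = {
        "hark": ["vbp"],
        "hast": ["vbp"],
        "shroud": ["vbp"],
        "wing": ["nn", "vbp"],
    }
    # 'shroud' becomes 'nn'; 'hark' and 'hast' are removed.
    print(strip_tag(sample, ["hark", "hast", "shroud", "wing"],
                    fallback={"shroud": ["nn"]}))
```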