UCREL / pymusas

Python Multilingual Ucrel Semantic Analysis System
https://ucrel.github.io/pymusas/
Apache License 2.0

Auxiliary verb rule for single word semantic lexicon lookup #27

Open apmoore1 opened 2 years ago

apmoore1 commented 2 years ago

The aim of this issue is to incorporate auxiliary verb rules into the USAS Rule Based Tagger.

Definition of auxiliary verb rules

All POS tags used here are from the CLAWS C7 tagset.

In English (at least in the C version of the semantic tagger) we use auxiliary verb rules for the POS tags VB* (be), VD* (do), and VH* (have) to determine which verb is the auxiliary and which is the main verb, and to alter the semantic tags accordingly.

An auxiliary verb would normally be given the USAS semantic tag Z5 (grammatical bin), whereas the main verb would be given a non-Z5 tag. For example, in the sentence below (format: token_USAS semantic tag), the auxiliary verb is have and the main verb is finished:

I_Z8mf have_Z5 finished_T2- my_Z8 lunch_F1 ._PUNC 

We have approximately 35 rules in place for amending the semantic tags on be, do, and have after the initial set of potential semantic tags is applied. An example rule for have is as follows:

VH*[Z5] (RR*n) (RT*n) (XX) (RR*n) (RT*n) V*N

If the sequence of POS tags matches the given context, here VH* (the POS tag for have) followed by V*N (the POS tag for the word finished), with optional intervening adverbs (R* POS tags) or negation (the XX POS tag), then the rule instructs the tagger to change the semantic tag on the auxiliary verb have to Z5.
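
To make the matching step concrete, here is a minimal Python sketch of how the example rule could be checked against a sequence of CLAWS C7 POS tags. It is not the C tagger's implementation. It assumes that the parenthesised items are optional, that the trailing n means "zero or more repetitions", and that * is a wildcard over tag characters. The names HAVE_RULE, rule_matches and tag_auxiliary_have are hypothetical, the POS tags were assigned by hand, and A9+ is only a placeholder for whatever tag have receives before the rule runs.

```python
from fnmatch import fnmatchcase

# Hypothetical encoding of: VH*[Z5] (RR*n) (RT*n) (XX) (RR*n) (RT*n) V*N
# Each entry is (POS pattern, min repeats, max repeats); None = no upper limit.
# Assumption: "(…n)" means zero or more, "(XX)" means zero or one.
HAVE_RULE = [
    ("VH*", 1, 1),
    ("RR*", 0, None),
    ("RT*", 0, None),
    ("XX", 0, 1),
    ("RR*", 0, None),
    ("RT*", 0, None),
    ("V*N", 1, 1),
]


def rule_matches(pos_tags, start, rule=HAVE_RULE):
    """Return True if `rule` matches `pos_tags` beginning at index `start`."""
    i = start
    for pattern, min_rep, max_rep in rule:
        count = 0
        # Consume as many consecutive tags as the pattern allows.
        while (
            i < len(pos_tags)
            and (max_rep is None or count < max_rep)
            and fnmatchcase(pos_tags[i], pattern)
        ):
            i += 1
            count += 1
        if count < min_rep:
            return False
    return True


def tag_auxiliary_have(pos_tags, sem_tags):
    """Return a copy of `sem_tags` with Z5 on each token where the rule matches."""
    updated = list(sem_tags)
    for i in range(len(pos_tags)):
        if rule_matches(pos_tags, i):
            updated[i] = "Z5"
    return updated


# "I have finished my lunch ." with hand-assigned CLAWS C7 tags.
pos = ["PPIS1", "VH0", "VVN", "APPGE", "NN1", "."]
# "A9+" is a placeholder for the initial tag on "have".
sem = ["Z8mf", "A9+", "T2-", "Z8", "F1", "PUNC"]
print(tag_auxiliary_have(pos, sem))
```

Under these assumptions the sketch changes only the tag on have to Z5, which reproduces the tagging of the example sentence. Greedy matching without backtracking is enough here because none of the optional patterns (RR*, RT*, XX) can match a tag that V*N would match.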

For semantic taggers in other languages (the Java versions), we do not have auxiliary/main verb rules in place.

How this rule maps to the spaCy pipeline through the UPOS tagset

In the UPOS tagset, and therefore in spaCy POS models, we can use the AUX POS tag instead of VB* (be), VD* (do), and VH* (have). Below is the code and output of running the small English spaCy model on the sentence I have finished my lunch.:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I have finished my lunch.')
print('Token\tPOS')
for token in doc:
    print(f'{token.text}\t{token.pos_}')

Output:

Token   POS
I   PRON
have    AUX
finished    VERB
my  PRON
lunch   NOUN
.   PUNCT
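
Given UPOS tags like those above, one possible way to use AUX is to promote Z5 to the first candidate semantic tag on every AUX token. The sketch below is only an illustration of that idea and is not PyMUSAS code: the function prefer_z5_for_aux is hypothetical, the UPOS tags are copied from the output above rather than produced by a model, and the candidate list for have (A9+ before Z5) is made up for the example.

```python
def prefer_z5_for_aux(upos_tags, candidate_tags):
    """Return new candidate lists in which AUX tokens have Z5 ranked first."""
    updated = []
    for upos, tags in zip(upos_tags, candidate_tags):
        if upos == "AUX":
            # Put Z5 first (adding it if absent) and keep the other candidates.
            tags = ["Z5"] + [tag for tag in tags if tag != "Z5"]
        updated.append(list(tags))
    return updated


# UPOS tags copied from the spaCy output for "I have finished my lunch."
upos = ["PRON", "AUX", "VERB", "PRON", "NOUN", "PUNCT"]
# Illustrative candidate semantic tags per token, most likely first.
candidates = [["Z8mf"], ["A9+", "Z5"], ["T2-"], ["Z8"], ["F1"], ["PUNC"]]

ranked = prefer_z5_for_aux(upos, candidates)
print([tags[0] for tags in ranked])
```

One caveat: under the Universal Dependencies guidelines a copular be is also tagged AUX, so this sketch would rank Z5 first for it as well. Whether that is the desired behaviour would need checking against the existing rules.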

perayson commented 2 years ago

I've updated the comment to explain things further. It'd be good to find some evaluation of how accurate the auxiliary verb detection is in spaCy. We described our original approach in this UCREL technical paper: https://ucrel.lancs.ac.uk/papers/techpaper/vol3.pdf