Incorrect tagging of words "in the" in English

MarcRoigVilamala commented 1 year ago

I am trying to assign USAS tags to an English text containing the words "in the", but both words are being tagged as ['altogether', 'B5-']. This looks like a bug: "altogether" is not a USAS tag, and the tag "B5" ("Clothes and personal belongings") has no apparent connection to the words "in the" on their own.

This should be reproducible with the following code:

import spacy

# We exclude the following components as we do not need them. 
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner'])
# Load the English PyMUSAS rule based tagger in a separate spaCy pipeline
english_tagger_pipeline = spacy.load('en_dual_none_contextual')
# Adds the English PyMUSAS rule based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline)

text = "I am sitting in the room"

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')

Which generates the following output for me:

Text    Lemma   POS USAS Tags
I   I   PRON    ['Z8mf']
am  be  AUX ['A3+', 'Z5']
sitting sit VERB    ['M8', 'C1', 'P1', 'G1.1', 'G2.1', 'M6', 'A9+']
in  in  ADP ['altogether', 'B5-']
the the DET ['altogether', 'B5-']
room    room    NOUN    ['H2', 'N3.6']

I suspect this comes from the following line in the MWE file, where "altogether" is separated from the preceding words by a tab character instead of a space: https://github.com/UCREL/Multilingual-USAS/blob/554dc7745f1561206287ead9ade06cd10ff0de30/English/mwe-en.tsv?plain=1#LL11661C15-L11661C15

Presumably, the line is meant to refer to the phrase "in the altogether", which would make sense with the "B5-" tag.
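To illustrate the suspected mechanism, here is a minimal sketch in plain Python. It assumes the MWE file has two tab-separated columns (a template, then space-separated tags). The rows, the helper names and the tag pattern are all hypothetical and only approximate the real file; this is not how PyMUSAS itself loads the lexicon.

```python
import re

# Rough approximation of the shape of a USAS tag (e.g. "B5-", "Z8mf", "G1.1").
# This pattern is an assumption for illustration, not the official tag grammar.
USAS_TAG = re.compile(r"^[A-Z]\d{1,2}(\.\d+)*[+-]{0,3}[a-z@%]*$")


def parse_mwe_rows(tsv_text):
    """Return (template, tags) pairs from two-column, tab-separated text."""
    rows = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        template, tag_field = fields[0], fields[1]
        rows.append((template, tag_field.split()))
    return rows


def suspicious_rows(tsv_text):
    """Return rows where at least one tag does not look like a USAS tag."""
    return [
        (template, tags)
        for template, tags in parse_mwe_rows(tsv_text)
        if any(not USAS_TAG.match(tag) for tag in tags)
    ]


# Hypothetical rows: the second has a stray tab where a space was intended.
intended = "in_* the_* altogether_*\tB5-\n"
broken = "in_* the_*\taltogether B5-\n"

print(parse_mwe_rows(intended))
print(parse_mwe_rows(broken))
print(suspicious_rows(intended + broken))
```

With the stray tab, the template shrinks to just "in the" and "altogether" ends up in the tag column, which would match the output I am seeing.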

Is there any way I can avoid this happening?

perayson commented 1 year ago

Hi, many thanks for flagging this bug. For now, you can edit a copy of the lexicon, regenerate the pymusas model and run pymusas locally. But in parallel, we're running some format checks and will update this one and any others we spot as soon as possible.

perayson commented 1 year ago

Thanks again for noting this issue. @dml2611 and I have carried out extensive format checking for the English lexicons, and we've updated them in the Multilingual-USAS repo (https://github.com/UCREL/Multilingual-USAS/pull/21) and released them in the pymusas-models repo as version 0.3.3 (https://github.com/UCREL/pymusas-models/releases). Please follow the updated how-to (https://ucrel.github.io/pymusas/usage/how_to/tag_text) to use the new versions.
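If it helps to confirm that an installed model includes the fix, a small version comparison along these lines may be useful. This is only a sketch: the helper names are hypothetical, and it assumes the version string reported by the loaded spaCy pipeline (for example via `nlp.meta["version"]`) corresponds to the pymusas-models release number.

```python
def parse_version(version):
    """Turn a string such as '0.3.3' into a comparable tuple (0, 3, 3)."""
    return tuple(int(part) for part in version.split(".")[:3])


def has_lexicon_fix(version, fixed_in="0.3.3"):
    """Return True if `version` is at or after the release with the fix."""
    return parse_version(version) >= parse_version(fixed_in)


# Possible usage, assuming the model is installed locally:
#   import spacy
#   tagger = spacy.load("en_dual_none_contextual")
#   print(has_lexicon_fix(tagger.meta["version"]))
print(has_lexicon_fix("0.3.2"))
print(has_lexicon_fix("0.3.3"))
```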

UCREL / pymusas

Incorrect tagging of words "in the" in English #40