Trouble tagging English text

FahdCodes commented 2 years ago

I'm having trouble tagging English text. I'm using spaCy's 'en_core_web_sm' model for the pipeline. Apparently, the English model does not have 'token.pos', something that the 'usas_tagger' requires. The documentation says that the tagger should work even without 'token.pos'; however, when I feed English text to the tagger, it simply tags every word as 'Z99'.

I would really appreciate any input.

Below is the full code. Thanks!

!pip install pymusas
!python -m spacy download en_core_web_sm

import spacy
from pymusas.spacy_api.taggers import rule_based
from pymusas.pos_mapper import UPOS_TO_USAS_CORE

nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner'])

usas_tagger = nlp.add_pipe('usas_tagger')

_ = nlp.analyze_pipes(pretty=True)

usas_tagger.pos_mapper = UPOS_TO_USAS_CORE

input_text = "It was raining in London and the cat was missing"

tagged_text = nlp(input_text)

print('Text\tLemma\tPOS\tUSAS Tags')
for token in tagged_text:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.usas_tags}')

perayson commented 2 years ago

Hi, thanks for raising this as an issue. At the moment, pymusas does not support tagging English text. We have not yet released the English lexicons that the tagger needs as its knowledge source (see https://github.com/UCREL/Multilingual-USAS), which is why every token is tagged 'Z99'. English is part of our planned roadmap, see here: https://github.com/UCREL/pymusas/blob/main/ROADMAP.md

The currently supported languages are described here, each with example code: https://ucrel.github.io/pymusas/usage/how_to/tag_text

For now, I will leave this issue open since we are planning to release an English version later as described in the roadmap.
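
To illustrate why every token comes out as 'Z99' when no lexicon is loaded, here is a minimal sketch of a lexicon lookup with an unmatched fallback. It is a simplified stand-in and not the actual pymusas implementation: the function 'tag_tokens' and the toy lexicon entries are hypothetical and were written for this example only.

```python
from typing import Dict, List, Optional

# 'Z99' is the USAS tag for unmatched tokens.
UNMATCHED_TAG = "Z99"


def tag_tokens(
    lemmas: List[str],
    lexicon: Optional[Dict[str, List[str]]] = None,
) -> List[List[str]]:
    """Hypothetical helper: look up each lemma, fall back to Z99 if missing."""
    lexicon = lexicon or {}
    return [lexicon.get(lemma.lower(), [UNMATCHED_TAG]) for lemma in lemmas]


# Lemmas for "It was raining in London and the cat was missing".
lemmas = ["it", "be", "rain", "in", "London", "and", "the", "cat", "be", "miss"]

# Without a lexicon there is nothing to match against, so every token is Z99.
no_lexicon = tag_tokens(lemmas)

# Toy entries for illustration only; they are not copied from a released lexicon.
toy_lexicon = {"rain": ["W4"], "london": ["Z2"], "cat": ["L2"]}
with_lexicon = tag_tokens(lemmas, toy_lexicon)

for lemma, before, after in zip(lemmas, no_lexicon, with_lexicon):
    print(f"{lemma}\t{before}\t{after}")
```

With the toy lexicon, only the lemmas that have an entry receive a tag other than 'Z99'; a real lexicon would cover far more of the vocabulary.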

perayson commented 2 years ago

I am closing this now that we have released the English lexicons (https://github.com/UCREL/Multilingual-USAS/tree/master/English) and provided an example pipeline and documentation for English (https://ucrel.github.io/pymusas/usage/how_to/tag_text#english).

UCREL / pymusas

Trouble tagging English text #31