MarcRoigVilamala closed this issue 1 year ago
Hi, many thanks for flagging this bug. For now, you can edit a copy of the lexicon, regenerate the pymusas model and run pymusas locally. But in parallel, we're running some format checks and will update this one and any others we spot as soon as possible.
Thanks again for noting this issue. @dml2611 and I have carried out extensive format checking for the English lexicons, and we've updated them in the Multilingual-USAS repo (https://github.com/UCREL/Multilingual-USAS/pull/21) and released them in the pymusas-models repo as version 0.3.3 (https://github.com/UCREL/pymusas-models/releases). So please use the updated how-to (https://ucrel.github.io/pymusas/usage/how_to/tag_text) for the new versions.
I am currently trying to assign USAS tags to an English text containing the words "in the". However, these words are currently being tagged as ['altogether', 'B5-']. This seems to be a bug, as "altogether" is not a USAS tag. Similarly, the tag "B5", relating to "Clothes and personal belongings", does not seem to have any relation to the words "in the" on their own.
This should be reproducible with the following code:
Which generates the following output for me:
I suspect this comes from the following line in the MWE file, where "altogether" is separated from the preceding words by a tab character instead of a space, so it ends up being read as part of the tags rather than as part of the phrase: https://github.com/UCREL/Multilingual-USAS/blob/554dc7745f1561206287ead9ade06cd10ff0de30/English/mwe-en.tsv?plain=1#LL11661C15-L11661C15
Presumably, the line is meant to refer to the phrase "in the altogether", which would make sense with the "B5-" tag.
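For anyone wanting to check a local copy of the lexicon for similar stray tabs, here is a minimal sketch using only the Python standard library. It rests on several assumptions that are not confirmed by the repo: that each row has exactly two tab-separated columns (a template and a space-separated list of tags), that the first row is a header, and that a rough regex is good enough to tell a tag like "B5-" from an ordinary word. The function name `find_suspect_rows`, the `SPECIAL_TAGS` set and the sample rows are all made up for illustration.

```python
import csv
import io
import re

# Rough shape of a USAS tag such as "B5-" or "A1.1.1": an uppercase letter,
# a number, optional ".number" parts, then any trailing markers.
# This is an approximation for illustration, not the official tag grammar.
TAG_SHAPE = re.compile(r"^[A-Z]\d+(\.\d+)*\S*$")

# Tags that do not follow the letter-number shape (assumed, incomplete list).
SPECIAL_TAGS = {"Df", "PUNCT"}


def find_suspect_rows(tsv_text, expected_columns=2):
    """Return (line_number, reason) pairs for rows that look malformed."""
    suspects = []
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for line_number, row in enumerate(reader, start=1):
        if line_number == 1:
            continue  # skip the header row
        if len(row) != expected_columns:
            reason = f"expected {expected_columns} columns, found {len(row)}"
            suspects.append((line_number, reason))
            continue
        # The second column is assumed to hold space-separated tags.
        for tag in row[1].split():
            if not TAG_SHAPE.match(tag) and tag not in SPECIAL_TAGS:
                suspects.append((line_number, f"unexpected tag {tag!r}"))
                break
    return suspects


# Made-up rows: line 2 is fine, line 3 has a word in the tag column,
# line 4 has an extra tab and therefore three columns.
sample = (
    "mwe_template\tsemantic_tags\n"
    "a_x b_x\tB5-\n"
    "c_x d_x\tsomeword B5-\n"
    "e_x f_x\tg_x\tB5-\n"
)

for line_number, reason in find_suspect_rows(sample):
    print(line_number, reason)
```

To run it against a real file, read the file's contents into a string and pass that to `find_suspect_rows`. Expect some false positives, since the tag pattern is deliberately loose and the list of special tags is a guess.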
Is there any way I can avoid this happening?