PlanTL-GOB-ES / lm-biomedical-clinical-es

Official source for Spanish pretrained biomedical and clinical language models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
Apache License 2.0

NER Example #2

Open giovaninb opened 2 years ago

giovaninb commented 2 years ago

Why is there a Ġ character in the output?

Expected output with the predicted entities:

[
  {'word': 'Ġcalcio', 'score': 0.9963880181312561, 'entity': 'B-NORMALIZABLES', 'index': 24, 'start': 137, 'end': 143},
  {'word': 'Ġcalcio', 'score': 0.9965023398399353, 'entity': 'B-NORMALIZABLES', 'index': 29, 'start': 163, 'end': 169},
  {'word': 'Ġmagnesio', 'score': 0.996299147605896, 'entity': 'B-NORMALIZABLES', 'index': 32, 'start': 178, 'end': 186},
  {'word': 'ĠPTH', 'score': 0.9950509667396545, 'entity': 'B-PROTEINAS', 'index': 34, 'start': 189, 'end': 192}
]

gonzalez-agirre commented 2 years ago

Hi Giovani,

In the RoBERTa and GPT-2 tokenizers, the space before a word is part of the subword that follows it. The special character Ġ marks that space. Keep in mind that a word may be split into two or more subwords, so this marker also distinguishes the start of a word from a continuation subword. For instance, creatinina can be split into 'Ġcreat' and 'inina' (note that 'inina' does not start with the special character).

Best, Aitor.
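
For readers who want clean words in the output, here is a minimal sketch of one way to post-process the predictions. It assumes the entries look like the dictionaries shown in the question (with `word`, `entity`, `start`, and `end` keys) and that a continuation subword starts exactly where the previous one ends. The helper name `merge_subwords` is hypothetical, and the 'Ġcreat' / 'inina' entries, including their offsets and labels, are made up for illustration.

```python
def merge_subwords(entities):
    """Merge consecutive subword predictions into whole words.

    A token starting with 'Ġ' begins a new word. A token without it is
    treated as a continuation only if its span is adjacent to the
    previous token's span (assumption: offsets exclude the space).
    """
    merged = []
    for ent in entities:
        word = ent["word"]
        continues = (
            bool(merged)
            and not word.startswith("Ġ")
            and ent["start"] == merged[-1]["end"]
        )
        if continues:
            # Extend the previous word and its character span.
            merged[-1]["word"] += word
            merged[-1]["end"] = ent["end"]
        else:
            # Start a new word, dropping the leading space marker.
            new_ent = dict(ent)
            new_ent["word"] = word.lstrip("Ġ")
            merged.append(new_ent)
    return merged


entities = [
    {"word": "Ġcalcio", "entity": "B-NORMALIZABLES", "start": 137, "end": 143},
    {"word": "ĠPTH", "entity": "B-PROTEINAS", "start": 189, "end": 192},
    # Hypothetical split word; offsets and labels are illustrative only.
    {"word": "Ġcreat", "entity": "B-NORMALIZABLES", "start": 200, "end": 205},
    {"word": "inina", "entity": "I-NORMALIZABLES", "start": 205, "end": 210},
]

merged = merge_subwords(entities)
print([e["word"] for e in merged])  # ['calcio', 'PTH', 'creatinina']
```

The merged entry keeps the label of its first subword. If the predictions come from the Hugging Face token-classification pipeline, passing `aggregation_strategy="simple"` when creating the pipeline is another option for grouping subwords, which may make this manual step unnecessary.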
