Open giovaninb opened 2 years ago
Hi Giovani,
In the RoBERTa and GPT-2 tokenizers, the space before a word is always part of the subword, and the special symbol Ġ marks that space. Keep in mind that a word may be split into two or more subwords, so this symbol also distinguishes the subword that starts a word from the subwords that continue it. For instance, creatinina can be split into 'Ġcreat' and 'inina' (note that 'inina' does not start with the special symbol).
Best, Aitor.
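As a rough illustration of the explanation above (not code from the maintainers), the following sketch shows one way to post-process such output in plain Python. The helper names `merge_subwords` and `strip_marker` are hypothetical, and the sketch assumes that a leading Ġ always marks the start of a new word and that the first token of a text may lack it.

```python
MARKER = "Ġ"  # how byte-level BPE vocabularies display a leading space


def merge_subwords(tokens, marker=MARKER):
    """Group subword tokens into words (hypothetical helper).

    A token that starts with the marker opens a new word; any other
    token is appended to the word currently being built.
    """
    words = []
    for tok in tokens:
        starts_word = tok.startswith(marker)
        piece = tok[len(marker):] if starts_word else tok
        if starts_word or not words:
            # The first token of a text may carry no marker at all.
            words.append(piece)
        else:
            words[-1] += piece
    return words


def strip_marker(entities, marker=MARKER):
    """Return copies of the entity dicts with the marker removed from 'word'."""
    cleaned = []
    for ent in entities:
        word = ent["word"]
        if word.startswith(marker):
            word = word[len(marker):]
        cleaned.append({**ent, "word": word})  # the input dict is left untouched
    return cleaned


print(merge_subwords(["Ġcreat", "inina", "Ġcalcio"]))  # ['creatinina', 'calcio']

entities = [{"word": "ĠPTH", "entity": "B-PROTEINAS", "start": 189, "end": 192}]
print(strip_marker(entities)[0]["word"])  # PTH
```

In the output quoted in this thread, the 'start' and 'end' offsets of 'Ġcalcio' span six characters, which matches calcio without the space, so slicing the input text with those offsets is another way to get the clean word. If the output comes from the Hugging Face `transformers` token-classification pipeline, its `aggregation_strategy` argument (for example `"simple"`) may be a more direct way to get grouped words; this thread does not confirm whether it suits this model.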
On Sun, Aug 7, 2022 at 8:51 PM Giovani Bettoni @.***> wrote:
Why is there a Ġ tag in the output? Expected output with the predicted entities:
[
  {'word': 'Ġcalcio', 'score': 0.9963880181312561, 'entity': 'B-NORMALIZABLES', 'index': 24, 'start': 137, 'end': 143},
  {'word': 'Ġcalcio', 'score': 0.9965023398399353, 'entity': 'B-NORMALIZABLES', 'index': 29, 'start': 163, 'end': 169},
  {'word': 'Ġmagnesio', 'score': 0.996299147605896, 'entity': 'B-NORMALIZABLES', 'index': 32, 'start': 178, 'end': 186},
  {'word': 'ĠPTH', 'score': 0.9950509667396545, 'entity': 'B-PROTEINAS', 'index': 34, 'start': 189, 'end': 192}
]