Open albcunha opened 3 years ago
@albcunha Thank you for reporting this issue. I'll try to look into it in more detail. In the meantime, which column is the desired one of the two for Portuguese? :smiley:
Ideally, I think a general rule would be that the token.idx of the first token covers only the first character of the contraction, and the second token covers the rest of the word (the remaining characters). They could have this format:
A A 0
linguagem linguagem 2
da de 12
a a 13
paz paz 15
pode pode 19
ser ser 24
uma uma 28
cultura cultura 32
. . 39
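A minimal sketch of the rule suggested above, in plain Python and independent of spaCy or spacy-udpipe. The function name `assign_offsets` and the `units` structure (surface form plus the words it expands to) are hypothetical and exist only for illustration. Placing every word after the first at the second character, clamped to the surface form, is an assumption for contractions of other shapes.

```python
from typing import List, Tuple


def assign_offsets(text: str, units: List[Tuple[str, List[str]]]) -> List[Tuple[str, int]]:
    """Return (word, idx) pairs following the suggested rule.

    `units` (hypothetical structure) pairs each surface form found in `text`
    with the words it expands to; ordinary tokens expand to themselves.
    """
    result = []
    cursor = 0
    for surface, words in units:
        start = text.index(surface, cursor)  # locate the surface form in the raw text
        end = start + len(surface)
        # The first word is anchored at the first character of the surface form.
        result.append((words[0], start))
        # Remaining words start at the second character (assumption: clamped
        # so the offset never leaves the surface form).
        for word in words[1:]:
            result.append((word, min(start + 1, end - 1)))
        cursor = end
    return result


text = "A linguagem da paz pode ser uma cultura."
units = [
    ("A", ["A"]), ("linguagem", ["linguagem"]), ("da", ["de", "a"]),
    ("paz", ["paz"]), ("pode", ["pode"]), ("ser", ["ser"]),
    ("uma", ["uma"]), ("cultura", ["cultura"]), (".", ["."]),
]
for word, idx in assign_offsets(text, units):
    print(word, idx)
```

With this rule the offsets of the tokens after the contraction stay aligned with the original text, because the cursor advances by the length of the surface form ("da") rather than by the lengths of the expanded words.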
There are many words in Portuguese where these contractions happen; some are not identified by the model. Still, it happens a lot. The change, as suggested, would solve all the words I checked, such as these:
do, dos, da, das, dum, duns, duma, umas, doutro, doutros, doutra, doutras, donde, no, nos, na, nas, num, nuns, numa, numas, noutro, noutros, pelo, pelos, pela, pelas.
There are other contractions that the model won't "catch", so I think it does not matter.
Thanks for any help!
Hello! Maybe something is not working correctly with token.idx in Portuguese.
I think the cause is multiword tokens. In Portuguese, "da" (of the) is a contraction of "de + a".
I saw https://github.com/TakeLab/spacy-udpipe/pull/17, which seems to address the same problem, but it seems it won't work for Portuguese.
This works (token.text and text slice are the same):
The The
language language
of of
peace peace
can can
be be
a a
culture culture
. .
This won't work (token.text and text slice are not the same after the multiword token):
A A
linguagem linguagem
de da
a p
paz z p
pode de s
ser r u
uma a c
cultura ltura.
.
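The comparison behind the two outputs above can be sketched in plain Python, without spaCy: slice the text at each token's offset and compare the slice with the token's own text. The helper name `misaligned` is hypothetical, and the offsets in `tokens` are not taken from the library; they are inferred from the slices printed above (a shift of two characters after the contraction), so treat them as an assumption.

```python
from typing import List, Tuple


def misaligned(text: str, tokens: List[Tuple[str, int]]) -> List[Tuple[str, int, str]]:
    """Return (token_text, idx, slice) for tokens whose slice differs from their text."""
    bad = []
    for tok_text, idx in tokens:
        # Same check as in the report: the slice at token.idx vs. token.text.
        piece = text[idx:idx + len(tok_text)]
        if piece != tok_text:
            bad.append((tok_text, idx, piece))
    return bad


text = "A linguagem da paz pode ser uma cultura."
# Offsets inferred from the output above (assumption, not library output).
tokens = [
    ("A", 0), ("linguagem", 2), ("de", 12), ("a", 15), ("paz", 17),
    ("pode", 21), ("ser", 26), ("uma", 30), ("cultura", 34), (".", 41),
]
for tok_text, idx, piece in misaligned(text, tokens):
    print(repr(tok_text), idx, repr(piece))
```

Note that "de" can never equal its slice "da", since the expanded word does not occur in the raw text; the tokens after it are the ones a corrected offset would bring back into alignment.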
Any ideas of how to circumvent this?