TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

Text indexation on portuguese #32

Open albcunha opened 3 years ago

albcunha commented 3 years ago

Hello! token.idx does not seem to work correctly for Portuguese.

I think the cause is multiword tokens. In Portuguese, "da" (of the) is a contraction of "de" + "a".

I saw https://github.com/TakeLab/spacy-udpipe/pull/17, which seems to address the same problem, but it does not seem to work for Portuguese.

This works (token.text and text slice are the same):

import spacy_udpipe

nlp = spacy_udpipe.load("en")
text = "The language of peace can be a culture."
doc = nlp(text)
for token in doc:
    # Compare each token with the slice of the input that token.idx points at
    print(token.text, text[token.idx:token.idx + len(token.text)])

The The
language language
of of
peace peace
can can
be be
a a
culture culture
. .

This does not work (token.text and the text slice differ after the multiword token):

import spacy_udpipe

nlp = spacy_udpipe.load("pt")
text = "A linguagem da paz pode ser uma cultura."
doc = nlp(text)
for token in doc:
    # Compare each token with the slice of the input that token.idx points at
    print(token.text, text[token.idx:token.idx + len(token.text)])

A A
linguagem linguagem
de da
a p
paz z p
pode de s
ser r u
uma a c
cultura ltura.
.
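To pinpoint where the offsets start to drift, something like the following sketch could help. It is plain Python and does not call spacy_udpipe; `find_misaligned` is a hypothetical helper, and the idx values in `tokens` are not printed by the library but inferred from the slices shown above, so treat them as an assumption.

```python
def find_misaligned(text, tokens):
    """Return (token_text, idx, slice) for each token whose slice differs.

    `tokens` is a list of (token_text, idx) pairs, e.g. built from
    [(t.text, t.idx) for t in doc].
    """
    bad = []
    for tok, idx in tokens:
        piece = text[idx:idx + len(tok)]
        if piece != tok:
            bad.append((tok, idx, piece))
    return bad


text = "A linguagem da paz pode ser uma cultura."
# Assumption: idx values inferred from the printed slices, not taken from the library.
tokens = [
    ("A", 0), ("linguagem", 2), ("de", 12), ("a", 15), ("paz", 17),
    ("pode", 21), ("ser", 26), ("uma", 30), ("cultura", 34), (".", 41),
]
misaligned = find_misaligned(text, tokens)
for tok, idx, piece in misaligned:
    print(tok, idx, repr(piece))
```

The first entry reported is the "de" of the contraction "da"; every token after it is shifted by two characters, which matches the combined length of "de" + "a" being counted on top of the surface form.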

Any ideas of how to circumvent this?

asajatovic commented 3 years ago

@albcunha Thank you for reporting this issue. I'll try to look into it in more detail. In the meantime, which column is the desired one of the two for Portuguese? :smiley:

albcunha commented 3 years ago

Ideally, I think a general rule would be that, for a multiword token, token.idx of the first token points at the first character of the word, and the second token takes the rest of the word (the remaining characters). They could have this format:

A A 0
linguagem linguagem 2
da de 12
a a 13
paz paz 15
pode pode 19
ser ser 24
uma uma 28
cultura cultura 32
. . 39
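The offsets in the last column can be reproduced with a short sketch of that rule. This is only an illustration under assumptions: `assign_offsets` is a hypothetical helper, not part of spacy-udpipe, and the input is a hand-written list of (surface word, tokens) pairs rather than real model output.

```python
def assign_offsets(text, words):
    """Assign a character offset to every token.

    `words` is a list of (surface, parts) pairs, where `parts` are the
    tokens the surface word expands to (one part for ordinary words).
    The first part gets the start of the surface word; each later part
    is shifted one character further, capped at the last character.
    """
    offsets = []
    cursor = 0
    for surface, parts in words:
        start = text.index(surface, cursor)  # locate the surface word in the text
        for i, part in enumerate(parts):
            offsets.append((part, start + min(i, len(surface) - 1)))
        cursor = start + len(surface)
    return offsets


text = "A linguagem da paz pode ser uma cultura."
words = [
    ("A", ["A"]), ("linguagem", ["linguagem"]), ("da", ["de", "a"]),
    ("paz", ["paz"]), ("pode", ["pode"]), ("ser", ["ser"]),
    ("uma", ["uma"]), ("cultura", ["cultura"]), (".", ["."]),
]
offsets = assign_offsets(text, words)
for token_text, idx in offsets:
    print(token_text, idx)
```

With this rule, tokens after the contraction keep their true positions in the original text, because the cursor advances by the length of the surface word and not by the combined length of its parts.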

There are many words in Portuguese where these contractions happen, and some are not identified by the model. Still, it happens a lot. The change, as suggested, would solve all the words I checked, such as these: do, dos, da, das, dum, duns, duma, umas, doutro, doutros, doutra, doutras, donde, no, nos, na, nas, num, nuns, numa, numas, noutro, noutros, pelo, pelos, pela, pelas.

There are other contractions that the model won't "catch", so I think they do not matter here.

Thanks for any help!