erre-quadro / spikex

SpikeX - SpaCy Pipes for Knowledge Extraction
Apache License 2.0

Umlauts #12

Open Fetzii opened 2 years ago

Fetzii commented 2 years ago

Description

Getting the categories for a page with an umlaut in its title from my dewiki_core graph (Cem Özdemir: https://de.wikipedia.org/wiki/Cem_%C3%96zdemir) crashes, which shouldn't happen. There is also an English wiki page for him (https://en.wikipedia.org/wiki/Cem_%C3%96zdemir).
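For reference, the page title and the percent-encoded form in the URLs above are the same string. A small sketch using only the standard library (the variable names are illustrative, and nothing here touches spikex itself):

```python
from urllib.parse import quote, unquote

# Page title as passed to the wikigraph lookup in this report.
title = "Cem_Özdemir"

# Percent-encode the title the way it appears in the Wikipedia URL path.
# quote() encodes non-ASCII characters as UTF-8 bytes by default.
encoded = quote(title)

# Going the other way recovers the readable title from the URL fragment.
decoded = unquote("Cem_%C3%96zdemir")

print(encoded)
print(decoded)
```

The `%C3%96` pair is the two-byte UTF-8 encoding of "Ö", which is relevant if the dump is read with a single-byte codec.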

What I Did

from spikex.wikigraph import load as wg_load
wg = wg_load("dewiki_core")
page = "Cem_Özdemir"
categories = wg.get_categories(page, distance=1)
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

paoloq commented 2 years ago

Thank you, @Fetzii! I'll investigate this issue; it could be related to badly handled encoding. I'll keep you posted on what I find.

Fetzii commented 2 years ago

I seem to have fixed the problem locally by changing line 234 in dumptools.py from: line = line.decode("latin1") to: line = line.decode(encoding="utf-8", errors="backslashreplace")
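A minimal sketch of why that change could matter, assuming the dump lines are UTF-8 encoded bytes. The sample bytes below are made up for illustration and are not taken from the actual dump or from dumptools.py:

```python
# Illustrative stand-in for one raw line read from a dump file.
raw = "Cem_Özdemir".encode("utf-8")  # b'Cem_\xc3\x96zdemir'

# Original behaviour: latin1 maps every byte to exactly one character,
# so the two-byte UTF-8 sequence for "Ö" turns into two unrelated characters
# and the resulting title no longer matches "Cem_Özdemir".
as_latin1 = raw.decode("latin1")

# Proposed behaviour: decode as UTF-8, and escape any undecodable byte
# instead of raising UnicodeDecodeError.
as_utf8 = raw.decode(encoding="utf-8", errors="backslashreplace")

print(repr(as_latin1))
print(repr(as_utf8))

# A byte sequence that is not valid UTF-8 survives as a visible escape.
bad = b"Cem_\xd6zdemir"
escaped = bad.decode(encoding="utf-8", errors="backslashreplace")
print(escaped)
```

If titles are stored after a latin1 decode, a lookup for "Cem_Özdemir" would not find a matching key, which might explain a None turning up further down; that link is a guess and has not been checked against the library code.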