erre-quadro / spikex

SpikeX - SpaCy Pipes for Knowledge Extraction
Apache License 2.0

Incomplete list of categories #13

Open Fetzii opened 2 years ago

Fetzii commented 2 years ago

Description

I want to get all categories of a page, but most of them are missing from the result.

What I Did

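from spikex.wikigraph import load as wg_load

# `wg` was never defined in the original snippet; "dewiki_core" is an assumption (German page)
wg = wg_load("dewiki_core")
page = "Peking_2022"
categories = wg.get_categories(page, distance=1)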

What I get: ['Category:Olympische_Winterspiele_2022']

The output I expect: ['Austragung der Olympischen Winterspiele', 'Olympische Winterspiele 2022', 'Sport (Hebei)', 'Sportveranstaltung 2022', 'Sportveranstaltung in Peking', 'Wikipedia:Veraltet nach Jahr 2022', 'Zukünftige Sportveranstaltung']

Proof: https://de.wikipedia.org/wiki/Olympische_Winterspiele_2022

As a workaround, I built a categorylinks dictionary from categorylinks.sql.gz: the keys are page_ids, and each value is the list of categories for that page. I used your functions to get the page_id, page_id = self.get_pageid(self.redirect(page)), and then looked it up in my categorylinks dictionary. With this method I get the expected output. If the current behaviour is not intended, I suspect there is a problem with how categorylinks.sql.gz is processed on your side.
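For anyone who wants to reproduce the workaround, here is a minimal sketch of how such a dictionary could be built. It is not the code used above and not part of SpikeX: `parse_categorylinks` and `load_categorylinks` are hypothetical helper names. It assumes the older dump layout where each row starts with `(cl_from,'cl_to',...` (page_id first, category name second), and the regex is a simplification that could misfire on unusual sort keys.

```python
import gzip
import re
from collections import defaultdict

# Matches the first two fields of each row tuple: (cl_from,'cl_to',
# Assumption: older categorylinks layout with the category name in column 2.
ROW_START = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)',")


def parse_categorylinks(lines):
    """Map each page_id (cl_from) to its list of category names (cl_to)."""
    links = defaultdict(list)
    for line in lines:
        # Only INSERT statements carry row data; skip schema and comments.
        if not line.startswith("INSERT INTO"):
            continue
        for page_id, category in ROW_START.findall(line):
            # Drop the backslash from escaped characters such as \' in names.
            links[int(page_id)].append(re.sub(r"\\(.)", r"\1", category))
    return dict(links)


def load_categorylinks(path):
    """Read a gzipped SQL dump (e.g. categorylinks.sql.gz) line by line."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        return parse_categorylinks(dump)
```

The lookup described above would then be something like links.get(page_id, []), with page_id obtained via get_pageid(redirect(page)) and links being the dictionary returned by load_categorylinks.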

andremacola commented 2 years ago

I'm facing the same problem with ptwiki_core.