Open dalek-who opened 1 year ago
Hello, I also noticed the same problem and opened a thread here. Did you find a way around it?
I haven't figured out the exact meaning of `href` yet. But when I append the `href` value to https://en.wikipedia.org/wiki/, each item redirects properly to a Wikipedia page.
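To illustrate the observation above, here is a minimal sketch of turning an `href` value into a Wikipedia URL. It assumes `href` is a (possibly percent-encoded) page title; `WIKI_BASE` and `href_to_url` are names made up for this example and are not part of KILT.

```python
from urllib.parse import quote, unquote

# Base URL taken from the comment above.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def href_to_url(href: str) -> str:
    """Build a Wikipedia URL from an anchor href (assumed to be a page title)."""
    # Decode first so already-encoded hrefs are not encoded twice.
    title = unquote(href).replace(" ", "_")
    # Re-encode for use in a URL, keeping "/", ":" and "#" readable.
    return WIKI_BASE + quote(title, safe="/:#")


print(href_to_url("Albert%20Einstein"))
```

Wikipedia itself resolves redirects, which may be why nearly every `href` lands on a valid page.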
@happen2me sorry, I still have no solution.
I worked out a workaround: generate the Wikidata ID from `href`, then convert the Wikidata ID to a Wikipedia ID.
pip install git+https://github.com/happen2me/wikimapper.git
Process `href`:
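from urllib.parse import unquote

def process_href(href):
    # decode percent-encoding, e.g. "%20" becomes " "
    href = unquote(href)
    # Wikipedia titles use underscores instead of spaces
    href = href.replace(" ", "_")
    # remove subsection
    subsection_idx = href.find("#")
    if subsection_idx != -1:
        href = href[:subsection_idx]
    return href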
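from wikimapper import WikiMapper

mapper = WikiMapper("path/to/index_enwiki-latest.db")
# map the normalized href (a page title) to a Wikidata ID
wikidata_id = mapper.title_to_id(process_href(href), uncased=True)
# map the Wikidata ID back to the Wikipedia page titles
mapper.id_to_titles(wikidata_id)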
The result is accurate enough: in a test on one sample, 316 out of 319 hrefs were converted successfully.
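For anyone who wants to repeat this kind of check, here is a rough sketch of counting how many hrefs convert. The `lookup` argument stands in for something like `mapper.title_to_id`, and I am assuming it returns `None` when a title is not found. `conversion_report` and the toy dictionary are hypothetical, only there to keep the example runnable without the database.

```python
from typing import Callable, Iterable, List, Optional, Tuple
from urllib.parse import unquote


def process_href(href: str) -> str:
    """Same normalization as above: decode, underscore spaces, drop '#section'."""
    href = unquote(href).replace(" ", "_")
    return href.split("#", 1)[0]


def conversion_report(
    hrefs: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> Tuple[int, int, List[str]]:
    """Return (converted, total, failed_hrefs) for the given lookup function."""
    total = 0
    failed: List[str] = []
    for href in hrefs:
        total += 1
        # A None result is treated as a failed conversion (assumption).
        if lookup(process_href(href)) is None:
            failed.append(href)
    return total - len(failed), total, failed


# Toy stand-in for the real title -> Wikidata ID index.
toy_index = {"Albert_Einstein": "Q937", "Berlin": "Q64"}
sample = ["Albert%20Einstein", "Berlin#History", "No such page"]
converted, total, failed = conversion_report(sample, toy_index.get)
print(f"{converted} out of {total} converted; failed: {failed}")
```

Looking at the failed hrefs directly is usually the quickest way to see whether they are redirects, renamed pages, or encoding problems.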
I extracted the `anchors` fields from kilt_knowledgesource.json and got 140,965,933 anchor texts. However, only 6,498,301 of them have a corresponding Wikipedia ID, which is even fewer than the examples in blink-train-kilt.jsonl (9M lines). Is there a full version of the anchor texts with their corresponding Wikipedia IDs? Also, what is the meaning of the `href` field in `anchors`?
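For reference, a count like the one above could be produced roughly as follows. This is only a sketch: it assumes the knowledge source has one JSON object per line, that each page has an `anchors` list, and that a resolved anchor carries a `wikipedia_id` key. Those field names are assumptions about the file layout, and `count_linked_anchors` is a hypothetical helper.

```python
import json
from typing import Iterable, Tuple


def count_linked_anchors(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_anchors, anchors_with_wikipedia_id) over JSON lines."""
    total = 0
    linked = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page = json.loads(line)
        # "anchors" and "wikipedia_id" are assumed field names.
        for anchor in page.get("anchors", []):
            total += 1
            if anchor.get("wikipedia_id") is not None:
                linked += 1
    return total, linked


# Tiny in-memory sample instead of the real file.
sample_lines = [
    '{"anchors": [{"text": "Berlin", "href": "Berlin", "wikipedia_id": "1"},'
    ' {"text": "a city", "href": "City"}]}',
    '{"anchors": []}',
]
print(count_linked_anchors(sample_lines))
```

On the real data this would be called with an open file handle, e.g. `with open("kilt_knowledgesource.json") as f: count_linked_anchors(f)`, so the file is streamed line by line instead of being loaded into memory.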