facebookresearch / KILT

Library for Knowledge Intensive Language Tasks
MIT License
893 stars 90 forks

Anchor texts in knowledge source. #62

Open dalek-who opened 1 year ago

dalek-who commented 1 year ago

I extracted the anchors fields from kilt_knowledgesource.json and got 140,965,933 anchor texts. However, only 6,498,301 of them have a corresponding Wikipedia ID, which is even fewer than the number of examples in blink-train-kilt.jsonl (9M lines). Is there a full version of the anchor texts with their corresponding Wikipedia IDs?

Also, what is the meaning of the href field in anchors?
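
For anyone who wants to reproduce the counts, here is a rough sketch of the tally. It assumes the knowledge source is JSON Lines (one page per line) and that each entry in anchors may carry an optional wikipedia_id key; the function name count_anchors and the sample records are made up for illustration.

```python
import json
from typing import Iterable, Tuple


def count_anchors(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total anchors, anchors that carry a wikipedia_id)."""
    total = 0
    linked = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page = json.loads(line)
        # Assumption: each page has an "anchors" list of dicts.
        for anchor in page.get("anchors", []):
            total += 1
            if anchor.get("wikipedia_id") is not None:
                linked += 1
    return total, linked


# Made-up records standing in for kilt_knowledgesource.json.
sample = [
    json.dumps({"wikipedia_id": "1", "anchors": [
        {"text": "Paris", "href": "Paris", "wikipedia_id": "12345"},
        {"text": "capital", "href": "Capital%20city"},
    ]}),
    json.dumps({"wikipedia_id": "2", "anchors": []}),
]

total, linked = count_anchors(sample)
print(total, linked)
```

On the real file, the same function can be fed an open file handle, for example `with open("kilt_knowledgesource.json") as f: count_anchors(f)`, so the file is streamed line by line instead of loaded into memory.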

happen2me commented 1 year ago

Hello, I also noticed the same problem and opened a thread here. Did you find a workaround?

I haven't worked out the exact meaning of href yet. But when I append the href value to https://en.wikipedia.org/wiki/, each item redirects properly to a Wikipedia page.
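
A minimal sketch of that check, assuming href values are already URL-encoded page titles. The helper name href_to_url and the sample href are made up for illustration.

```python
WIKI_PREFIX = "https://en.wikipedia.org/wiki/"


def href_to_url(href: str) -> str:
    """Build the English Wikipedia URL for an anchor's href value."""
    # Assumption: href is already percent-encoded, so it is appended as-is.
    return WIKI_PREFIX + href


# Made-up href value for illustration.
print(href_to_url("Capital%20city"))
```

Opening the resulting URL in a browser is how I checked that the href resolves to a page.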

dalek-who commented 1 year ago

@happen2me sorry, I still have no solution.

happen2me commented 1 year ago

I worked out a workaround: generate a Wikidata ID from the href, then convert the Wikidata ID to a Wikipedia ID.

  1. Install my modified version of Wikimapper: pip install git+https://github.com/happen2me/wikimapper.git
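  2. Process href
    from urllib.parse import unquote

    def process_href(href):
        # decode percent-encoding, e.g. %20 -> space
        href = unquote(href)
        # Wikipedia titles use underscores instead of spaces
        href = href.replace(" ", "_")
        # remove subsection
        subsection_idx = href.find("#")
        if subsection_idx != -1:
            href = href[:subsection_idx]
        return href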
  3. Map to Wikidata ID
    from wikimapper import WikiMapper
    mapper = WikiMapper("path/to/index_enwiki-latest.db")
    wikidata_id = mapper.title_to_id(process_href(href), uncased=True)
  4. Convert to Wikipedia ID or anything you wish. For example, title texts can be retrieved with mapper.id_to_titles(wikidata_id)

The result is accurate enough: in a test on one sample, 316 out of 319 hrefs were successfully converted.
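
To measure the conversion rate on your own sample, something like the sketch below should work. To keep it runnable without the Wikimapper index, a plain dict stands in for the mapper; the table contents, the IDs in it and the helper names count_converted and lookup are all made up for illustration. With the real index, lookup could instead wrap the mapper.title_to_id call from step 3.

```python
from typing import Callable, Iterable, Optional
from urllib.parse import unquote


def process_href(href: str) -> str:
    """Normalise an anchor href into a Wikipedia-style title."""
    href = unquote(href).replace(" ", "_")
    # Drop the subsection part after "#", if any.
    return href.split("#", 1)[0]


def count_converted(hrefs: Iterable[str],
                    lookup: Callable[[str], Optional[str]]) -> int:
    """Count the hrefs for which lookup returns an ID."""
    return sum(1 for h in hrefs if lookup(process_href(h)) is not None)


# Hypothetical stand-in for the Wikimapper index (placeholder IDs).
table = {"capital_city": "Q-example-1", "paris": "Q-example-2"}


def lookup(title: str) -> Optional[str]:
    # Case-insensitive match, mimicking the uncased lookup in step 3.
    return table.get(title.lower())


hrefs = ["Capital%20city#History", "Paris", "No%20such%20page"]
converted = count_converted(hrefs, lookup)
print(f"{converted} out of {len(hrefs)} hrefs converted")
```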