dalab / deep-ed

Source code for the EMNLP'17 paper "Deep Joint Entity Disambiguation with Local Neural Attention", https://arxiv.org/abs/1704.04920
Apache License 2.0

basic data #5

Closed zhaoluffy closed 6 years ago

zhaoluffy commented 7 years ago

Hi, were the source files 'wiki_redirects.txt', 'wiki_name_id_map.txt', and 'wiki_disambiguation_pages.txt' generated or downloaded?

octavian-ganea commented 7 years ago

They were generated from the Wikipedia dump, but the code for this was not included here.

zhaoluffy commented 6 years ago

Hi, thanks for your reply. Could you share how the file 'wiki_disambiguation_pages.txt' was generated?

octavian-ganea commented 6 years ago

Honestly, this was generated a long time ago by a student and I don't have the code. But, if I recall correctly, this file contains all list pages, all pages with '(disambiguation)' in the title, all pages in the Category:Disambiguation_pages category, and all pages that start with "X may refer to".
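
Roughly, those heuristics could be re-implemented along the lines of the sketch below. This is only an illustration under my assumptions (the function name, regexes, and the 500-character window are mine), not the original generation code, which is lost:

```python
import re

# Illustrative heuristics only -- not the original generation code.
DISAMBIG_TITLE    = re.compile(r'\(disambiguation\)\s*$', re.IGNORECASE)
LIST_TITLE        = re.compile(r'^List of ', re.IGNORECASE)
MAY_REFER_TO      = re.compile(r'\bmay refer to\b', re.IGNORECASE)
DISAMBIG_CATEGORY = re.compile(r'\[\[Category:\s*Disambiguation pages', re.IGNORECASE)

def looks_like_disambiguation(title, wikitext):
    """Heuristically flag list/disambiguation pages from a Wikipedia dump."""
    if DISAMBIG_TITLE.search(title) or LIST_TITLE.match(title):
        return True
    # "X may refer to" normally appears in the first sentence of the page.
    if MAY_REFER_TO.search(wikitext[:500]):
        return True
    return bool(DISAMBIG_CATEGORY.search(wikitext))
```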

zhaoluffy commented 6 years ago

I see, thank you very much.

mickvanhulst commented 5 years ago

Dear author,

I am sorry for re-opening this issue, but it seemed unproductive to open a new issue with the same type of questions.

I am currently processing a new Wikipedia dump, which means I have to obtain the files mentioned above. To achieve this, I am using your textWithAnchorsFromAllWikipedia2014Feb to generate the remainder of the files. Once these files match the ones you have provided, I will switch to my own dump, which, as I read, can be obtained with WikiExtractor. I am, however, unsure about the following points and hope you can give me some insight:

  1. "X may refer to" does not capture all disambiguation pages as in some cases there are lines such as: 'X primarily refers to', see [0]. Wouldn't it be better to alter the WikiExtractor package to also write the disambiguation pages to a text file as this is using regular expressions to parse the XML file?
  2. I assumed the wiki_id_name_map.txt file was generated by simply looking at the lines with '<doc id' in them and then assigning the id to a particular name. This, however, does not seem to be the case as I am finding more ids (59.692). This is the result of altering the WikiExtractor package to store name/id combinations.
  3. I am currently filtering redirects by altering the WikiExtractor package slightly. It uses regular expressions to find redirects, which I am re-using to store the redirects in a separate file.

[0] https://en.wikipedia.org/wiki/Alien
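
For reference, this is the kind of parsing I had in mind for point 2. It is a sketch under my own assumptions: the '<doc id=...>' header format is what WikiExtractor writes, but the tab-separated name/id output format is only my guess at how wiki_name_id_map.txt is laid out:

```python
import re
import sys

# WikiExtractor writes one header line per article, e.g.:
#   <doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
DOC_HEADER = re.compile(r'<doc id="(\d+)"[^>]*title="([^"]*)">')

def build_name_id_map(extracted_path, out_path):
    """Collect (title, id) pairs from WikiExtractor output.

    Writing 'title<TAB>id' per line is my assumption about the format of
    wiki_name_id_map.txt, not something confirmed by the authors.
    """
    with open(extracted_path, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            m = DOC_HEADER.match(line)
            if m:
                page_id, title = m.groups()
                fout.write(f'{title}\t{page_id}\n')

if __name__ == '__main__':
    build_name_id_map(sys.argv[1], sys.argv[2])
```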