dalab / deep-ed

Source code for the EMNLP'17 paper "Deep Joint Entity Disambiguation with Local Neural Attention", https://arxiv.org/abs/1704.04920
Apache License 2.0

Dictionary for candidate selection #8

Closed: zisding closed this issue 6 years ago

zisding commented 6 years ago

Hi,

The paper mentions that a dictionary built from a large Web corpus (Crosswikis) is used. However, Crosswikis provides 8 dictionaries. Could you please tell me which one was used, and whether any pre-processing was applied to the original dictionary?

I noticed that the original dictionary.bz2 is 2.7 GB, which is much larger than the dictionary extracted from basic_data.zip (crosswikis_p_e_m.txt, 789 MB).

Thank you.

octavian-ganea commented 6 years ago

I created crosswikis_p_e_m.txt from the original Crosswikis a very long time ago, and unfortunately I do not have the code for it. But afaik I only removed mentions that contain the substring "wikipedia" and converted the remaining dictionary into the format of crosswikis_p_e_m.txt.
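
The filtering step described above could be sketched roughly as follows. This is not the original script (which, per the comment, no longer exists), only a minimal illustration under stated assumptions: each input line is tab-separated with the mention string as its first field, the function name `drop_wikipedia_mentions` is hypothetical, and matching is case-insensitive by default because the thread does not say whether case mattered. The sample lines are made up, and the conversion into the crosswikis_p_e_m.txt format is not shown because the thread does not describe it.

```python
from typing import Iterable, Iterator


def drop_wikipedia_mentions(
    lines: Iterable[str], case_sensitive: bool = False
) -> Iterator[str]:
    """Yield only lines whose mention does not contain "wikipedia".

    Assumption: the mention is the first tab-separated field of each line.
    Assumption: case-insensitive matching by default (not stated in the thread).
    """
    for line in lines:
        mention = line.split("\t", 1)[0]
        haystack = mention if case_sensitive else mention.lower()
        if "wikipedia" not in haystack:
            yield line


# Made-up sample lines, for illustration only.
sample = [
    "Paris\t0.9 Paris",
    "wikipedia article on Paris\t0.5 Paris",
    "Wikipedia\t1.0 Wikipedia",
]
kept = list(drop_wikipedia_mentions(sample))
kept_case_sensitive = list(drop_wikipedia_mentions(sample, case_sensitive=True))
```

Because the function works on any iterable of lines, it can be applied to an open file handle without loading the whole dictionary into memory, which matters for a multi-gigabyte input.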

zisding commented 6 years ago

Thank you.