UKPLab / emnlp2017-relation-extraction

Context-Aware Representations for Knowledge Base Relation Extraction
Apache License 2.0

About how to build the wiki dataset #15

Closed YeDeming closed 5 years ago

YeDeming commented 5 years ago

Hi Daniil,

I am a PhD student at Tsinghua University, and I am very interested in your work. Recently, I have tried to link Wikipedia to Wikidata to build a relation dataset with some special restrictions, but I ran into some trouble.

I used string matching to link entities in sentences (recognized by the Stanford CoreNLP toolkit or carrying link annotations) to Wikidata IDs. But I found it does not work very well, since the titles and aliases of different Wikidata entities may be the same.

1) In your paper: "From each sentence in a complete article we extract link annotations and retrieve Wikidata entity IDs corresponding to the linked articles. There is an unambiguous one-to-one mapping between Wikidata entities and Wikipedia articles". I have trouble finding this unambiguous one-to-one mapping between Wikidata entities and Wikipedia articles.

2) For the entities recognized by the Stanford CoreNLP toolkit in a sentence, what role does "using HeidelTime to extract dates" play?

If it is convenient, could you share your code for building the wiki dataset, or your code for entity linking?

Hope to get your help! Thanks a lot in advance! Deming Ye

YeDeming commented 5 years ago

Never mind, I found the unambiguous one-to-one mapping between Wikidata entities and Wikipedia articles.

daniilsorokin commented 5 years ago

Hi!

For this dataset, we have only mapped the Wikipedia link annotations to Wikidata IDs. If you have a Wikipedia link, you can use the Wikidata sitelinks (https://www.wikidata.org/wiki/Help:Sitelinks) for the one-to-one mapping between Wikipedia articles and Wikidata entities. Is that the one you have found?
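For reference, the sitelinks lookup can be sketched like this. This is only a minimal sketch: the `sample_response` dict below is a hand-written stand-in for the JSON that the Wikidata API returns for `action=wbgetentities&sites=enwiki&titles=...`; in practice you would fetch it over HTTP.

```python
# Minimal sketch of resolving a Wikipedia article title to a Wikidata ID
# via sitelinks. `sample_response` mimics the shape of a wbgetentities
# API response; in practice you would request it from wikidata.org.

def wikidata_id_from_sitelink(response):
    """Return the Wikidata entity ID for the single article in `response`."""
    for entity_id in response.get("entities", {}):
        if entity_id.startswith("Q"):  # "-1" marks a missing title
            return entity_id
    return None

# Hand-written stand-in for an API response (illustrative values).
sample_response = {
    "entities": {
        "Q76": {
            "id": "Q76",
            "sitelinks": {"enwiki": {"site": "enwiki", "title": "Barack Obama"}},
        }
    }
}

print(wikidata_id_from_sitelink(sample_response))  # Q76
```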

Linking all entities recognised by the Stanford CoreNLP NE tool is much harder. As you correctly mention, this is a problem of entity linking. You can link some named entities to Wikidata by simple string matching, but for a general case you would need a full entity linker.
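A conservative version of that string matching can be sketched as follows. The `label_index` here is a toy stand-in (with partly illustrative IDs) for an index you would build from Wikidata labels and aliases; the point is to link a mention only when it matches exactly one entity.

```python
# Sketch of linking mentions to Wikidata IDs by exact label matching,
# keeping only unambiguous matches. `label_index` is a toy stand-in for
# an index built from Wikidata labels and aliases (IDs partly illustrative).

label_index = {
    "Barack Obama": ["Q76"],
    "Obama": ["Q76", "Q41773"],  # ambiguous: the person, a city in Japan
    "Tsinghua University": ["Q16955"],
}

def link_mention(mention):
    """Return the Wikidata ID if the mention maps to exactly one entity."""
    candidates = label_index.get(mention, [])
    if len(candidates) == 1:
        return candidates[0]
    return None  # no match, or ambiguous -> leave unlinked

print(link_mention("Barack Obama"))  # Q76
print(link_mention("Obama"))         # None (ambiguous)
```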

We have shared the code for our entity linker here: https://github.com/UKPLab/starsem2018-entity-linking. Unfortunately, you'd need a local installation of Wikidata for it to work. There are other entity linkers that you might want to consider, for example DBpedia Spotlight. Anyway, as I said above, we didn't use an entity linker for the relation extraction dataset.

HeidelTime is just a tool to recognise dates in text, and we used it to extract date values. Dates are not entities and you don't need to link them to Wikidata, but some relations connect entities and dates (for example "founded in"). If an entity and a date appear in the same sentence, we have extracted relations for such a pair as well.
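The entity-date pairing described above can be sketched roughly like this. The input format is a hypothetical simplification: entities are (surface form, Wikidata ID) tuples, and dates are normalised strings of the kind a date tagger such as HeidelTime would produce.

```python
# Sketch of generating candidate (entity, date) pairs from one sentence,
# as used for date-valued relations such as "founded in". The input
# format is a hypothetical simplification of the real pipeline.

from itertools import product

def entity_date_pairs(entities, dates):
    """All candidate pairs of a linked entity and a date in the sentence."""
    return list(product(entities, dates))

entities = [("Tsinghua University", "Q16955")]
dates = ["1911"]  # value a date tagger would normalise from "in 1911"

for (surface, qid), date in entity_date_pairs(entities, dates):
    # Each pair becomes a candidate instance for date-valued relations.
    print(f"candidate: ({surface} [{qid}], {date})")
```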

The code that we used to put together the dataset was part of a different project and has a lot of dependencies in Java. I will look into whether we can share it here.

I hope this helps! Feel free to post follow-up questions.

Best, Daniil

YeDeming commented 5 years ago

Thanks for your enthusiastic reply!

Wikipedia sitelinks will help me a lot!

I am still a little confused. In your paper: "We extract named entities and noun chunks from the input sentences with the Stanford CoreNLP toolkit to identify entities that are not covered by the Wikipedia annotations (e.g. Obama in the sentence above). We retrieve IDs for those entities by searching through entity labels in Wikidata."
What is the searching process? Is it string matching or something else?

Your entity linking tool is fantastic, but I am worried about the time the neural model would take to process Wikipedia articles. I am trying some statistical methods for entity linking now.

Thank you again! Deming Ye

daniilsorokin commented 5 years ago

Hi!

Oh yes, sorry, that was simple string matching, no fancy entity linking there. We extracted named entities and noun chunks, and if a chunk matched exactly one entity in Wikidata (no ambiguity), then we linked it, too. Obama is actually not a very good example, since there are several entities with the name "Obama".

Yes, entity linking can be very resource heavy.

Best, Daniil

YeDeming commented 5 years ago

Thanks a lot!