THU-KEG / DacKGR

Source codes and datasets for EMNLP 2020 paper "Dynamic Anticipation and Completion for Multi-Hop Reasoning over Sparse Knowledge Graph"
MIT License

data process #3

Open chrislouis0106 opened 2 years ago

chrislouis0106 commented 2 years ago

Hi there. Would you consider releasing the code used to build the NELL23K and WD-singer datasets? Also, did you download your Wikidata data directly from the official site “https://www.wikidata.org/wiki/Wikidata:Database_download/en”? And did you then create the triples by pairing each entity with its corresponding concept? If so, such a dataset seems too sloppy!

davidlvxin commented 2 years ago

This work was done early last year, and I can't find the original generation code, but I still remember the processing idea.

We built the Wikidata dataset on top of KACC-large. Specifically, we selected the concepts whose labels contain the word "singer" and took the entities belonging to those concepts as the seed entity set. However, there are few direct edges between these entities, so on their own they do not fully reflect singer-related knowledge, such as a singer's birthplace. We therefore randomly added some of the high-frequency entities connected to the seed entities. The ratio of newly added entities to original seed entities is about 2:5. After that, we formed the relation set by keeping only the higher-frequency relations between the entities. Finally, we used the entity and relation sets to extract the corresponding triples from the KACC-large entity triples as our dataset.
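Since the original generation code is not available, the steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the code that produced WD-singer: the function name, the input layout (a list of triples plus two dictionaries), and the `top_k_neighbors` and `min_relation_freq` cut-offs are all hypothetical. Only the "singer" keyword and the roughly 2:5 ratio come from the description.

```python
import random
from collections import Counter


def build_domain_subgraph(triples, entity_concepts, concept_labels,
                          keyword="singer", add_ratio=0.4,
                          top_k_neighbors=1000, min_relation_freq=2, seed=0):
    """Hypothetical sketch of the described pipeline.

    triples:         list of (head, relation, tail) entity triples
    entity_concepts: dict mapping entity -> set of concept ids
    concept_labels:  dict mapping concept id -> label string
    add_ratio:       added entities / seed entities (2:5 = 0.4)
    The two cut-offs are assumed values, not taken from the paper.
    """
    rng = random.Random(seed)

    # 1. Concepts whose label contains the keyword.
    concepts = {c for c, label in concept_labels.items()
                if keyword in label.lower()}

    # 2. Seed entities: entities that belong to one of those concepts.
    seeds = {e for e, cs in entity_concepts.items() if cs & concepts}

    # 3. Count how often each non-seed entity is linked to a seed,
    #    then randomly pick extra entities among the most frequent ones.
    neighbor_freq = Counter()
    for h, _, t in triples:
        if h in seeds and t not in seeds:
            neighbor_freq[t] += 1
        elif t in seeds and h not in seeds:
            neighbor_freq[h] += 1
    frequent = [e for e, _ in neighbor_freq.most_common(top_k_neighbors)]
    n_add = min(len(frequent), round(add_ratio * len(seeds)))
    entities = seeds | set(rng.sample(frequent, n_add))

    # 4. Keep only relations that occur often enough between the
    #    selected entities.
    inner = [(h, r, t) for h, r, t in triples
             if h in entities and t in entities]
    rel_freq = Counter(r for _, r, _ in inner)
    relations = {r for r, n in rel_freq.items() if n >= min_relation_freq}

    # 5. Final dataset: triples over the kept entities and relations.
    return [(h, r, t) for h, r, t in inner if r in relations]
```

How "high frequency" was actually defined for entities and relations is not stated here, so the two cut-offs are placeholders to tune.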

We acknowledge that constructing the domain dataset in this way is rather crude. However, the construction of the dataset is not the main contribution of our work. We aim to validate the effectiveness of our model on sparse knowledge graphs. Finally, we call for more work to construct accurate datasets with manual quality control to advance the field.

chrislouis0106 commented 2 years ago

Thanks to your clear explanation and the KACC paper, I now understand how the dataset was processed. In fact, most open-domain knowledge graphs are sparse, and Wikidata-based knowledge graphs in particular. I guess you extracted only singer-related knowledge at the time just to reduce the size of the graph, and that without adding entities to the extracted graph, the sparsity would be too high to model the reasoning process at all. Thank you very much.