dbpedia / GSoC

Google Summer of Code organization

Extracting Table of Contents (TOCs) for Articles #3

Closed mgns closed 5 years ago

mgns commented 6 years ago

Description

Each Wikipedia article is structured by headings and subheadings. This structure indicates which aspects are relevant for the described entity, so extracting it can help categorize entities and the facts about them. For example, articles on cities usually have sections on History, Geography and Demographics, while articles on soccer clubs have sections on Honours, Players and Stadiums. There are pitfalls: sections are not uniformly captioned, so an alignment between variations (ideally to DBpedia resources) would be helpful. The newly created dataset should follow Linked Data principles: it should use a sufficiently expressive vocabulary to describe TOCs (ideally as resources), the order of TOC entries, etc. Optionally, it would be interesting to use the dataset in a meaningful application, e.g. generating missing types.

Goals

Extract TOCs from article pages and produce an RDF dataset describing the article TOCs in a comprehensive way.
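As a rough illustration of what such a dataset could look like, the sketch below serializes the headings of one article as N-Triples, with each TOC entry as its own resource carrying a label, an order and a parent. The `http://example.org/toc#` namespace, the property names `label`, `order` and `parent`, the `#toc-N` entry IRIs and the `@en` language tag are all assumptions made for this example; the issue does not prescribe a vocabulary.

```python
from urllib.parse import quote

# Hypothetical vocabulary namespace, used only for illustration.
TOC = "http://example.org/toc#"
DBR = "http://dbpedia.org/resource/"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"


def toc_to_ntriples(article, headings):
    """Serialize (level, label) headings of one article as N-Triples lines."""
    article_iri = DBR + quote(article.replace(" ", "_"))
    parents = {}  # heading level -> most recent entry still open at that level
    lines = []
    for order, (level, label) in enumerate(headings, start=1):
        entry = f"<{article_iri}#toc-{order}>"
        # Forget entries at the same or a deeper level; they cannot be parents.
        parents = {lvl: node for lvl, node in parents.items() if lvl < level}
        # The parent is the nearest enclosing heading, or the article itself.
        parent = parents[max(parents)] if parents else f"<{article_iri}>"
        parents[level] = entry
        # Escape backslashes and quotes so the label is a valid literal.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{entry} <{TOC}label> "{escaped}"@en .')
        lines.append(f'{entry} <{TOC}order> "{order}"^^<{XSD_INT}> .')
        lines.append(f"{entry} <{TOC}parent> {parent} .")
    return lines


for line in toc_to_ntriples("Munich", [(2, "History"), (3, "Origin"), (2, "Geography")]):
    print(line)
```

Storing the order as an explicit integer keeps the sequence of entries queryable without relying on RDF lists; other modelling choices are equally possible.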

Impact

A new dataset that can be used in various ways, and insights into aspects of DBpedia entities.

Warm up tasks

pratyusha972 commented 6 years ago

@mgns, I am interested in working on this project. Can you please guide me on how to get started?

mgns commented 6 years ago

I added a warmup task to this idea. There are two main approaches to solving this task:

  1. write a new extractor that extracts the TOCs from wikitext
  2. write a script which processes the latest NIF dataset

As the first one is the more straightforward solution, you should familiarize yourself with the extraction framework.
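To get a feel for the first approach outside the framework, here is a minimal Python sketch that pulls headings out of raw wikitext with a regular expression. It is not the extraction framework's API, and the function name `extract_headings` is made up for this example. It assumes headings are written as `== Label ==` on their own line and ignores templates, comments and `<nowiki>` blocks.

```python
import re

# Matches wikitext headings such as "== History ==" or "=== Early years ===".
# The backreference \1 requires the same number of "=" on both sides.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def extract_headings(wikitext):
    """Return (order, level, label) tuples for each heading, in document order."""
    headings = []
    for order, match in enumerate(HEADING_RE.finditer(wikitext), start=1):
        level = len(match.group(1))  # "==" is level 2, "===" is level 3, ...
        headings.append((order, level, match.group(2)))
    return headings


sample = """Intro text.
== History ==
Some text.
=== Early years ===
More text.
== Geography ==
"""

for entry in extract_headings(sample):
    print(entry)
```

The level and order together are enough to rebuild the heading hierarchy later on.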

When writing your proposal, you will have to describe your suggested solution for the problem.

icemc commented 6 years ago

Hello @mgns I'm also interested in working on this. When I am done with the warm up task, how should I let you know about my progress?

mgns commented 6 years ago

Simply summarize your findings in a Google Doc and share it with me.

khikmatullaev commented 6 years ago

@mgns, could you give me your Gmail address? I want to share the results of the warm up task. Also, I could not find out how to add myself to the DBpedia Slack chat. Could you give me instructions?

hrishikeshh commented 6 years ago

Hi @khikmatullaev, to join the DBpedia Slack forum, go to this link, enter your e-mail address and verify it.

mgns commented 6 years ago

Just share it to: knuth@informatik.uni-leipzig.de

Thanks!

mgns commented 6 years ago

Just invite me: knuth@informatik.uni-leipzig.de

Thanks!

shubhamvb commented 6 years ago

Hi @mgns, I read the project description and I am interested in working on it. I have a few questions. I am not sure I understand what you mean when you say the articles are not uniformly captioned and hence an alignment is necessary. What does alignment refer to in this case? Can you please elaborate a bit?

Thanks!

mgns commented 6 years ago

The project should first extract a TOC for each article in Wikipedia. The TOC should contain all headings and subheadings of the article with their respective labels and order. To make these TOC entries easier to compare, it would be nice if they were mapped to some common vocabulary. Take for example the entry "Life" in the article on Vincent van Gogh and "Biography" in the article on Aline Charigot. Both entries denote similar or equal concepts, and it would be good to capture this in the dataset, e.g. by mapping both to a DBpedia resource such as http://dbpedia.org/resource/Biography. Sometimes a specific resource is used as a heading, e.g. "Munich International Airport" in the article on Munich. Such a mapping might be partial.
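A minimal sketch of such an alignment, assuming a hand-written lookup table: heading labels are normalized and looked up, and labels without an entry simply stay unmapped, which is what makes the mapping partial. The table `ALIGNMENT` and the helpers `normalize` and `align` are hypothetical names for this example; a real solution would probably need something more robust than exact string matching.

```python
DBR = "http://dbpedia.org/resource/"

# Hypothetical, hand-written alignment table:
# normalized heading label -> DBpedia resource IRI.
ALIGNMENT = {
    "life": DBR + "Biography",
    "biography": DBR + "Biography",
}


def normalize(label):
    """Lower-case the label and collapse runs of whitespace."""
    return " ".join(label.casefold().split())


def align(label):
    """Return the aligned DBpedia resource, or None if the label is unmapped."""
    return ALIGNMENT.get(normalize(label))


for heading in ["Life", "Biography", "Gallery"]:
    print(heading, "->", align(heading))
```

Here "Life" and "Biography" resolve to the same resource, while "Gallery" stays unmapped.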