mgns commented 6 years ago

Description

DBpedia has diverse mapping communities. Amphisemy between languages, or a lack of coordination, can lead to the same concept being mapped to different properties in different languages. A classic example is the elevation of a mountain, which is mapped as dbo:elevation in most languages but as dbo:height in Spanish, because both concepts share the same term in that language. This project is meant to identify and, ideally, correct all such mappings in DBpedia. The working approach is defined in “Predicting Incorrect Mappings: A Data-Driven Approach Applied to DBpedia”.
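
To make the problem concrete, here is a minimal sketch that flags a language whose mapping disagrees with the majority of the other languages. It is not the method from the paper (which is data-driven and learns from instance data); the `mappings` dictionary, the concept label, and the `flag_outliers` helper are hypothetical and only mirror the elevation example above.

```python
from collections import Counter

# Hypothetical toy data: (concept, language) -> ontology property used in the mapping.
mappings = {
    ("mountain elevation", "en"): "dbo:elevation",
    ("mountain elevation", "de"): "dbo:elevation",
    ("mountain elevation", "fr"): "dbo:elevation",
    ("mountain elevation", "es"): "dbo:height",
}

def flag_outliers(mappings):
    """Return (concept, language, used_property, majority_property) for every
    mapping that disagrees with the most common property for its concept."""
    by_concept = {}
    for (concept, lang), prop in mappings.items():
        by_concept.setdefault(concept, {})[lang] = prop

    flagged = []
    for concept, per_lang in by_concept.items():
        majority, _ = Counter(per_lang.values()).most_common(1)[0]
        for lang, prop in sorted(per_lang.items()):
            if prop != majority:
                flagged.append((concept, lang, prop, majority))
    return flagged

flagged = flag_outliers(mappings)
print(flagged)
```

A majority vote like this only works when the concepts are already aligned across languages, which is exactly the hard part; it is meant as an illustration of what a "misaligned mapping" looks like, not as a solution.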

Goals

Apply and improve the ideas of the paper on the actual DBpedia mappings. The paper provides a working proof of concept, but it needs to run in the open and adjustments will probably be needed. The goal is to apply the techniques to as many language pairs as possible and identify all misaligned mappings. A great add-on would be a simple interface where the mapping community is shown the identified wrong mappings and can vote on whether each mapping is indeed incorrect and, if so, suggest the proper mapping.

Impact

A very big impact on DBpedia data quality.

Warm up tasks

Read the latest main DBpedia paper to get to know how the framework and the mappings work
Write a few actual infobox mappings in your language
Read this paper
Re-run the experiments described in the paper. Here you may find the training set: https://www.openml.org/s/53
Experiment with a few other algorithms

Mentors

Mariano Rico, Nandana Mihindukulasooriya, Dimitris Kontokostas

Keywords

Machine learning, schema alignment

nandana commented 6 years ago

@mgns the link to the paper seems to be broken. Can you please use this one? Thanks! https://svn.aksw.org/papers/2018/SAC_DBpedia_mappings_alignment/public.pdf

marija-stanojevic commented 6 years ago

Hello,

I read your paper and I am interested in working on this task for GSoC 2018. Did you account for class imbalance in your algorithm? I think the great results you got may be a consequence of high class imbalance, and that in cases where the labeled data set is more balanced the results are much worse (93% vs 67%). Is that a possible explanation?
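
A small sketch of why class imbalance can inflate a headline score: a classifier that always predicts the majority class gets high plain accuracy but only chance-level balanced accuracy. The 90/10 split and the labels below are hypothetical and are not taken from the paper's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced labels: 90 correct mappings, 10 incorrect ones.
y_true = ["correct"] * 90 + ["incorrect"] * 10
# A trivial classifier that always predicts the majority class.
y_pred = ["correct"] * 100

acc = accuracy(y_true, y_pred)           # 0.9
bal = balanced_accuracy(y_true, y_pred)  # 0.5
print(acc, bal)
```

Reporting a per-class or balanced metric next to plain accuracy would make it easier to check whether the hypothesis about imbalance holds.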

I'd like to do this project by adding some more features (for example, in the case of code and postalCode, the value structure/size in two different languages may be a good feature) and also by trying some other algorithms (that would use the network to improve results). If you didn't account for class imbalance, I'd also account for it or create a more balanced labeled data set. Is there any chance of getting more labeled data of the "negative" kind to balance the classes? Can we find experts who would validate the prediction results? This could be a helpful and less labor-intensive way to build a larger labeled data set.
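
A minimal sketch of the value structure/size idea: summarise the literal values of a property in each language and use the gap between the two summaries as features. The sample values, the property pairing, and the `value_profile` / `profile_gap` helpers are hypothetical; real values would come from the extracted DBpedia data.

```python
def value_profile(values):
    """Summarise literal values: mean length and share of digit characters."""
    total_chars = sum(len(v) for v in values)
    digits = sum(ch.isdigit() for v in values for ch in v)
    return {
        "mean_len": total_chars / len(values),
        "digit_ratio": digits / total_chars if total_chars else 0.0,
    }

def profile_gap(values_a, values_b):
    """Absolute difference between two value profiles, one feature per key."""
    a, b = value_profile(values_a), value_profile(values_b)
    return {key: abs(a[key] - b[key]) for key in a}

# Hypothetical values for two properties that were mapped to each other.
en_postal_code = ["10115", "75001", "28013"]
es_code = ["MAD", "BCN", "SVQ"]

gap = profile_gap(en_postal_code, es_code)
print(gap)
```

A large gap suggests the two properties hold differently shaped values, which could be one signal among others that the mapping is wrong.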

I imagine this as a summer research project whose results would be a new working algorithm with better results (on a balanced data set) and possibly a paper. Is this what you had in mind for this project?

nandana commented 6 years ago

Thanks a lot @mstanojevic118 for the interest in this project!

The class imbalance and the limited number of annotations were definitely two weak points in the study. Though we could have rectified the class imbalance using an oversampling technique such as SMOTE, we believe it would be much better if we could get more annotations for wrong mappings. Your hypothesis that the high results are due to the class imbalance is something that would be worth validating.
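
For readers unfamiliar with SMOTE, here is a simplified sketch of its core idea: create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. This is not what was done in the study and it is not the imbalanced-learn implementation; the `smote_like` function, its parameters, and the toy points are assumptions made for illustration.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points, each lying on the segment between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical 2-D feature vectors for the minority ("incorrect mapping") class.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Synthetic points only fill in the region already covered by the annotated examples, which is one reason why collecting more real annotations of wrong mappings is preferable.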

We did the annotation with the help of the DBpedia community and we still have a lot of data that has not been annotated. We will be happy to make it publicly available and we can make a plan to get it annotated during the early part of the project. You can also get it annotated with the help of your colleagues if they are familiar with the language and have some SemWeb background. But, as you said, that is one area where we would like to improve this study.

Further, it would also be nice to add new features that help to distinguish the wrong mappings. We believe this could help to improve the results.

In addition to what you proposed, it would also be nice to investigate how this work can be integrated into the DBpedia Framework so that it is actually used to improve the DBpedia mappings and data.

I believe it would be good if you started a shared Google Doc with your ideas for the improvements and your plan, so that we can give you continuous feedback during the proposal preparation phase.

@MarianoRico @jimkont please add if I missed something.

vfrico commented 6 years ago

I find this proposal really interesting and challenging. As @mstanojevic118 mentions, the class imbalance could have introduced a bias into the classification. Generating more labeled examples would be a good piece of work for the GSoC period.

I would like to play a bit with knowledge graph embeddings on different DBpedias and find out whether it is possible to identify "relevant" relations.

I'm working on my proposal and will then share it with the mentors.

nandana commented 6 years ago

This is just a friendly reminder that it's only 6 days to go!

Please feel free to share your proposals with us before the official submission so that we can give you some feedback on how to improve them.

marija-stanojevic commented 6 years ago

Hi, I am sorry for not sending you my proposal yet. It has been a hectic 10 days at my university. I can share my draft later today. Is it fine if I do that through the "share your draft" part of the GSoC application?

nandana commented 6 years ago

@mstanojevic118 @vfrico sure, please use the "share your draft" functionality of the app!

MarianoRico commented 6 years ago

Dear @mstanojevic118 and @vfrico, thanks for your comments. Besides the theoretical enhancements (imbalance and/or new features), I also consider it very relevant to focus on how we can move this handcrafted approach to something more automated. I imagine a web tool aimed at helping users annotate efficiently.

marija-stanojevic commented 6 years ago

Hello @nandana @MarianoRico, I shared my proposal with you yesterday (early this morning, European time) through the GSoC dashboard. I have some questions in it, so it would be very helpful if you could check my proposal and add some comments. Thank you.

nandana commented 6 years ago

Thanks for sharing the draft with us! I went through the current version today and left you some comments!

vfrico commented 6 years ago

I've also shared a draft on the GSoC page. I hope that @nandana @MarianoRico @jimkont can take a look and give me some feedback. Thanks!

nandana commented 6 years ago

Thank you @vfrico! We will go through the proposal and provide some feedback today.

dbpedia / GSoC

Automatic schema alignment between DBpedia mappings in different languages #15
