dbpedia / mappings-tracker

This project is used for tracking mapping issues in mappings.dbpedia.org

Clean-up mappings to redirected templates #59

Open jimkont opened 9 years ago

jimkont commented 9 years ago

As Wikipedia merges various templates, we should clean up the corresponding mappings as well. One example is http://mappings.dbpedia.org/index.php/Mapping_en:Infobox_Governor, whose template now redirects to Infobox_officeholder in Wikipedia.

Before executing the mappings, the framework resolves all redirects, so whenever we see Infobox_Governor we execute the mappings for Infobox_officeholder. The same happens for the statistics: we merge all redirected templates into the statistics for the target only, in this case Infobox_officeholder.

We have two cases here: 1) We have mappings for both (old and new), so we should delete the old one, which is useless. 2) We don't have any mapping for the new target template, so we should rename the mapping to reflect the correct template. In this second case the framework already handles it automatically, but it's good to stay up to date with Wikipedia.

Again, if we generate a list, we could do that with a bot.
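A minimal sketch of how such a list could be generated, splitting redirected mappings into the two cases above. This is illustrative Python, not the framework's code; `classify_redirected_mappings`, `mapped`, and `redirects` are hypothetical names, and the sample data is made up for the example.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]

def classify_redirected_mappings(
    mapped: Set[str], redirects: Dict[str, str]
) -> Tuple[List[Pair], List[Pair]]:
    """Split mapped, redirected templates into the two cases.

    mapped:    template names that have a mapping page (assumed input)
    redirects: source template -> target template (assumed input)
    """
    both_mapped: List[Pair] = []         # case 1: source and target both mapped
    only_source_mapped: List[Pair] = []  # case 2: only the old template is mapped
    for source, target in sorted(redirects.items()):
        if source not in mapped:
            continue  # no mapping on the redirected template, nothing to clean up
        if target in mapped:
            both_mapped.append((source, target))
        else:
            only_source_mapped.append((source, target))
    return both_mapped, only_source_mapped

# Hypothetical sample data
mapped = {"Infobox_Governor", "Infobox_officeholder", "Infobox_old_town"}
redirects = {
    "Infobox_Governor": "Infobox_officeholder",
    "Infobox_old_town": "Infobox_settlement",
    "Infobox_unmapped": "Infobox_officeholder",
}
both, rename = classify_redirected_mappings(mapped, redirects)
```

Here `both` holds the pairs that need a decision about the old mapping, and `rename` holds the pairs where the mapping page could simply be moved to the target name.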

Nono314 commented 9 years ago

I don't think it's as simple as that...

The framework actually only resolves redirects for templates that do not have a mapping of their own (see here). So if there are mappings for both the source and the target of a redirect, the one that currently applies is still the old one (not useless at all!), and just removing it will change the extraction result. This is especially critical in terms of the mapped class: it is common that several specialized templates are all redirected to a broader one. In that case the redirected templates are probably also mapped to specialised subclasses, while the target template may be mapped to a high-level superclass. So removing the older mapping would just result in a huge loss of accuracy.
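The lookup order described above can be sketched roughly as follows. This is an illustrative Python model under assumptions, not the framework's actual code: `resolve_mapping`, `mappings`, and `redirects` are hypothetical names, the class names are sample values, and only a single redirect hop is followed.

```python
from typing import Dict, Optional

def resolve_mapping(template: str,
                    mappings: Dict[str, str],
                    redirects: Dict[str, str]) -> Optional[str]:
    """Return the class a template would be mapped to, or None."""
    # A template's own mapping wins, even if Wikipedia redirects the template.
    if template in mappings:
        return mappings[template]
    # Only templates without their own mapping fall through to the target.
    target = redirects.get(template)
    if target is not None:
        return mappings.get(target)
    return None

# Hypothetical sample data
mappings = {"Infobox_Governor": "Governor", "Infobox_officeholder": "OfficeHolder"}
redirects = {"Infobox_Governor": "Infobox_officeholder"}

before = resolve_mapping("Infobox_Governor", mappings, redirects)

# Simulate deleting the old mapping: the broader target mapping now applies.
without_old = {k: v for k, v in mappings.items() if k != "Infobox_Governor"}
after = resolve_mapping("Infobox_Governor", without_old, redirects)
```

In this toy data `before` is the specialised class and `after` is the broader one, which is the loss of accuracy described above.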

The problem is, there is no easy way to detect this in the mapping site, except by spotting strange behaviours. For example when looking at the mapping for Infobox Subdivision administrative and testing it you can easily see that all results belong to class Department and do not fit at all with the mapping they're supposed to test... The actual mapping is here (strangely it also appears in the statistics).

Another glitch I have seen is this old mapping that I have moved here since the underlying template [has been redirected](http://fr.wikipedia.org/wiki/Modèle:Insérer dynastie). However it still appears unmapped in the statistics. Do these get updated after a move or only when saving a mapping?

VladimirAlexiev commented 9 years ago

Related issues:

@Nono314, thanks for the clarifications! Now I understand better what caused my initial confusion with all these issues.

Maybe you're right that the "source template name" carries extra info. This matches my observation that https://en.wikipedia.org/w/index.php?title=Template:Infobox_country "has two Syntaxes": "Country or territory" and "Geopolitical organization"

But I still think we should strive NOT to use redirected maps. Else the extractor behavior becomes too complex and confusing. My confusion and your observations of glitches bear this out. In addition to periodic cleanups, https://github.com/dbpedia/mappings-tracker/issues/3 asks for some warning when a redirected map is visited/listed. Hopefully shared code can be used for both tasks.

@jimkont: I don't think we should junk a source template if there already is a redirected template. I think we should merge the source and redirected templates manually, using eg comparison tables like at http://mappings.dbpedia.org/index.php/Mapping_en_talk:Infobox_country. We can emit different classes based on a discriminator field (eg for Infobox_country it could be the field "membership").
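As a rough illustration of the discriminator idea, here is a Python sketch that picks a class depending on whether a given field is filled in. Everything in it is an assumption for the example: the function name `choose_class`, the class names, and the rule that a non-empty "membership" field indicates an organization. It models the decision only, not how the mapping wiki would express it.

```python
from typing import Dict

def choose_class(properties: Dict[str, str],
                 discriminator: str = "membership",
                 class_if_set: str = "Organisation",
                 class_if_unset: str = "Country") -> str:
    """Pick a class for one infobox instance based on a discriminator field."""
    # Treat a missing or whitespace-only value as "not set".
    value = properties.get(discriminator, "").strip()
    return class_if_set if value else class_if_unset

# Hypothetical infobox instances using the same merged template
org_like = {"name": "Some Union", "membership": "28 states"}
country_like = {"name": "Some Country", "capital": "Some City"}
```

With this rule, `choose_class(org_like)` selects the organization class and `choose_class(country_like)` the country class, so one merged mapping can still emit different classes.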

jimkont commented 9 years ago

@Nono314 you are right and yes you need to make an edit to trigger the statistics update

@VladimirAlexiev I agree the more mappings we have the more complex it will be to manage them in the future. And yes, merge makes more sense indeed.

Nono314 commented 9 years ago

@VladimirAlexiev I agree, we should avoid mapping redirected templates. I was just pointing out that bluntly trashing the redirected mappings would lead to a loss of accuracy. Restoring it by finding a discriminator is perfectly fine.

Detection can easily be automated, but I don't think fixing should also be automated. It should remain a manual merge, as you say.

Actually, I have found out that there already was code in the statistics module to detect those redirected templates. It was just not working because of a bug here (the intersection was always empty, since redirects holds template names with the namespace prefix while mappings holds them without it).
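The kind of mismatch described here can be reproduced in a few lines. This is an illustrative Python sketch, not the statistics module's code: the `Template:` prefix, the helper `strip_namespace`, and the sample names are assumptions for the example (the real prefix depends on the language edition).

```python
TEMPLATE_NS = "Template:"  # assumed namespace prefix for the English wiki

def strip_namespace(title: str, prefix: str = TEMPLATE_NS) -> str:
    """Drop the namespace prefix if present, otherwise return the title as-is."""
    return title[len(prefix):] if title.startswith(prefix) else title

# Hypothetical sample data: redirects carry the namespace, mappings do not.
redirect_sources = {"Template:Infobox_Governor", "Template:Infobox_mayor"}
mapped_templates = {"Infobox_Governor", "Infobox_officeholder"}

# Comparing the raw names never matches, so nothing is ever reported.
naive = redirect_sources & mapped_templates

# Normalizing one side first makes the intersection meaningful.
fixed = {strip_namespace(t) for t in redirect_sources} & mapped_templates
```

Here `naive` is empty while `fixed` contains the one mapped template that is also a redirect source.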

I fixed this in https://github.com/dbpedia/extraction-framework/pull/367, which also adds a new column to the template statistics page showing, for each template, the number of mapped properties that are not found. This should make it quite simple to spot and hopefully fix these issues.

jimkont commented 9 years ago

I merged & deployed the PR. Can anyone give some feedback on @Nono314's new statistics page?

VladimirAlexiev commented 9 years ago

@Nono314, that is super! Not only the last column, but notes like "Музикален изпълнител: NOTE: the mapping for Музикална група is redundant!" that point out the redirected templates.

jimkont commented 9 years ago

I made a separate page for the redirects in every language, e.g. http://mappings.dbpedia.org/server/mappings/en/redirects/

VladimirAlexiev commented 9 years ago

Currently EN has 50 redirected templates, DE 2, FR 1, and BG 0. We need to gradually go through the list and merge each redirected template's mapping into the target template's mapping.