dbpedia / mappings-tracker

This project is used for tracking mapping issues in mappings.dbpedia.org
review 4 en mappings that are on ignorelist #40

Closed VladimirAlexiev closed 9 years ago

VladimirAlexiev commented 9 years ago

I looked at https://github.com/dbpedia/extraction-framework/blob/master/server/src/main/statistics/ignorelist_en.txt. It has:

I did "curl -I" for each of the templates. There are 4 exceptions:

Each needs to be reviewed, and potentially deleted
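The `curl -I` check described above could be scripted roughly as follows. This is only a sketch: it assumes mapping pages live at `http://mappings.dbpedia.org/index.php/Mapping_en:<Template>` and that a missing page answers with a non-200 status. The function names and the injectable `status_of` parameter are made up for illustration, and parsing of `ignorelist_en.txt` is left out because its format is not shown here.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed URL pattern for English mapping pages.
BASE = "http://mappings.dbpedia.org/index.php/Mapping_en:"


def mapping_url(template):
    """Build the (assumed) mapping page URL for a template name."""
    return BASE + quote(template.strip().replace(" ", "_"))


def head_status(url, timeout=10.0):
    """Send a HEAD request (like `curl -I`) and return the HTTP status, or None on network failure."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code
    except URLError:
        return None


def ignored_but_mapped(templates, status_of=head_status):
    """Return the ignored templates whose mapping page responds with 200."""
    return [t for t in templates if status_of(mapping_url(t)) == 200]
```

Calling `ignored_but_mapped(names)` with the template names from the ignore list would return the entries that need review; passing a stub as `status_of` lets the logic be tried without network access.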

Nono314 commented 9 years ago

As I understand it, the ignore list is mainly used for statistical purposes and doesn't prevent extraction.

I would prefer to have a comment stating why a given entry was placed on the ignore list, so that it can be reviewed and taken off the list if needed.

Entries like this one seem perfectly OK to me.

jcsahnwaldt commented 9 years ago

All templates should be either mapped or ignored. That's an _exclusive or_. Even better: get rid of ignore lists. See #39.

If a template is on the ignore list, it basically means it doesn't make sense to add a mapping for this template (because it's not an infobox). We exclude it from the statistics because it would give the impression that the mapping coverage is worse than it really is.

On the other hand, if there is a useful mapping for a template, it should not be on the ignore list.
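The "exclusive or" rule can be checked with plain set operations. A minimal sketch, assuming the three collections of template names are already available; the function name and the result keys are hypothetical:

```python
def check_exclusive_or(all_templates, mapped, ignored):
    """Report templates that violate 'either mapped or ignored, never both'."""
    all_templates, mapped, ignored = set(all_templates), set(mapped), set(ignored)
    return {
        # On the ignore list although a mapping exists.
        "both": sorted(mapped & ignored),
        # Neither mapped nor ignored, so they still count against coverage.
        "neither": sorted(all_templates - mapped - ignored),
    }
```

An empty `both` list means no template is mapped and ignored at the same time.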

My opinion:

Mapping_en:Infobox_venue and Mapping_en:Infobox_dam should not be on the ignore list. Having a mapping for them is fine.

Mapping_en:Persondata should probably be deleted, and it should be on the ignore list. Persondata is handled by PersondataExtractor.scala. Or maybe it's OK to duplicate the job of PersondataExtractor as a mapping...

Not sure about Authority_control - it's not an infobox, but it seems to provide useful data.

I think the main problem with templates like Authority_control was that they are not specific enough to extract a type for the resource. Previously, this could lead to problems because we would create a new sub-resource URI if we found multiple templates defining a type for the resource. But I think we now have a more sophisticated rule: if the new type is compatible with the type previously found (i.e. it is a sub- or supertype), don't create a new resource URI.
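The compatibility rule described above can be illustrated like this. It is a toy sketch, not the framework's actual code: the `PARENT` hierarchy is a made-up fragment and all names are hypothetical.

```python
# Toy class hierarchy (child -> parent); an assumption for illustration only.
PARENT = {
    "Person": "Agent",
    "Agent": "owl:Thing",
    "Dam": "Place",
    "Place": "owl:Thing",
}


def ancestors(cls):
    """Return cls followed by all of its supertypes, nearest first."""
    chain = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


def compatible(a, b):
    """Two types are compatible if one is a sub- or supertype of the other (or they are equal)."""
    return a in ancestors(b) or b in ancestors(a)


def needs_new_resource(existing_type, new_type):
    """A new sub-resource URI is only needed when the types are incompatible."""
    if existing_type is None:
        return False
    return not compatible(existing_type, new_type)
```

Under this rule a very general type such as `owl:Thing` is compatible with every other type in the hierarchy, so an unspecific template would not trigger a new resource URI.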

VladimirAlexiev commented 9 years ago

> we would create a new sub-resource URI if we found multiple templates

Sounds crazy. What's the link to that sub-resource?

Authority_control is very important; Wikidata is making great inroads in that area.

I vote to delete 1,2,4 from ignorelist ASAP. As for Persondata: https://github.com/dbpedia/extraction-framework/issues/344

jimkont commented 9 years ago

Maybe we can change it to a conditional mapping depending on the existing properties and leave owl:Thing only for GND.

Good idea :) We could also add the formatter URLs using prefix/suffix, e.g. http://mappings.dbpedia.org/index.php/Mapping_commons:Chemical_structure_verified
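To make the prefix/suffix idea concrete, here is a small sketch of turning a raw identifier into a URL. The `FORMATTERS` table and the function name are hypothetical, and the two prefixes are only illustrative values, not something taken from the mapping linked above.

```python
# Illustrative (prefix, suffix) pairs per identifier scheme; assumed values.
FORMATTERS = {
    "VIAF": ("https://viaf.org/viaf/", ""),
    "GND": ("https://d-nb.info/gnd/", ""),
}


def format_identifier(scheme, value):
    """Wrap a raw identifier value in the prefix and suffix of its scheme."""
    prefix, suffix = FORMATTERS[scheme]
    return f"{prefix}{value.strip()}{suffix}"
```

An unknown scheme raises a `KeyError`, which keeps unmapped identifier types from silently producing broken URLs.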

Nono314 commented 9 years ago

My point was that the ignore list can be challenged, not just the mappings. People might write a mapping to extract useful data from a template that is supposed to be ignored, and not even know how to edit the ignore list.

I guess many of the templates on the ignore list were put there because of limitations the framework had at the time, such as the one mentioned by @jcsahnwaldt. The problem is that there's no record of that, so they can't be removed from the list once the issue is solved. In that respect, having ignore templates and discussions as proposed in #39 would be very useful.

If I have a look at the most used ignored templates I can see:

So maybe a review of the ignore list could be performed?

VladimirAlexiev commented 9 years ago

@Nono314: I agree with your sentiment that some templates exiled to the ignore list can be usefully mapped, e.g. I did Listen->soundRecording. Please make a separate task to review the ignore list, and maybe another for the 2 templates that you mention.

We need a write-up on mapping best practices! Succession Box and other Politician templates are some of the most complicated. I gave an example here: http://mappings.dbpedia.org/index.php/Rewriting_templateProperty#Wikipedia_Prop_Structures but we need a separate page.

jimkont commented 9 years ago

Just make sure these templates don't result in many unnecessary intermediate nodes