dbpedia / mappings-tracker

This project is used for tracking mapping issues in mappings.dbpedia.org

xxxDate vs xxxYear properties #56

Open Nono314 opened 9 years ago

Nono314 commented 9 years ago

Most of the xsd:date properties in the ontology have a twin property of type gYear (but a few do not). Many mappings, arguably good ones, thus map every date property in a template to both variants in the ontology. But there are also a lot of mappings that don't follow this pattern and therefore lose data during extraction.

This complex and seemingly strange pattern was almost certainly crafted as a workaround for the current limitations of the DateTimeParser, which uses the target property type to determine what to look for in the input template property. If the target is an xsd:date, it will look for a full date, and optionally allow a missing day, extracting a gYearMonth value, but will simply ignore a year-only date. That's why a second property was needed to record every bit of information.
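To make the behaviour described above concrete, here is a minimal sketch in Python (the framework itself is Scala, and this is not its code). The function name `parse_for_type`, the regular expressions, and the ISO-like input format are all assumptions for illustration; the real parser handles wiki markup and many date formats.

```python
import re
from typing import Optional, Tuple

# Hypothetical, simplified patterns: the real parser deals with wiki text,
# not clean ISO strings.
FULL_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
YEAR_MONTH = re.compile(r"\d{4}-\d{2}")
YEAR_ONLY = re.compile(r"\d{4}")


def parse_for_type(value: str, target_type: str) -> Optional[Tuple[str, str]]:
    """Return (lexical value, datatype), or None if nothing is extracted.

    The target property type alone decides what is looked for.
    """
    text = value.strip()
    if target_type == "xsd:date":
        if FULL_DATE.fullmatch(text):
            return (text, "xsd:date")
        if YEAR_MONTH.fullmatch(text):
            # A missing day is tolerated and yields a year-month value.
            return (text, "xsd:gYearMonth")
        # A year-only value is ignored for an xsd:date target.
        return None
    if target_type == "xsd:gYear":
        # Simplification: this sketch only accepts a bare year here.
        if YEAR_ONLY.fullmatch(text):
            return (text, "xsd:gYear")
        return None
    return None


if __name__ == "__main__":
    print(parse_for_type("1754", "xsd:date"))   # year-only is dropped
    print(parse_for_type("1754", "xsd:gYear"))  # needs the twin property
```

In this sketch a template value of `1754` is lost when mapped only to an xsd:date property, which is the data loss the twin xxxYear property works around.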

But this is making both the ontology and mappings a bit obscure, and also making life harder for anyone trying to query DBpedia. So it would be nice to be able to make all that simpler.

However, there may be a reason why this has lasted for so long. One I can think of is the old InfoboxExtractor, which throws parsers blindly at a property value and keeps the first thing that gets extracted. So if the DateTimeParser were to extract year only, it would also catch many integer values. But this could probably be handled by setting a strict mode on the parser...

So what are the other possible reasons? Does anyone know about that? @jcsahnwaldt or @jimkont?

VladimirAlexiev commented 9 years ago

These are all good questions. But I want to point out that whatever the logic of the parser, there's no need to involve two props. Rather than two maps to xxxDate and xxxYear, you can have the same two maps targeting just one field xxx.

My logic is that a field name should reflect some meaning, NOT the precision of the data. It's the job of the datatype (xsd:date vs xsd:gYear) to reflect the precision. Having two fields just complicates queries and IMHO does not serve a useful purpose.

Nono314 commented 9 years ago

@VladimirAlexiev As said on the other thread I agree with you.

But you can't just ignore what the extraction framework actually does. What I'm advocating is that instead of twisting the ontology to fit with the extraction capabilities, we should rather improve the framework to enable the ontology we want.

With the current code, you can't just do what you say. You wouldn't have two maps targeting one field, it would just be the same map twice... And a year value would never be extracted into an xsd:date property.

Actually there's a simpler solution than what I proposed above. We could easily change SimplePropertyMapping so that it would set up the DateTimeParser not just based on the property type, but also using the mapping's unit parameter, as it does with the UnitValueParser. That would allow for your solution with two mappings against one single property.

I really think we should also get rid of the double mappings by fixing the parser itself, but at least we would have an easy solution that would benefit everyone querying DBpedia.
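A rough Python sketch of the unit-based idea, purely illustrative: `effective_datatype`, the dictionary layout, and the property names are hypothetical and do not reflect how SimplePropertyMapping is actually written.

```python
from typing import Optional


def effective_datatype(property_range: str, mapping_unit: Optional[str]) -> str:
    """Pick the datatype the date parser is configured with.

    Assumption: a unit set on the mapping overrides the ontology
    property's declared range; otherwise the range is used.
    """
    return mapping_unit if mapping_unit else property_range


# Two mappings of the same template property to ONE ontology property,
# differing only in the (hypothetical) unit parameter.
mappings = [
    {"template": "birth_date", "ontology": "birthDate", "unit": None},
    {"template": "birth_date", "ontology": "birthDate", "unit": "xsd:gYear"},
]

if __name__ == "__main__":
    for m in mappings:
        print(m["ontology"], "->", effective_datatype("xsd:date", m["unit"]))
```

The first mapping would parse with the property's own xsd:date range and the second with xsd:gYear, so both precisions end up on the same property.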

VladimirAlexiev commented 9 years ago

I was assuming that unit can be used with dates :-) I guess I never wrote a date mapping myself, always just copied someone else's.

If the date parser can be fixed to "extract full date if possible, otherwise month-year if possible, otherwise year", that's clearly the best solution, because with Wikipedia you can hardly ever be sure what precision a field will have. Even if all occurrences today are years, tomorrow a new occurrence can be a full date.
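The "most precise first" rule quoted above could look roughly like this in Python. This is only a sketch of the idea, not the DateTimeParser: the name `parse_best_precision`, the patterns, and the ISO-like inputs are assumptions.

```python
import re
from typing import Optional, Tuple

# Ordered from most to least precise; the first pattern that matches wins.
PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "xsd:date"),
    (re.compile(r"\d{4}-\d{2}"), "xsd:gYearMonth"),
    (re.compile(r"\d{4}"), "xsd:gYear"),
]


def parse_best_precision(value: str) -> Optional[Tuple[str, str]]:
    """Return (lexical value, datatype) at the highest available precision."""
    text = value.strip()
    for pattern, datatype in PATTERNS:
        if pattern.fullmatch(text):
            return (text, datatype)
    return None


if __name__ == "__main__":
    for raw in ["1838-05-17", "1838-05", "1838", "unknown"]:
        print(raw, "->", parse_best_precision(raw))
```

With this approach a single property receives whatever precision the article provides, and the datatype of each value records that precision.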

jimkont commented 9 years ago

An easy way to fix this is to create something like a FirstSuccessParser(parsers : List[Parser]) class that tries all parsers in order and stops at the first value it finds. In this case we could pass a DateTimeParser with xsd:date and another DateTimeParser with xsd:gYear. If the first fails to find a date, the second one will be used. Comments?

jimkont commented 9 years ago

Actually, the accurate class signature should be FirstSuccessParser(parsers : List[Parser]) extends DataParser. @Nono314, interested in a PR? Otherwise I can take a shot at this.
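For illustration, the fallback idea can be sketched in Python as below. The class name comes from the proposal above, but everything else is an assumption: parsers are modelled as plain functions returning `None` on failure, and `date_parser` and `year_parser` are toy stand-ins for the two DateTimeParser instances, not the framework's DataParser API.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

# Assumption: a parser is any callable returning a result or None.
ParseResult = Optional[Tuple[str, str]]
Parser = Callable[[str], ParseResult]


class FirstSuccessParser:
    """Try each parser in order and return the first non-None result."""

    def __init__(self, parsers: Sequence[Parser]) -> None:
        self.parsers = list(parsers)

    def parse(self, value: str) -> ParseResult:
        for parser in self.parsers:
            result = parser(value)
            if result is not None:
                return result
        return None


def date_parser(value: str) -> ParseResult:
    """Toy stand-in for a DateTimeParser configured with xsd:date."""
    text = value.strip()
    return (text, "xsd:date") if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) else None


def year_parser(value: str) -> ParseResult:
    """Toy stand-in for a DateTimeParser configured with xsd:gYear."""
    text = value.strip()
    return (text, "xsd:gYear") if re.fullmatch(r"\d{4}", text) else None


fallback = FirstSuccessParser([date_parser, year_parser])

if __name__ == "__main__":
    print(fallback.parse("1754-02-02"))
    print(fallback.parse("1754"))
    print(fallback.parse("n/a"))
```

Because the wrapper knows nothing about dates, the same fallback mechanism could wrap any ordered list of parsers.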

Nono314 commented 9 years ago

I have three pending PRs and a major refactoring of UnitValueParser ongoing that I'd like to complete for this release. I think it's enough for now... and it's your idea, so just proceed :)

I would probably have done it right in the parser itself, but having a generic fallback mechanism sounds interesting. I'm pretty sure there would be other use cases.

jimkont commented 9 years ago

Any example article that I can test?

Nono314 commented 9 years ago

Maybe Charles-Maurice de Talleyrand-Périgord?

It has a wide range of PersonFunctions with starts and ends covering xsd:date, gYearMonth, gMonthDay and gYear. It also uses DateTimeParser both directly (two properties) and through DateIntervalParser (both ends in the same property).

The related mappings are two infoboxes (Politicien and Prélat catholique) and an additional template (Succession/Ligne, a recent redirect).