dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

removing parasitic prefix/suffix from raw props #314

Open VladimirAlexiev opened 9 years ago

VladimirAlexiev commented 9 years ago

Numbered raw props are collapsed to one prop:

http://fr.wikipedia.org/w/index.php?title=Antioche&action=edit:

 | division                 = [[Région méditerranéenne]]
 | nom de division          = [[Régions de Turquie|Région]]
 | division2                = [[Hatay]]
 | nom de division2         = [[Provinces de Turquie|Province]]
 | division3                = [[Région méditerranéenne]]
 | nom de division3         = [[Districts de Turquie|District]]

Results in this on fr.dbpedia.org:

http://fr.dbpedia.org/property/nomDeDivision    http://fr.dbpedia.org/resource/Provinces_de_Turquie
http://fr.dbpedia.org/property/nomDeDivision    http://fr.dbpedia.org/resource/Régions_de_Turquie
http://fr.dbpedia.org/property/nomDeDivision    http://fr.dbpedia.org/resource/Districts_de_Turquie

All three "nom de divisionN" are mapped to the same nomDeDivision.

jimkont commented 9 years ago

This is the default behavior to avoid multiple property definitions. Here is the code that cleans the URIs: https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala#L278-290
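A minimal sketch of the collapsing effect described above. It is not the code from InfoboxExtractor.scala: the function name `raw_property_name` and the exact rules (drop trailing digits, then camelCase on spaces and underscores) are assumptions chosen only to reproduce the output quoted in this issue.

```python
import re


def raw_property_name(template_key: str) -> str:
    """Hypothetical normalisation: drop trailing digits, then camelCase."""
    # Remove a purely numeric suffix such as the "2" in "nom de division2".
    stripped = re.sub(r"[0-9]+$", "", template_key.strip())
    # Split on whitespace and underscores, ignoring empty fragments.
    words = [w for w in re.split(r"[\s_]+", stripped) if w]
    if not words:
        return ""
    # Keep the first word as-is and capitalise the first letter of the rest.
    return words[0] + "".join(w[:1].upper() + w[1:] for w in words[1:])


keys = ["nom de division", "nom de division2", "nom de division3"]
collapsed = {raw_property_name(k) for k in keys}
print(collapsed)  # all three template keys end up under one property name
```

Under these assumed rules the three distinct template keys from the Antioche infobox become indistinguishable, which is the behavior this issue is about.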

VladimirAlexiev commented 9 years ago

But they have different meanings. Similarly numbered props go in groups, so they should go into different IntermediateNodeMappings. If you collapse the props by wiping the numbers, won't such distinctions be lost?

Politician templates are the worst, eg see http://mappings.dbpedia.org/index.php?title=Mapping_bg:Държавник_инфо&action=edit.

You have a different prop group for every nested position>mandate (eg 3*5=15), and the grouping is both by prop prefix and suffix, and it's not consistent.

Eg there are 10 props "предшестван", all mapped to "predecessor" but in different groups:

 предшестван от
 предшестван от2
 предшестван от3
 втори_мандат_предшестван от
 втори_мандат_предшестван от2
 втори_мандат_предшестван от3
 трети_мандат_предшестван от
  ...

Which of them do you collapse, which don't you, and why in each case?

VladimirAlexiev commented 9 years ago

@jimkont Please reopen for further investigation

jimkont commented 9 years ago

This does not affect the mappings extractor & intermediate nodes, only the raw infobox extractor data. In some cases it might make sense to keep the numbers but there are also many where it does not. I don't have an example handy but last time I checked (2-3 years ago) there were quite a few.

So I am reopening and suggest we handle this like the mappings case and examine the diffs from both options.

VladimirAlexiev commented 9 years ago

So when you write templateProperty=x, that may not really be dbprop:x but a modification thereof? I agree with these modifications: since the semantics of the raw props' "parasitic" prefixes and suffixes are not transmitted clearly, it's better to transmit them all as the "root" of the prop name.

Oh this needs to be documented in a whole chapter... my head hurts.

Two questions about the modification:

jcsahnwaldt commented 9 years ago

There's a misunderstanding...

This issue only affects properties in the http://dbpedia.org/property/ namespace, produced by InfoboxExtractor.

The mappings produce properties in the http://dbpedia.org/ontology/ namespace. See MappingExtractor. Completely different code.

When you write templateProperty=x, you get exactly dbo:x. No modification. You never get dbprop:x - that's a completely different namespace.

Hope that clears things up. :-)

jcsahnwaldt commented 9 years ago

http://wiki.dbpedia.org/Downloads2014#mapping-based-properties

http://wiki.dbpedia.org/Downloads2014#raw-infobox-properties

http://wiki.dbpedia.org/Datasets#h434-10

jcsahnwaldt commented 9 years ago

P.S.: "When you write templateProperty=x, you get exactly dbo:x. No modification." - I think that's correct, but I'm not 100% sure, e.g. upper/lower case. Would have to check the code.

jcsahnwaldt commented 9 years ago

More precisely: You won't get dbo:x. When you write templateProperty=x in a mapping, it matches exactly x in the Wikitext source, not x2 or anything else (check the code of SimplePropertyMapping.scala for details). Of course, you also specify ontologyProperty=y in the mapping, so you will get http://dbpedia.org/ontology/y, which is sometimes abbreviated as dbo:y.
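To illustrate the exact-match behavior described here, a small sketch follows. It is not the logic of SimplePropertyMapping.scala; the dictionary `mappings` and the function `mapped_triples` are hypothetical names, and the only point shown is that a mapping for x does not fire for x2.

```python
# Hypothetical mapping table: templateProperty -> ontologyProperty.
mappings = {"предшестван от": "predecessor"}

ONTOLOGY_NS = "http://dbpedia.org/ontology/"


def mapped_triples(infobox: dict, mappings: dict) -> list:
    """Return (property URI, value) pairs for exactly matching template keys."""
    triples = []
    for template_property, ontology_property in mappings.items():
        # Exact key lookup: no suffix stripping, no prefix stripping.
        if template_property in infobox:
            triples.append((ONTOLOGY_NS + ontology_property,
                            infobox[template_property]))
    return triples


infobox = {"предшестван от": "A", "предшестван от2": "B"}
result = mapped_triples(infobox, mappings)
print(result)  # only the value of the un-numbered key is extracted
```

To extract the value of "предшестван от2" as well, a second mapping with that exact templateProperty would be needed.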

jcsahnwaldt commented 9 years ago

I think we can close this issue. It only affects the raw Infobox properties, which is basically a legacy dataset. In the last few years, we haven't put much work into it, and for good reason - the DBpedia wiki strongly recommends using the mapping based properties. They are much better.

VladimirAlexiev commented 9 years ago

I think you misunderstood most of what I wrote above (not your brilliant self today ;-)). I grok that when I write in an IntermediateNodeMapping:

{{ PropertyMapping | templateProperty = втори_мандат_предшестван от3 | ontologyProperty = predecessor }}

It generates

bgdbr:Тодор_Живков bgdbp:вториМандатПредшестванОт <pred>. # raw: bgdbp: and dropped suffix
bgdbr:Тодор_Живков__1 dbo:predecessor <pred>. # mapped: dbo: and in IntermediateNode

BUT I'm pleading it should generate

bgdbr:Тодор_Живков bgdbp:предшестванОт <pred>.
bgdbr:Тодор_Живков__1 dbo:predecessor <pred>.

because the prefix втори_мандат_ is just as parasitic as the numeric suffix.

And also: the parasitic numeric-alphabetic suffix of предшестван от3a should also be dropped.

Documented at http://mappings.dbpedia.org/index.php/Rewriting_templateProperty. @jcsahnwaldt could you please take a look and see if it's accurate?

Don't throw away the raw props! They're there even if there are no mappings, or the mappings are wrong (alas, they are often wrong, even in big dbpedias like fr). So there are many real-world queries that mix raw and mapped props.

jimkont commented 9 years ago

Vladimir, please do not use the custom extractor when testing: it produces data from all available extractors (even the infobox extractor). Use the default option, which extracts only labels and mappings (let me know if this is not the case).

jcsahnwaldt commented 9 years ago

@VladimirAlexiev I checked out http://mappings.dbpedia.org/index.php/Rewriting_templateProperty . Nice page! But it's largely... how do I say it nicely... well, it's just wrong. Now I know where this misunderstanding is coming from, and why you dared question my authority. ;-) I added these lines to the page:

Here's what actually happens:

Here's what the InfoboxExtractor does:

Here's what the MappingExtractor does:

jcsahnwaldt commented 9 years ago

In other words:

The InfoboxExtractor doesn't care about the mappings at all, processes all named properties and generates

bgdbr:Тодор_Живков bgdbp:вториМандатПредшестванОт <pred>. # raw: bgdbp: and dropped suffix

The MappingExtractor doesn't care at all about the property names produced by InfoboxExtractor, only extracts properties for which a mapping exists and generates

bgdbr:Тодор_Живков__1 dbo:predecessor <pred>. # mapped: dbo: and in IntermediateNode

The two are completely independent. (Well, they both process the AST produced by the Wikitext parser, but that's it.)

jcsahnwaldt commented 9 years ago

I hope that clears up things. This issue has nothing to do with mappings.

But you raised a few good questions about the InfoboxExtractor:

That's correct. It also does a few more things. See InfoboxExtractor.getPropertyUri for details.

Probably not. I think there are some properties that contain a digit somewhere in the middle of their name. Something more specific would be better.

Sounds good! Some config values in the class InfoboxExtractorConfig are already language specific. Might be relatively easy to add a few more such configuration values and use them in InfoboxExtractor.getPropertyUri.
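A rough sketch of what language-specific configuration for this could look like. Everything here is an assumption for illustration: `PARASITIC_PREFIXES`, `SUFFIX_PATTERNS` and `root_property` are hypothetical names, not fields or methods of InfoboxExtractorConfig or InfoboxExtractor.getPropertyUri.

```python
import re

# Hypothetical per-language settings (not part of the real InfoboxExtractorConfig).
PARASITIC_PREFIXES = {
    "bg": ["втори_мандат_", "трети_мандат_"],
}
SUFFIX_PATTERNS = {
    # 1-2 digits, optionally followed by one ASCII letter, at the end of the name.
    "bg": re.compile(r"(?<![0-9])[0-9]{1,2}[a-z]?$"),
    "fr": re.compile(r"(?<![0-9])[0-9]{1,2}$"),
}


def root_property(key: str, lang: str) -> str:
    """Strip a configured prefix and suffix, leaving the 'root' prop name."""
    for prefix in PARASITIC_PREFIXES.get(lang, []):
        if key.startswith(prefix):
            key = key[len(prefix):]
            break  # strip at most one prefix
    pattern = SUFFIX_PATTERNS.get(lang)
    if pattern is not None:
        key = pattern.sub("", key)
    return key.strip()


print(root_property("втори_мандат_предшестван от3", "bg"))
print(root_property("nom de division2", "fr"))
```

Languages without an entry are left untouched, so the stripping stays opt-in per language. A suffix rule this broad would still hit names that legitimately end in a digit, which is why something more specific per template may be needed.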

VladimirAlexiev commented 9 years ago

@jcsahnwaldt Thanks for the edits! I'll fix up that page. @jimkont "please do not use the custom extractor when testing, this produces data from all available extractors (even infobox extractor)": is there any harm in that?

jimkont commented 9 years ago

@jimkont "please do not use the custom extractor when testing, this produces data from all available extractors (even infobox extractor)": is there any harm in that?

Yes: there is a limit on the extraction samples, and for big articles you might not get all the expected triples, as with the Elvis Presley link you posted on the mailing lists. So if you want to test the mappings, use only the default extractor; if you want to see what else DBpedia would produce, use the custom one, but beware that it might not be complete.

VladimirAlexiev commented 9 years ago

Added your warning to http://mappings.dbpedia.org/index.php/Main_Page#Custom_or_Default_Extractor.

Now back to the topic: based on https://github.com/dbpedia/mappings-tracker/issues/51, I suggest also removing suffixes that are 1-2 digits followed by a single letter, i.e. match this: [^0-9][0-9]{1,2}[:alpha:]$

Nono314 commented 9 years ago

I don't think that's what you want: it would also remove the character before the digits. Are you looking for a negative look-behind instead? I.e. (?<![0-9])[0-9]{1,2}[a-z]?$
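The difference between the two patterns can be checked with a short sketch. Python's `re` module has no POSIX classes, so `[a-z]` stands in for the intended letter class in the first pattern; that substitution is an assumption, and the variable names are only for illustration.

```python
import re

# First proposal: a non-digit, 1-2 digits, one letter. The leading [^0-9]
# is part of the match, so substitution removes that character too.
consuming = re.compile(r"[^0-9][0-9]{1,2}[a-z]$")

# Look-behind variant: asserts "no digit before" without consuming anything,
# and makes the trailing letter optional.
lookbehind = re.compile(r"(?<![0-9])[0-9]{1,2}[a-z]?$")

name = "предшестван от3a"
eaten = consuming.sub("", name)    # loses the last letter of the root name
clean = lookbehind.sub("", name)   # keeps the root name intact
plain = lookbehind.sub("", "предшестван от3")

print(repr(eaten), repr(clean), repr(plain))
```

The look-behind version also handles the plain numeric suffix, so a single pattern covers both "от3" and "от3a".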

Anyway, I have always thought that the issue with "raw" props is more on the value side, since they're actually anything but raw. This just leads to newcomers seeing bugs everywhere just as in https://github.com/dbpedia/extraction-framework/issues/317. The problem is, throwing parsers blindly (as opposed to selecting the parser based on the ontology property type) at a prop does not always yield something meaningful.

That's why, when preparing for mapping, they may only give a shallow hint at the real template property content. And when trying to fix parser bugs, they are often of very little help...

VladimirAlexiev commented 9 years ago

Right about the regex. Did you write that guide? I think it's excellent; I added a bit and linked it to other pages I wrote. We need more best practices on specific topics (eg mapping Place Relations or Dimensions)