dbpedia / mappings-tracker

This project is used for tracking mapping issues in mappings.dbpedia.org

investigate raw props with digit #51

Closed VladimirAlexiev closed 9 years ago

VladimirAlexiev commented 9 years ago

https://github.com/dbpedia/extraction-framework/issues/314 proposes collapsing a raw prop name "propNblah", where N is a digit, to "prop", just as is already done for "propN" (i.e. treating "Nblah" as a parasitic suffix, the same way "N" is treated). @jcsahnwaldt objects: "I think there are some properties that contain a digit somewhere in the middle of their name".
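To make the two rules concrete, here is a minimal Python sketch. It is only a simplified reading of the description above, not the extraction framework's actual code: the function names `collapse_current` and `collapse_proposed` are hypothetical, and I assume "N" may be a run of digits that follows a letter.

```python
import re

def collapse_current(name: str) -> str:
    """Assumed existing rule: strip a trailing digit run, 'propN' -> 'prop'."""
    return re.sub(r"(?<=[A-Za-z])\d+$", "", name)

def collapse_proposed(name: str) -> str:
    """Assumed proposed rule: strip the first digit run that follows a letter,
    plus everything after it, 'propNblah' -> 'prop'."""
    return re.sub(r"(?<=[A-Za-z])\d.*$", "", name)
```

Under these assumptions the proposed rule maps both `poleRider125Bike` and `poleRider250Bike` to `poleRider`, which is the kind of loss the objection is about.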

So we need to investigate this. Look at http://wiki.dbpedia.org/Downloads2014, section "Raw Infobox Property Definitions". For example, this lists every property whose name contains a digit:

cat infobox_property_definitions_en.ttl|cut -d " " -f1 |sort|uniq|perl -ne "print if m{property/\S*\d}" > infobox-withSuffix-en.txt
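For anyone without perl at hand, a rough Python equivalent of that pipeline could look like the sketch below. The function name `props_with_digit` is hypothetical, and the ordering is by code point, which may differ from what `sort` gives under some locales.

```python
import re

# Same pattern as the perl filter: a digit somewhere after "property/".
HAS_DIGIT = re.compile(r"property/\S*\d")

def props_with_digit(lines):
    """Take the first space-separated field of each line (like cut -f1),
    de-duplicate (like sort | uniq) and keep fields that contain a digit."""
    subjects = {line.rstrip("\n").split(" ", 1)[0] for line in lines}
    return sorted(s for s in subjects if HAS_DIGIT.search(s))
```

It accepts any iterable of lines, so an open file handle for `infobox_property_definitions_en.ttl` can be passed in directly.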

There certainly are some curiosities, eg

!doNotRemoveThisMessageAndAddMapUntil18September%3Cmap
http//hosted.ap.org/dynamic/stories/o/obitWeizsaecker%3Fsite
pastMembersRayGunn19811983,Recorded10OfMyOriginalsPlusOurVersionOfDancingInTheStreets,UnfortunatelyThisNeverGotPastTheRoughMixStage,AsTheSameWithMyEffortsInTheNewOrder
pop'''leek'''ionBlank1Title
pop13.

Or this, which quizzically just about makes some sense :-)

howLargeIsLarge%3FCm

On the other hand, there are some legit cases. Eg these list rider attributes for 2 classes of motorbikes:

<http://dbpedia.org/property/poleRider125Bike>
<http://dbpedia.org/property/poleRider125Country>
<http://dbpedia.org/property/poleRider125CountryFlagSuffix>
<http://dbpedia.org/property/poleRider250Bike>
<http://dbpedia.org/property/poleRider250Country>
<http://dbpedia.org/property/poleRider250CountryFlagSuffix>

So I think we should refine the "parasitic suffix" rule like this: "one or two digits followed by a single letter" at the end of the name.

cat infobox_property_definitions_en.ttl|cut -d " " -f1 |sort|uniq|perl -ne "print if m{[a-z]\d\d?[a-z]>}" > infobox-withSuffix-en.txt

Looking at the result, a lot are good candidates for collapsing. But there are also imaginatively named props like this (who does that?):

successo2r
termEn2d
termStar1t
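As a sketch of what the refined rule would do to names quoted in this comment, the snippet below applies the same pattern as the perl filter to full property URIs. The name `strip_suffix` is hypothetical, and stripping the matched suffix is my assumption about how the collapse would be carried out.

```python
import re

# One or two digits plus a single lowercase letter, right before the closing ">".
# The lookbehind requires a lowercase letter before the digits, as in the perl regex.
SUFFIX = re.compile(r"(?<=[a-z])\d\d?[a-z]>")

def strip_suffix(uri: str) -> str:
    """Drop a 'digits + single letter' suffix from a <...> property URI."""
    return SUFFIX.sub(">", uri)
```

With this, `successo2r` would collapse to `successo` and `termEn2d` to `termEn`, while `poleRider125Bike` is left alone because its digits are followed by more than one letter.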

What do you think?

jimkont commented 9 years ago

This is not relevant to the mappings tracker / mappings wiki; it should go to the main repo.

VladimirAlexiev commented 9 years ago

Sorry, I get confused about what to put where, and I don't know how to move issues between projects. Luckily this tracker links them well.