dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Very long property names in infobox-properties dataset #597

Open LorenzBuehmann opened 5 years ago

LorenzBuehmann commented 5 years ago

Hi, I'm not sure if this is intended, but some properties in the infobox-properties dataset are quite long. And by long I mean very long ...

Dataset: http://dbpedia-generic.tib.eu/release/generic/infobox-properties/2019.10.01/infobox-properties_lang=en.ttl.bz2

bzcat infobox-properties_lang=en.ttl.bz2 | awk -F " " '{ print $2 }' | sort -u | awk '{ print length, $0 }' | sort -n | cut -d" " -f2- > preds_sorted.txt

The longest properties, shown with tail preds_sorted.txt, are:

<http://dbpedia.org/property/fernandoCoronilAndIStudiedInTheSameElementarySchoolInCaracas,Venezuela.ThisWas%22colegioAmérica%22InTheSectionSanBernardinoInCaracas.ThisSchoolDoesn'tExistAnymoreSinceSeveralDecades>
<http://dbpedia.org/property/theTrueSelfIsItselfJustThatPureConsciousness,WithoutWhichNothingCanBeKnownInAnyWay.(...)AndThatSameTrueSelf,PureConsciousness,IsNotDifferentFromTheUltimateWorldPrinciple,Brahman&nbsp;(...)Brahman(%3Cnowiki%3E_>
<http://dbpedia.org/property/%22Cis2%5Ctimes2/3%7BB8(AGis)%7DCis(E)Cis4%5Ctimes2/3%7BC8(DA)%7D%5Ctimes2/3%7BB(Cis%3FGis)%7D%5Ctimes2/3%7BA%5Cdim(Bis%5C!Dis%3F%7D%5Ctimes2/3%7BEisFisA)%7DGis4(Fis)%7D%3C/score%3E;excerpt11(violin)%3CscoreVorbis>
<http://dbpedia.org/property/''borderBreak''*FiscalYearEnded31March2010¥3.3&nbsp;billion*FiscalYearEnded31March2011¥2.5&nbsp;billion*FiscalYearEnded31March2012¥2.3&nbsp;billion*1stQuarterEnded30June2012¥0.5&nbsp;billion*CurrencyConversion**¥3.3Billion>
<http://dbpedia.org/property/''worldClubChampionFootballIntercontinentalClubs''*FiscalYearEnded31March2010¥4.2Billion*FiscalYearEnded31March2011¥3.8&nbsp;billion*FiscalYearEnded31March2012¥3.6&nbsp;billion*1stQuarterEnded30June2012¥0.5&nbsp;billion*CurrencyConversion**¥4.2Billion>
<http://dbpedia.org/property/vagueUseOfTermsLeadsToMistakes.TheTypeSite,Tul.Gh.,ShowsContinuityBetweenItsOwnLateNeolithicAndEarlyChalcolithicPhases.ThisDoesNotMeanThatThePhase/culture%22ghassulian%22,NamedAfterTheSite,IsIdenticalWithTheEntiretyOfTheLevantineChalcolithic,SoItsDatesShouldBeBasedOnAllGhassulianSites,NotJustT.Gh.ResultAStartingDateOf%22mid5m%22_>
<http://dbpedia.org/property/gomez&Silk%22thisSamadhiIsAtTheSameTimeTheCognitiveExperienceOfEmptiness,TheAttainmentOfTheAttributesOfBuddhahood,AndThePerformanceOfAVarietyOfPracticesOrDailyActivitiesOfABodhisattva—includingServiceAndAdorationAtTheFeetOfAllBuddhas.TheWordSamadhiIsAlsoUsedToMeanTheSūtraItself.Consequently,WeCanSpeakOfAnEquation,Sūtra%3Cnowiki%3E_>
<http://dbpedia.org/property/*''sukherAsukh''(2008)*''samudrajol''(2009)*''karoKonoNeetiNai''(2009)*''premomoyMriyoman''(2010)*''maanabJamin''(2010)*''achenaManush''(2010)*''sabujNakshotro''(2010)*''rumali''(2011)*''rongBerong''(2011)*''noProblem''(2011)*''dulchhePendulum''(2011)*''aamarBariTomarBar''(2011)*''ekPoloke''(2012),Ridom*''swapnoguloIchchemoto''(2012)*''phul+Pori%3Cnowiki%3E_>
<http://dbpedia.org/property/sparham%22tsongkhapaDoesNotAcceptSvātantra(“autonomous”)Reasoning(theFourthPoint).HeAssertsThatItIsEnough,WhenProvingThatAnyGivenSubjectIsEmptyOfIntrinsicExistence,ToLeadTheInterlocutor,ThroughReasoning,ToTheUnwelcomeConsequences(prasaṅga)InTheirOwnUntenablePosition;ItIsNotNecessaryToDemonstrateTheThesisBasedOnReasoningThatPresupposesAnySortOfIntrinsic(%3Cnowiki%3E_>
<http://dbpedia.org/property/nevertheless,AccordingToBasuEtAl.(2016),TheAaaWereEarlySettlersInIndia,RelatedToTheAsi%22theAbsenceOfSignificantResemblanceWithAnyOfTheNeighboringPopulationsIsIndicativeOfTheAsiAndTheAaaBeingEarlySettlersInIndia,PossiblyArrivingOnThe“southernExit”WaveOutOfAfrica.DifferentiationBetweenTheAsiAndTheAaaPossiblyTookPlaceAfterTheirArrivalInIndia(admixtureAnalysisWithK%3Cnowiki%3E_>
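For readers who prefer not to use the shell, the pipeline above can be roughly sketched in Python. This is only an illustration, assuming one triple per line with the predicate as the second whitespace-separated field (as the awk command does). The function names `predicates`, `longest_predicates` and `longest_in_file` are made up for this sketch and are not part of the extraction framework.

```python
import bz2
from typing import Iterable, Iterator, List


def predicates(lines: Iterable[str]) -> Iterator[str]:
    """Yield the second whitespace-separated field of each triple line."""
    for line in lines:
        # Skip comment lines; blank or malformed lines fall out below.
        if line.lstrip().startswith("#"):
            continue
        parts = line.split(None, 2)
        if len(parts) >= 2:
            yield parts[1]


def longest_predicates(lines: Iterable[str], n: int = 10) -> List[str]:
    """Return the n longest distinct predicates, shortest first (like `tail`)."""
    unique = set(predicates(lines))
    # Sort by length, then alphabetically so the order is deterministic.
    return sorted(unique, key=lambda p: (len(p), p))[-n:]


def longest_in_file(path: str, n: int = 10) -> List[str]:
    """Convenience wrapper that streams a bz2-compressed dump from disk."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        return longest_predicates(fh, n)
```

Calling `longest_in_file("infobox-properties_lang=en.ttl.bz2")` on the downloaded file should list the same kind of predicates as above. Note that the set of distinct predicates is held in memory, whereas `sort -u` can spill to disk.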

I also tried with the latest (cleaned?) dataset available from the DBpedia account: https://databus.dbpedia.org/dbpedia/generic/infobox-properties/2019.08.30 The result is the same.

So, is this intended?

kurzum commented 5 years ago

@LorenzBuehmann I'm not sure about this bug. In general, it is too much work to fix everything in generic; that is why there are mappings. I'm not sure we would prioritise fixing this, because in principle all kinds of anomalies can produce junk like this.

How prevalent is this?
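One way to put a number on prevalence is to measure what share of distinct predicates, and what share of triples, have an overly long local name. The sketch below is an assumption-laden illustration: the threshold of 100 characters is arbitrary, and `local_name` and `prevalence` are hypothetical helper names. It expects an iterable with one predicate per triple, for example the second field of each line of the dump.

```python
from collections import Counter
from typing import Iterable, Tuple

# Namespace of the infobox properties shown in this issue.
PREFIX = "<http://dbpedia.org/property/"


def local_name(predicate: str) -> str:
    """Strip the namespace and angle brackets; return the input unchanged otherwise."""
    if predicate.startswith(PREFIX) and predicate.endswith(">"):
        return predicate[len(PREFIX):-1]
    return predicate


def prevalence(preds: Iterable[str], max_len: int = 100) -> Tuple[float, float]:
    """Return (share of distinct predicates, share of triples) above max_len.

    max_len is an arbitrary threshold chosen for illustration only.
    """
    counts = Counter(preds)  # predicate -> number of triples using it
    if not counts:
        return (0.0, 0.0)
    long_preds = [p for p in counts if len(local_name(p)) > max_len]
    share_of_predicates = len(long_preds) / len(counts)
    share_of_triples = sum(counts[p] for p in long_preds) / sum(counts.values())
    return (share_of_predicates, share_of_triples)
```

If the junk predicates are mostly one-off values, the first number may be noticeable while the second stays tiny, which would support treating this as a minor issue.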

LorenzBuehmann commented 5 years ago

Not really important for me. I just came across this while vertically partitioning the triples by predicate into Apache Parquet files on HDFS, which has a default maximum file name length of 255. I was just surprised by the error because I never expected such a long property URI.
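For anyone hitting the same limit when partitioning by predicate, one possible workaround is to derive a bounded file name from the URI instead of using it directly. The following is a minimal sketch, not what was used here: `safe_file_name` is a hypothetical helper, the `.parquet` suffix and the 16-character hash length are arbitrary choices, and the 255 limit is simply the value mentioned above.

```python
import hashlib
from urllib.parse import quote

MAX_NAME_BYTES = 255  # limit mentioned in this thread


def safe_file_name(predicate_uri: str, suffix: str = ".parquet",
                   max_bytes: int = MAX_NAME_BYTES) -> str:
    """Map a predicate URI to an ASCII file name of at most max_bytes bytes."""
    # Percent-encode everything except unreserved characters, so the result
    # is pure ASCII and contains no '/' path separators.
    encoded = quote(predicate_uri, safe="")
    name = encoded + suffix
    if len(name) <= max_bytes:
        return name
    # Too long: keep a readable prefix and append a short hash of the full
    # URI so that distinct predicates still get distinct names.
    digest = hashlib.sha256(predicate_uri.encode("utf-8")).hexdigest()[:16]
    keep = max_bytes - len(suffix) - len(digest) - 1  # 1 for the '-' separator
    return f"{encoded[:keep]}-{digest}{suffix}"
```

Truncated names cannot be decoded back into the original URI (the cut may even fall inside a percent escape), so a separate lookup table from file name to predicate would be needed in that case.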

So you can mark it as "minor bug" or even "won't fix". But it would at least be worth tracking such cases, since I'm not sure what others might do with property URIs in general.