dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Some dbo:abstract / rdfs:comment have coordinates in front #392

Closed by VladimirAlexiev 8 years ago

VladimirAlexiev commented 9 years ago

The dbo:abstract / rdfs:comment of some cities, e.g. http://bg.dbpedia.org/resource/bgdbr/Одеса (Odessa) or http://bg.dbpedia.org/resource/bgdbr/Лондон (London), starts with the coordinates repeated twice, e.g.: 46.466667, 30.73333346.466667, 30.733333Одеса (на украински Одеса, на руски Одесса, на румънски Odesa) е черноморски град в Украйна... (in English: "Odessa (in Ukrainian Одеса, in Russian Одесса, in Romanian Odesa) is a Black Sea city in Ukraine...")

As you can see, the source of that page starts with an infobox that has these coordinates (but broken down into degrees, minutes and seconds): https://bg.wikipedia.org/w/index.php?title=Одеса&action=edit&oldid=6328404
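To relate a degrees/minutes/seconds infobox value to the decimal numbers that show up in the abstract, the usual conversion is degrees + minutes/60 + seconds/3600. A minimal sketch follows; the function name and the sample values are illustrative assumptions and are not copied from the infobox itself.

```python
def dms_to_decimal(degrees, minutes, seconds, negative=False):
    """Convert degrees/minutes/seconds to decimal degrees.

    `negative` marks southern latitudes or western longitudes.
    """
    value = degrees + minutes / 60 + seconds / 3600
    return -value if negative else value


# Illustrative inputs only (assumed, not taken from the page source).
print(round(dms_to_decimal(46, 28, 0), 6))  # 46.466667
print(round(dms_to_decimal(30, 44, 0), 6))  # 30.733333
```

These sample inputs happen to reproduce the decimal pair seen at the start of the abstract, which is consistent with the coordinates being rendered from the infobox.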

But how can the coordinates end up in the abstract?
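One way to find other affected abstracts is to test whether the text begins with one or more decimal coordinate pairs. This is a rough sketch, assuming abstracts are available as plain strings; the pattern and the function name `strip_leading_coords` are hypothetical and based only on the example above.

```python
import re
from typing import Tuple

# One or more "lat, lon" decimal pairs at the very start of the string.
# The trailing lookahead forces the match to cover the whole numeric prefix,
# even when two pairs are glued together as in "30.73333346.466667".
LEADING_COORDS = re.compile(
    r"^(?:\s*-?\d{1,3}\.\d+,\s*-?\d{1,3}\.\d+)+(?![\d.])"
)


def strip_leading_coords(abstract: str) -> Tuple[str, bool]:
    """Return (cleaned_text, had_coords) for a single abstract."""
    match = LEADING_COORDS.match(abstract)
    if match is None:
        return abstract, False
    return abstract[match.end():].lstrip(), True


sample = "46.466667, 30.73333346.466667, 30.733333Одеса (на украински Одеса)"
text, had_coords = strip_leading_coords(sample)
print(had_coords, text)
```

Because the two pairs run together without a separator, the sketch does not try to recover the individual numbers; it only detects and removes the prefix.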

VladimirAlexiev commented 9 years ago

София (Sofia) has these coordinate parameters, and its abstract doesn't have this bug:

| сев-ширина        = 42.697556
| изт-дължина       = 23.323638 

Варшава (Warsaw) has these coordinate parameters, and its abstract has this bug:

| гео-ширина       = 52.217
| гео-дължина      = 21.033

London uses the same parameters and also has this bug:

| гео-ширина               = 51.507
| гео-дължина              = -0.128
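The pattern in these three pages (сев-ширина / изт-дължина without the bug, гео-ширина / гео-дължина with it) could be checked on more pages by classifying each page's wikitext by the parameter names it uses. A small sketch, assuming the raw wikitext is already available as a string; the style labels and the function name are hypothetical, and three pages are too few to treat the correlation as established.

```python
import re

# Parameter names are taken from the infobox excerpts in this thread;
# the labels "north/east" and "geo" are made up for this sketch.
PARAM_STYLES = {
    "north/east": ("сев-ширина", "изт-дължина"),
    "geo": ("гео-ширина", "гео-дължина"),
}

# A template parameter line of the form "| name = value".
PARAM_LINE = re.compile(
    r"^[ \t]*\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def coordinate_style(wikitext):
    """Return the label of the coordinate parameter pair used, or None."""
    params = dict(PARAM_LINE.findall(wikitext))
    for style, names in PARAM_STYLES.items():
        if all(params.get(name) for name in names):
            return style
    return None


warsaw = "| гео-ширина       = 52.217\n| гео-дължина      = 21.033\n"
print(coordinate_style(warsaw))
```

Pairing this label with a check on the abstract would show whether the bug really follows the parameter style.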

Weirdness

jimkont commented 9 years ago

Can you check whether you include this in the MediaWiki LocalSettings.php?

$wgExtractsRemoveClasses = array_merge($wgExtractsRemoveClasses, array(
    '.metadata', 'span.coordinates', 'span.geo-multi-punct', 'span.geo-nondefault',
    '#coordinates', '.reflist', '.citation', '#toc', '.tocnumber',
    '.references', '.reference', '.noprint'
));

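To illustrate why this setting matters: if elements such as span.coordinates are not removed before the HTML is flattened to text, their content ends up at the front of the extract. The sketch below mimics that idea with Python's standard html.parser. It is not how the MediaWiki extension is implemented; it supports only a simplified (tag, class) subset of the selectors, and the sample HTML is invented for illustration.

```python
from html.parser import HTMLParser

# Simplified stand-in for a few of the selectors above: (tag, class) pairs,
# where None means "any tag". ID selectors are left out for brevity.
REMOVE = [("span", "coordinates"), (None, "metadata"), (None, "reference")]

# Elements that never have a closing tag, so they must not affect nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a removed element
        self.parts = []

    def _is_removed(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return any((t is None or t == tag) and c in classes for t, c in REMOVE)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or self._is_removed(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Invented sample markup, not taken from the actual page.
html = '<p><span class="coordinates">46.466667, 30.733333</span><b>Одеса</b> е град.</p>'
print(extract_text(html))
```

With ("span", "coordinates") dropped from REMOVE, the same call would return the coordinates glued to the first word, which is the symptom reported in this issue.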
VladimirAlexiev commented 9 years ago

@boyan-simeonov Please check

jimkont commented 9 years ago

@VladimirAlexiev @boyan-simeonov Did this solve your issue? If so, feel free to close :)

VladimirAlexiev commented 8 years ago

Checked the two cities mentioned above and they don't have the bug: