dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

en.dbpedia.org instead of dbpedia.org ? #445

Open VladimirAlexiev opened 8 years ago

VladimirAlexiev commented 8 years ago

The extraction sampler at http://mappings.dbpedia.org/server/extraction/en/extract?title=Great_Britain_men%27s_national_basketball_team&format=turtle-triples&extractors=custom makes triples with en.dbpedia.org (which does not resolve) instead of dbpedia.org, e.g.:

http://en.dbpedia.org/resource/Great_Britain_men's_national_basketball_team (as subject) and http://en.dbpedia.org/resource/British_Basketball (as object).

So at least the extraction sampler is broken in this regard. But I suspect that production data is also broken, because http://dbpedia.org/resource/Great_Britain_men%27s_national_basketball_team returns nothing. (Yes, there is a page https://en.wikipedia.org/wiki/Great_Britain_men%27s_national_basketball_team, and it has existed for a few years.)

VladimirAlexiev commented 8 years ago

The same holds for raw properties: the output above uses http://en.dbpedia.org/property/ instead of http://dbpedia.org/property/

jimkont commented 8 years ago

Actually, dbpedia.org is the exception to all rules since I18n was actively enabled :) In the same way that we have fr.dbpedia.org from fr.wikipedia.org, we should also have en.dbpedia.org, but it was too late to change that and many applications would break if we did.

So the whole framework uses this language convention, but for en we have a special rule at the end of the extraction pipeline that replaces en.dbpedia.org with dbpedia.org.

It was not easy to apply this processing to all extraction outputs, so the extraction sampler has been like this for the last few years.
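For illustration, here is a minimal Python sketch of what such an end-of-pipeline rule does. It is not the framework's actual code: the function names `canonicalize_en_iri` and `canonicalize_triple` are hypothetical, and it assumes that only the host part of an IRI needs rewriting.

```python
import re

# Hypothetical helper, NOT the extraction framework's real implementation.
# Matches only the English host at the very start of an IRI, so other
# language editions (fr.dbpedia.org, de.dbpedia.org, ...) are left alone.
_EN_HOST = re.compile(r"^(https?://)en\.dbpedia\.org/")


def canonicalize_en_iri(iri: str) -> str:
    """Rewrite an en.dbpedia.org IRI to the canonical dbpedia.org form."""
    return _EN_HOST.sub(r"\1dbpedia.org/", iri, count=1)


def canonicalize_triple(triple):
    """Apply the rewrite to subject, predicate and object of one triple."""
    return tuple(canonicalize_en_iri(term) for term in triple)


if __name__ == "__main__":
    print(canonicalize_en_iri(
        "http://en.dbpedia.org/resource/British_Basketball"))
    print(canonicalize_en_iri("http://en.dbpedia.org/property/"))
```

A sampler that skips this step would emit the en.dbpedia.org subjects, objects and raw property namespace reported above.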

We can either close this or leave it open in case it is picked up as a GSoC warm-up task.

VladimirAlexiev commented 8 years ago

Please keep it open at least until it's explained why http://dbpedia.org/page/Great_Britain_men's_national_basketball_team is missing, even though the resource is returned by this query: select * {?country a dbo:Country}

jimkont commented 8 years ago

This is a different issue. @pkleef, is this related to the new 2015-10 version? I see the data are not yet deployed on dbpedia.org, but maybe the code from the adjusted VAD was.

jimkont commented 8 years ago

@VladimirAlexiev I took a closer look: the dbo:Country triple comes from the ST-Types provided by @HeikoPaulheim, so this is a duplicate of #241 and #414.

Regarding the display of http://dbpedia.org/page/Great_Britain_men's_national_basketball_team, if you do a DESCRIBE it works fine.
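To reproduce the DESCRIBE check, a request URL can be built as in the small Python sketch below. The assumptions are mine and not stated in the thread: that the public endpoint is http://dbpedia.org/sparql, that the resource IRI is the /resource/ counterpart of the /page/ URL above, and that the helper name `describe_url` is purely illustrative.

```python
from urllib.parse import urlencode

# Assumed public SPARQL endpoint; adjust if your deployment differs.
SPARQL_ENDPOINT = "http://dbpedia.org/sparql"


def describe_url(resource_iri: str, endpoint: str = SPARQL_ENDPOINT) -> str:
    """Build a GET URL that runs `DESCRIBE <resource_iri>` on the endpoint."""
    query = f"DESCRIBE <{resource_iri}>"
    # `query` is the standard SPARQL protocol parameter name.
    return f"{endpoint}?{urlencode({'query': query})}"


if __name__ == "__main__":
    # Assumed resource IRI, derived from the /page/ URL in this thread.
    print(describe_url(
        "http://dbpedia.org/resource/"
        "Great_Britain_men's_national_basketball_team"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should show whether the endpoint returns triples for the resource even when the /page/ view is empty.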