dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

remove Hidden_categories #389

Open VladimirAlexiev opened 9 years ago

VladimirAlexiev commented 9 years ago

Filter out maintenance (hidden) categories and don't emit them in the dataset. These categories are useful only to Wikipedia maintainers, not to content consumers.

Unfortunately, DBpedia does not extract classifications that come from templates (transclusion); see #378. Most hidden cats are marked that way.

I think extracting from templates will be very hard to implement. Other possible sources:

SQL

All classifications are available via SQL, e.g. from http://quarry.wmflabs.org:

-- Pages that are direct members of Category:Hidden categories
SELECT page_title, page_id
FROM page
JOIN categorylinks ON cl_from = page_id
WHERE cl_to = 'Hidden_categories'

Quarry has a timeout of 10 minutes, so it isn't appropriate for large-scale querying. If you go with SQL, you'd probably have to make a local copy of the DB.

Wikipedia API

https://www.mediawiki.org/wiki/API:Categorymembers
https://www.mediawiki.org/wiki/Special:ApiSandbox
http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&format=json&cmtitle=Category%3AHidden%20categories&cmprop=title%7Ctype&cmtype=page%7Csubcat&cmlimit=100
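Since the example URL returns at most 100 members per request, a client has to follow the API's continuation tokens to get the full list. Below is a minimal Python sketch of that loop, not a tested extractor. The function names, the User-Agent string and the injectable `fetch` argument are my own assumptions for illustration. The query parameters are copied from the URL above, and the `continue` handling assumes the JSON layout of the current MediaWiki API.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(cont=None, limit=100):
    """Query parameters mirroring the example URL; `cont` carries continuation tokens."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "format": "json",
        "cmtitle": "Category:Hidden categories",
        "cmprop": "title|type",
        "cmtype": "page|subcat",
        "cmlimit": str(limit),
    }
    if cont:
        # The API asks clients to send back every key of the "continue" object.
        params.update(cont)
    return params


def fetch_json(params):
    """Perform one HTTP GET against the API (needs network access)."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace with one that identifies your client.
    req = urllib.request.Request(url, headers={"User-Agent": "hidden-cats-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_hidden_categories(fetch=fetch_json):
    """Yield member titles, requesting further batches until no "continue" is returned."""
    cont = None
    while True:
        data = fetch(build_params(cont))
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        cont = data.get("continue")
        if not cont:
            break
```

Calling `set(iter_hidden_categories())` would collect the titles of the direct members. Passing a different `fetch` function lets you try the loop without hitting the live API.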

You could try different return formats:

Notes

VladimirAlexiev commented 9 years ago

There's a small difference between admin cats and hidden cats. See https://en.wikipedia.org/wiki/Wikipedia:PROJCATS: "administration category..on article pages ..should be made a hidden category"

Nevertheless, I think Hidden_categories is the most precise way we have to find out which are admin cats.
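To make the request concrete, here is a rough Python sketch of the filtering step, assuming the set of hidden category titles has already been obtained from one of the sources above. `normalize` and `drop_hidden` are hypothetical names, not part of the extraction framework, and the sample pairs in the usage note are made up.

```python
def normalize(title):
    """Drop a leading 'Category:' and use underscores, as in the cl_to column."""
    prefix = "Category:"
    if title.startswith(prefix):
        title = title[len(prefix):]
    return title.replace(" ", "_")


def drop_hidden(links, hidden_titles):
    """Keep only (article, category) pairs whose category is not hidden."""
    hidden = {normalize(t) for t in hidden_titles}
    return [(article, cat) for article, cat in links if normalize(cat) not in hidden]
```

For example, `drop_hidden([("Berlin", "Category:Capitals in Europe"), ("Berlin", "Category:Articles with hCards")], {"Articles_with_hCards"})` keeps only the first pair, because both spellings of the hidden title normalize to the same key.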