dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

extract categorizations hidden in templates (transclusions) #378

Open VladimirAlexiev opened 9 years ago

VladimirAlexiev commented 9 years ago

Stub categories seem to be implemented most of the time through transclusion of templates. E.g., the article https://en.wikipedia.org/w/index.php?title=Şıra&action=edit has:

{{Turkey-cuisine-stub}}
{{nonalcoholic-drink-stub}}

The fact that something is a stub category is not semantic knowledge (that is Wikipedia editorial knowledge). However, its super-categories are semantic knowledge. While for the first one there's no loss since the article includes

[[Category:Turkish cuisine]]

there's no corresponding category link for the second one.

Extracting this is a tall order, since it's buried in a chain of templates: https://en.wikipedia.org/w/index.php?title=Template:Nonalcoholic-drink-stub&action=edit:

 | subject   =[[non-alcoholic beverage]]–related
 | category  =Non-alcoholic beverage stubs

https://en.wikipedia.org/w/index.php?title=Category:Non-alcoholic_beverage_stubs&action=edit:

{{Stub Category|article=[[non-alcoholic beverage]]s|newstub=nonalcoholic-drink-stub|category=non-alcoholic beverages}}
[[Category:Drink stubs| Non-alcoholic]]
[[Category:Beverages Task Force|Σ]]

https://en.wikipedia.org/w/index.php?title=Template:Stub_Category&action=edit https://en.wikipedia.org/w/index.php?title=Template:Stub_category&action=edit
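To make the chain above concrete, here is a minimal Python sketch (not the extraction framework's code) of the two lookups each hop needs: pulling a `category` parameter out of template wikitext, and pulling `[[Category:...]]` links out of a category page. The function names are hypothetical, and the regexes are a simplification that assumes parameter values contain no nested `|` or `}`.

```python
import re

# [[Category:Name]] or [[Category:Name|sort key]]; captures only the name.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]", re.IGNORECASE
)

# "| category = value" inside a template call or definition.
# Assumption: the value runs up to the next "|", "}" or end of line.
CATEGORY_PARAM = re.compile(r"\|[ \t]*category[ \t]*=[ \t]*([^|}\n]+)")


def category_links(wikitext):
    """Return the category names linked directly in the wikitext."""
    return [m.group(1).strip() for m in CATEGORY_LINK.finditer(wikitext)]


def template_category_params(wikitext):
    """Return the values of every 'category' template parameter."""
    return [m.group(1).strip() for m in CATEGORY_PARAM.finditer(wikitext)]
```

Applied to the snippets quoted above, the first hop (the stub template) would yield the stub category name, and the second hop (the stub category page) would yield its parent categories plus the `category` argument passed to `{{Stub Category}}`. A real implementation would need a proper wikitext parser instead of regexes.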

Isn't there a Wikipedia API to get all categories of an article? I think it'd be better to handle categories this way...
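For reference, the MediaWiki Action API does expose this via `action=query&prop=categories`, which returns the categories of the rendered page, including those added by transcluded templates. A minimal Python sketch follows; the helper names and the User-Agent string are made up for illustration, and result continuation (`clcontinue`) is deliberately left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def categories_url(title, include_hidden=False):
    """Build a prop=categories query URL for one page title."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",
        "format": "json",
        "formatversion": "2",  # pages come back as a list
    }
    if not include_hidden:
        params["clshow"] = "!hidden"  # skip hidden (maintenance) categories
    return API + "?" + urlencode(params)


def parse_categories(response):
    """Collect category titles from a decoded formatversion=2 response."""
    titles = []
    for page in response.get("query", {}).get("pages", []):
        for cat in page.get("categories", []):
            titles.append(cat["title"])
    return titles


def fetch_categories(title):
    """Fetch and parse in one step (performs a network request)."""
    req = Request(categories_url(title),
                  headers={"User-Agent": "category-sketch/0.1 (example)"})
    with urlopen(req) as resp:
        return parse_categories(json.load(resp))
```

Calling `fetch_categories("Şıra")` would list the article's visible categories. Whether querying the live API fits the framework's dump-based extraction is a separate question.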

VladimirAlexiev commented 9 years ago

Stub cats are not a good example, since they are hidden categories and thus not supposed to be useful to the user (see #389).

The example above points out that the knowledge "Şıra is a non-alcoholic drink" is lost unless we handle the "nonalcoholic-drink-stub" template, which produces https://en.wikipedia.org/wiki/Category:Non-alcoholic_beverage_stubs. However, the editorial policy says an article should have both content categories and a stub marker, so hopefully this is an insignificant exception.

Put on low priority until we find examples of useful cats coming from transclusions.

VladimirAlexiev commented 9 years ago

I'm wrong about stubs being hidden: https://en.wikipedia.org/wiki/Wikipedia:HIDDENCAT: "stub categories or "uncategorized article" categories .. are not hidden". Indeed, https://en.wikipedia.org/wiki/Category:Turkish_cuisine_stubs is not a hidden cat.

But nevertheless, content cats should not normally come from templates: https://en.wikipedia.org/wiki/Wikipedia:TCAT: "it is recommended that articles not be placed in ordinary content categories using templates"