This is not a bug in textacy's Wikipedia dump parsing code, nor in the upstream mwparserfromhell. It's more a request for some simple regexp changes that would make the parsing a bit more defensive against bad author markup.
Expected Behavior
Ideally, you'd expect all categories to be matched. It isn't a perfect world, however, and author markup errors do exist.
Current Behavior
I've discovered that several thousand categories within a wikidump are marked up in lower case, although this is only a small percentage of the total.
Number of occurrences (grep can't tell me how these are distributed across namespaces):
# enwikinews: 2580
# enwiki: 14562
Possible Solution
When checking for category links, simply match links that start with either the expected string MAPPING_CAT[self.lang] or the lower-case version of that string.
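To make the suggestion concrete, here is a minimal sketch of the idea. It is not textacy's actual code: the MAPPING_CAT dict below is a one-entry stand-in, and category_link_pattern and extract_categories are hypothetical helper names I made up for illustration.

```python
import re

# Assumption: a stand-in for textacy's per-language category prefix table.
MAPPING_CAT = {"en": "Category:"}


def category_link_pattern(lang):
    """Build a regex matching [[Category:...]] and [[category:...]] links."""
    prefix = MAPPING_CAT[lang]
    # Accept the expected prefix and its lower-case variant (deduplicated,
    # in case the prefix is already lower case for some language).
    variants = dict.fromkeys([prefix, prefix.lower()])
    alternatives = "|".join(re.escape(v) for v in variants)
    # Capture the category name up to a sort key ("|") or the closing "]]".
    return re.compile(r"\[\[\s*(?:" + alternatives + r")\s*([^\]|]+)")


def extract_categories(wikitext, lang="en"):
    """Return the category names linked from a page's raw wikitext."""
    pattern = category_link_pattern(lang)
    return [m.group(1).strip() for m in pattern.finditer(wikitext)]
```

With this, a page that mixes [[category:FLOSS]] and [[Category:Programmers|Stallman]] yields both names. Only the two spellings named above are accepted; other casings such as CATEGORY: would still be skipped.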
Steps to Reproduce (for bugs)
Context
I'm extracting namespace 14 (Categories), where each page's title is itself a category and the page's categories are that category's parent categories. As I iterate over the records, I simply add these edges to a 'networkx.DiGraph' and end up with the category hierarchy.
I noticed that some category nodes are isolates (nodes not connected to any other node). Apart from internal Wikimedia tracking categories, which aren't supposed to be linked to by anything other than pages with issues, that doesn't really make sense. For example, while browsing the list of isolates in enwikinews, I found that Category:Richard Stallman wasn't linked to anything, but the page for that category, https://en.wikinews.org/wiki/Category:Richard_Stallman, shows that it should be linked to Category:News_articles_by_person, Category:FLOSS, Category:Programmers, and Category:Activists. It turns out that the page author used 'category' rather than 'Category' for all of these, so they are missed.
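For anyone who wants to check their own dump for the same symptom, the isolate check can be sketched roughly as below. To keep it dependency-free I use plain Python sets here instead of networkx, and both the find_isolates name and the (title, parent categories) record shape are assumptions for illustration, not textacy's record format.

```python
def find_isolates(records):
    """Return category titles that have no edge to or from any other node.

    `records` is an iterable of (title, parent_categories) pairs, where
    each parent category contributes one title -> parent edge.
    """
    nodes = set()
    connected = set()
    for title, parents in records:
        nodes.add(title)
        for parent in parents:
            nodes.add(parent)
            # Both endpoints of an edge are, by definition, not isolates.
            connected.update((title, parent))
    return sorted(nodes - connected)


# Toy data: the second page's lower-case links were missed by the
# parser, so it arrives with no parent categories at all.
records = [
    ("Category:FLOSS", ["Category:Software"]),
    ("Category:Richard Stallman", []),
]
print(find_isolates(records))
```

A category only shows up as an isolate if it is also never used as a parent by any other page, which is why most of the missed links go unnoticed.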
Your Environment
spacy version:
spacy models:
textacy version: