chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

some wikipedia categories aren't matched by regexp #222

Closed ckot closed 5 years ago

ckot commented 5 years ago

This is neither a bug in textacy's Wikipedia dump parsing code nor in the upstream mwparserfromhell; it's more about some simple regexp changes to be a bit defensive against bad author markup.

Expected Behavior

Ideally, you'd expect all categories to be matched. It isn't a perfect world, however, and author markup errors do exist.

Current Behavior

I've discovered that several thousand categories within a wiki dump, although only a small percentage of the total, are marked up in lower-case.

Number of occurrences (grep can't tell me how these are distributed across namespaces):

# enwikinews: 2580
# enwiki: 14562

Possible Solution

When checking for category links, simply match links that start with either the expected string MAPPING_CAT[self.lang] or the lower-case version of that string.
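A minimal sketch of that check, not textacy's actual code: the contents of `MAPPING_CAT` below and the helper name `category_name` are assumptions made for illustration, and the link title is assumed to have already been pulled out of the `[[...]]` markup.

```python
# Hypothetical stand-in for textacy's MAPPING_CAT; the real table may differ.
MAPPING_CAT = {"en": "Category:", "de": "Kategorie:"}


def category_name(link_title, lang="en"):
    """Return the category name if link_title is a category link, else None.

    Accepts both the expected prefix (e.g. "Category:") and its
    lower-case form (e.g. "category:").
    """
    prefix = MAPPING_CAT[lang]
    title = link_title.strip()
    # str.startswith accepts a tuple and returns True if any entry matches.
    if title.startswith((prefix, prefix.lower())):
        return title[len(prefix):].strip()
    return None
```

Checking exactly two spellings keeps the change narrow; other variants such as `CATEGORY:` would still be skipped, which may or may not be what is wanted.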

Steps to Reproduce (for bugs)

Context

I'm extracting namespace 14 (Categories) where the page's title represents a category itself, and the page's categories represent that category's parent categories. As I iterate over the records, I simply add these edges to a 'networkx.DiGraph', and end up with the category hierarchy.

I noticed that some category nodes are isolates (nodes not connected to any other nodes). Apart from internal Wikimedia tracking categories (which aren't supposed to be linked to by anything other than pages with issues), that doesn't really make sense. For example, while browsing through the list of isolates (enwikinews), I found that Category:Richard Stallman wasn't linked to anything, but the page for that category https://en.wikinews.org/wiki/Category:Richard_Stallman shows that it should be linked to Category:News_articles_by_person, Category:FLOSS, Category:Programmers, and Category:Activists. It turns out that the page author used 'category' rather than 'Category' for all of these, so they are missed.
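To illustrate how missed category links show up as isolates, here is a rough standard-library sketch. The function name `find_isolates` and the `(title, parents)` input shape are assumptions, not the code I actually ran; with a `networkx.DiGraph`, `networkx.isolates(graph)` yields the same kind of nodes.

```python
def find_isolates(pages):
    """Return category titles that have no parent and no child edges.

    pages: iterable of (category_title, parent_category_titles) pairs,
    one per namespace-14 page (hypothetical input shape).
    """
    nodes = set()
    connected = set()
    for title, parents in pages:
        nodes.add(title)
        for parent in parents:
            nodes.add(parent)
            # Both endpoints of an edge are, by definition, not isolates.
            connected.update((title, parent))
    return sorted(nodes - connected)


# A page whose parent links were all missed ends up with an empty parent list.
pages = [
    ("Richard Stallman", []),
    ("FLOSS", ["Computing"]),
    ("Programmers", ["People"]),
]
isolates = find_isolates(pages)
```

Here `isolates` contains only "Richard Stallman", because none of its lower-case category links were picked up.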

Your Environment

ckot commented 5 years ago

I've added PR #223, which addresses this.

bdewilde commented 5 years ago

Closing this issue, now that the PR has been merged.