chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

small change to catch bad wiki category markup #223

Closed ckot closed 5 years ago

ckot commented 5 years ago

Although I imagine textacy shouldn't be involved in a cat-and-mouse game of handling bad wiki markup, this simple change allows categories that are marked up in lower case to be caught, which accounts for a few thousand extra category links being recognized in a MediaWiki dump.

Description

It simply searches the wiki_links extracted via mwparserfromhell for category links matching both the string given by MAPPING_CAT[self.lang] and the lower-case version of that string.
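For illustration, a minimal sketch of the kind of check described; the `MAPPING_CAT` name comes from the PR, but the helper function and everything around it are assumptions, not textacy's actual implementation:

```python
# Minimal sketch, assuming MAPPING_CAT maps a language code to that
# language's category-namespace prefix (e.g. "Category" for English).
MAPPING_CAT = {"en": "Category"}

def is_category_link(link_title, lang="en"):
    """Return True for category links, accepting both a well-formed
    "Category:Foo" title and a badly-marked-up "category:Foo" title."""
    cat = MAPPING_CAT[lang]
    return link_title.startswith(cat + ":") or link_title.startswith(cat.lower() + ":")

print(is_category_link("category:Politics"))  # True, despite the bad markup
```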

Motivation and Context

It catches a larger portion of category links at minimal computational cost.

This fixes issue #222, which I created.

How Has This Been Tested?

I don't know how to truly test this, since the input is as non-deterministic as natural language itself. I don't know how many category links are still being missed; I can only confirm a net increase in the number of categories recognized compared to before this simple change.

My testing of this is external to this code. I use textacy's processing of namespace 0 and namespace 14 (categories) to create a networkx.DiGraph representing the category hierarchy. The resulting graph of categories produced with this fix has far fewer isolated nodes.
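For context, a rough sketch of that external check; the helper and its inputs are my own framing, not code from this PR:

```python
import networkx as nx

# Hypothetical sketch: build the category hierarchy as a directed graph
# and count the nodes that are completely disconnected from it.
def count_isolates(category_titles, category_edges):
    graph = nx.DiGraph()
    graph.add_nodes_from(category_titles)  # one node per category page
    graph.add_edges_from(category_edges)   # (subcategory, parent) links
    return len(list(nx.isolates(graph)))
```

Fewer isolates after the change means more category links were actually recognized.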

The change is harmless to the rest of the code; however, since I'm unable to compare the raw content to the markup-stripped content, I didn't modify the regexp used to remove category links in the text() method, meaning this only impacts the records() method.


bdewilde commented 5 years ago

Looks good to me! Do you think there would be any issues if I changed the regex pattern you mentioned to be case-insensitive (i.e., added re.IGNORECASE to re_categories)?
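For illustration only, the suggested flag change would look roughly like this; the pattern shown is made up, not textacy's actual re_categories:

```python
import re

# Illustrative pattern only. With re.IGNORECASE, "[[category:Foo]]" is
# stripped as well as "[[Category:Foo]]".
re_categories = re.compile(r"\[\[Category:[^\]]*\]\]", flags=re.IGNORECASE)

text = "Intro text. [[category:Politics]] More text. [[Category:News]]"
print(re_categories.sub("", text))  # both links removed, regardless of case
```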

ckot commented 5 years ago

Probably wouldn't be an issue. The only reason I didn't modify the regexp as well was that I noticed it wasn't used by records(), and that's all I use. I don't know how changes that only affect text() can be tested; not that I provided a unit test for my change, but at least extracting 'Category' and the lower-case version of 'Category' from a list of already-extracted wikilinks is easier to test.

Also, you might want to consider allowing an optional space between {{MAPPING_CAT[self.lang]}}: and the category itself (more bad author markup). Prior to discovering that this project supported processing MediaWiki dumps, I was going to submit a PR to another project, github:attardi/wikiextractor, but I think that project might be dead. Anyway, mwparserfromhell seems resilient enough to catch those, but if you want to catch them in text() you'll probably want to add the optional space to the regexp, as sketched below.
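Sketched, the optional-space tweak might look like this, again with a made-up pattern rather than textacy's real one:

```python
import re

# "\s?" after the colon lets "[[Category: Foo]]" match as well as
# "[[Category:Foo]]". Illustrative pattern only.
re_categories = re.compile(r"\[\[Category:\s?[^\]]*\]\]", flags=re.IGNORECASE)
```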

Anyway, I apologize for all the individual issues and PRs, but I think I have a few more issues which I'm not sure are bugs or features. Should I email you and talk offline regarding these, or should I simply post more issues? Long story short:

1. The extracted categories are still present at the end of the article's text.
2. There is one particular template, {{w|link|label}}, which should get converted to [[link|label]] (a wikilink) and could be expanded via a simple regexp substitution. The biggest problem with the latter is that the label for the template's wikilink isn't present in the extracted text, and this happens a lot, at least in Wikinews. I think this could be considered low-hanging fruit: it could be done prior to parsing, wouldn't require proper template expansion, and would greatly improve the extracted text.
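A rough sketch of the pre-parse substitution suggested in point 2; the helper is hypothetical, not something in textacy:

```python
import re

# Rewrite {{w|link|label}} templates into plain [[link|label]] wikilinks
# before parsing, so the label survives markup stripping.
re_w_template = re.compile(r"\{\{w\|([^|{}]+)\|([^|{}]+)\}\}")

def expand_w_templates(wikitext):
    return re_w_template.sub(r"[[\1|\2]]", wikitext)

print(expand_w_templates("See {{w|Barack Obama|Obama}} for details."))
# -> "See [[Barack Obama|Obama]] for details."
```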

bdewilde commented 5 years ago

Hey @ckot , please keep posting issues — and better yet, PRs. ;) I really appreciate you catching all these edge cases and expanding the code's functionality. In the meantime, I'm going to merge and close this PR out, and maybe commit the aforementioned regex update.