chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

categories extracted from wikidumps are still present in extracted text #226

ckot closed this issue 5 years ago

ckot commented 5 years ago

Currently, Wikipedia.records() yields page dicts which contain the categories in the 'categories' field, yet the categories are also still present in the page's 'text' field (typically at the end of the string).
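For concreteness, a rough sketch of the behavior being described, assuming the textacy.datasets.Wikipedia API referenced in this issue (exact constructor and records() arguments may differ by version):

```python
# Rough sketch of the reported behavior; assumes the textacy.datasets.Wikipedia
# dataset referenced in this issue (exact arguments may differ by version).
import textacy.datasets

wp = textacy.datasets.Wikipedia(lang="en")
wp.download()  # fetch and cache the wikidump, if not already present
for page in wp.records(limit=1):
    print(page["categories"])   # e.g. ['Iran', 'Nuclear weapons', ...]
    print(page["text"][-200:])  # ...ends with "\nCategory:Iran\nCategory:..."
```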

IMHO, this is a bug, not a feature: I think the purpose is to extract useful information such as wikilinks and categories, but otherwise remove wiki markup from the text, although I could see how others may disagree.

Expected Behavior

Categories should be in the 'categories' field and removed from the 'text' field.

Current Behavior

The end (typically) of a page's 'text' field still contains the string representation of the category links.

example: \nCategory:Iran\nCategory:Politics and conflicts\nCategory:Nuclear proliferation\nCategory:Nuclear weapons\nCategory:Nuclear power

Possible Solution

Provided that this is agreed to be a bug, the fix should be simple, and I'm willing to submit a PR which does the following (or something similar). If instead this is considered a feature, the user could be given control over whether the following code runs, via a non-default value for some new parameter added to 'records()'.

In _parse_content(), where you're iterating over the page's sections, add a call to ifilter_wikilinks(matches=is_category_link), where 'is_category_link()' simply returns a bool indicating whether obj.title is in categories (the list of categories already extracted). As with all the other existing usages of ifilter_*, section.remove(obj) would be applied to the matches.
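For concreteness, here's a minimal sketch of the proposed fix using mwparserfromhell directly; 'remove_category_links' and 'is_category_link' are illustrative names, not textacy's actual code:

```python
# Minimal sketch of the proposed fix, using mwparserfromhell directly.
# `remove_category_links` and `is_category_link` are illustrative names;
# `categories` is assumed to hold the titles already extracted into the
# page dict's 'categories' field.
import mwparserfromhell

def remove_category_links(wikicode, categories):
    def is_category_link(obj):
        # obj is a Wikilink node; compare its title against the
        # already-extracted category titles
        return str(obj.title) in categories

    # materialize the lazy iterator before mutating the parse tree
    for link in list(wikicode.ifilter_wikilinks(matches=is_category_link)):
        wikicode.remove(link)

code = mwparserfromhell.parse(
    "Some article text.\n[[Category:Iran]]\n[[Category:Nuclear power]]"
)
remove_category_links(code, {"Category:Iran", "Category:Nuclear power"})
print(code.strip_code())  # the trailing category links are gone
```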

Steps to Reproduce (for bugs)

Simply look at (or near) the end of any page's 'text' field whose 'categories' field has length > 0.

Context

I'd prefer that the page's 'text' field were as clean as possible, requiring only relatively standard English normalization techniques rather than additional removal of wiki-specific markup, especially for items which have already been extracted and exist in the page's other fields.

Your Environment

bdewilde commented 5 years ago

Hi @ckot , I hear your point, but I think I prefer to leave the categories section as-is. Conceptually, I think of the text of related categories as potentially useful content for various NLP tasks, unlike, say, the list of references. There are also technical issues with this — if I understand correctly, your solution would remove category wikilinks throughout the text, rather than removing an entire categories section. No? And if we did remove category sections here, we'd also want to do something similar in the "fast" path to page text via strip_markup(). I'm not sure how that would work. To be honest, I don't remember much about coding this module up except that it was a total nightmare. Wikimedia markup is a lot. 😰

Is there an easy workaround — say, removing lines that start with "\nCategory:", or removing everything after a section named "Categories", or something else?
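For example, one such workaround on the yielded text (assuming an English dump, where the link prefix is "Category:"):

```python
# One possible workaround on the yielded text, per the suggestion above;
# assumes an English dump, where the link prefix is "Category:".
import re

CATEGORY_LINE_RE = re.compile(r"\nCategory:[^\n]*")

def strip_category_lines(text: str) -> str:
    """Drop newline-delimited 'Category:' links from a page's text."""
    return CATEGORY_LINE_RE.sub("", text)

text = "Article body.\nCategory:Iran\nCategory:Nuclear power"
print(strip_category_lines(text))  # -> "Article body."
```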

bdewilde commented 5 years ago

~Well, I think I found a possible implementation for the "not-fast" text path: call wikicode.get_sections(match=[CATEGORY_REGEX]), where that regex is derived from the language-specific cat_link defined earlier in the method, then just wikicode.remove(section). Might be worth exploring...~

Oops, no, pretty sure I mis-remembered how categories are included in the raw wiki markup text... 🤦‍♂️

Looks like wikicode.filter_wikilinks(matches=CATEGORY_REGEX) would work, assuming there are no category wikilinks sprinkled throughout the text. Do you know if that ever happens? If they are always a list of newline-separated category links, it should be doable to write a regex for the fast path that strips them out.
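A quick sketch of that idea; CATEGORY_REGEX here is illustrative and hard-coded for English, whereas the real one would be derived from the language-specific cat_link mentioned above:

```python
# Quick sketch of the filter_wikilinks approach; CATEGORY_REGEX is
# illustrative and hard-coded for English, whereas the real one would be
# derived from the language-specific cat_link.
import mwparserfromhell

CATEGORY_REGEX = r"^\[\[Category:"  # re.search'd against each node's wikitext

code = mwparserfromhell.parse(
    "Body text.\n[[Category:Iran]]\n[[Category:Nuclear power]]"
)
for link in code.filter_wikilinks(matches=CATEGORY_REGEX):
    code.remove(link)
print(code.strip_code())  # category links removed from the text
```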

bdewilde commented 5 years ago

This issue is also addressed by PR #235 — would appreciate your input!