chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

strip_markup in textacy.datasets.wikipedia doesn't work as expected #197

Closed tyjch closed 5 years ago

tyjch commented 6 years ago

I am trying to generate text and metadata streams from Wikipedia, using the `wikipedia` package to retrieve article content. I then want to remove the MediaWiki markup, so I've tried `strip_markup` from `textacy.datasets.wikipedia`. For some articles this produces empty strings instead of stripping the markup correctly.

Expected Behavior

I was expecting the function to clean markup from text formatted from Wikipedia the same as it formats articles pulled from a dump.

Current Behavior

Empty strings are produced for some articles but not others.

Steps to Reproduce (for bugs)

```python
import wikipedia
from textacy.datasets.wikipedia import strip_markup

def generate_streams(page_titles):
    text_list = []
    meta_list = []

    for title in page_titles:
        wikipage = wikipedia.WikipediaPage(title)

        text = str(wikipage.content)
        text = strip_markup(text)

        text_list.append(text)
        meta_list.append({'title': wikipage.title,
                          'categories': wikipage.categories,
                          'links': wikipage.links})

    return text_list, meta_list
```
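For context, here is a rough, self-contained sketch of what "stripping MediaWiki markup" means conceptually (dropping `{{template}}` calls, unwrapping `[[link|label]]` syntax, removing quote-based bold/italic markers). This is NOT textacy's implementation, just a hedged illustration of the kind of input `strip_markup` is meant to handle:

```python
import re

def strip_wiki_markup_sketch(text):
    """Very rough illustration of MediaWiki markup removal.

    This is not textacy's strip_markup -- only a sketch of the idea,
    and it handles only non-nested templates and simple links.
    """
    # Drop {{template}} invocations (non-nested only, for simplicity)
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", text)
    # '''bold''' and ''italic'' markers -> plain text
    text = re.sub(r"'{2,}", "", text)
    return text

print(strip_wiki_markup_sketch(
    "'''Python''' is a [[programming language|language]]. {{citation needed}}"
))
```

Note that text returned by the `wikipedia` package's `WikipediaPage.content` may already be plain text rather than raw wikitext, which could matter for how `strip_markup` behaves on it.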

Context

I'm only interested in a subset of Wikipedia as a corpus: the full dump is too large to work with efficiently on a single computer, and much of it is irrelevant to what I'm trying to accomplish. I wrote a script that selects related articles from Wikipedia in a manner similar to spreading activation. However, these articles aren't being stripped of markup the way articles pulled from a dump are.

Your Environment

bdewilde commented 6 years ago

Hi @Tyjch, there are several possible reasons for the behavior you've observed.