chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

strip_markup in textacy.datasets.wikipedia doesn't work as expected #197

Closed tyjch closed 5 years ago

tyjch commented 6 years ago

I am trying to generate text and metadata streams from Wikipedia, using the `wikipedia` package to retrieve article content. I then want to remove the MediaWiki markup, so I've tried `strip_markup` from `textacy.datasets.wikipedia`. For some articles this produces empty strings instead of stripping the markup correctly.

Expected Behavior

I was expecting the function to clean markup from text formatted from Wikipedia the same as it formats articles pulled from a dump.

Current Behavior

Empty strings are produced for some articles but not others.

Steps to Reproduce (for bugs)

```python
import wikipedia
from textacy.datasets.wikipedia import strip_markup

def generate_streams(page_titles):
    text_list = []
    meta_list = []

    for title in page_titles:
        wikipage = wikipedia.WikipediaPage(title)

        text = str(wikipage.content)
        text = strip_markup(text)

        text_list.append(text)
        meta_list.append({'title': wikipage.title,
                          'categories': wikipage.categories,
                          'links': wikipage.links})

    return text_list, meta_list
```
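For context, here is a rough, self-contained sketch of what "stripping MediaWiki markup" means conceptually (dropping `{{template}}` calls, unwrapping `[[link|label]]` syntax, removing quote-based bold/italic markers). This is NOT textacy's implementation, just a hedged illustration of the kind of input `strip_markup` is meant to handle:

```python
import re

def strip_wiki_markup_sketch(text):
    """Very rough illustration of MediaWiki markup removal.

    This is not textacy's strip_markup -- only a sketch of the idea,
    and it handles only non-nested templates and simple links.
    """
    # Drop {{template}} invocations (non-nested only, for simplicity)
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", text)
    # '''bold''' and ''italic'' markers -> plain text
    text = re.sub(r"'{2,}", "", text)
    return text

print(strip_wiki_markup_sketch(
    "'''Python''' is a [[programming language|language]]. {{citation needed}}"
))
```

Note that text returned by the `wikipedia` package's `WikipediaPage.content` may already be plain text rather than raw wikitext, which could matter for how `strip_markup` behaves on it.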

Context

I'm only interested in a subset of Wikipedia as a corpus: the full dump is too large to work with efficiently on a single computer, and much of it is irrelevant to what I'm trying to accomplish. I wrote a script that selects related articles from Wikipedia in a manner similar to spreading activation. However, these articles aren't being stripped of markup the way articles pulled from a dump are.

Your Environment

bdewilde commented 6 years ago

Hi @Tyjch, there are several possible reasons for the behavior you've observed.