Hi @Tyjch, there are several possible reasons for the behavior you've observed:

- The `strip_markup()` function has only been tested on (and is only intended for) database dumps of Wikipedia articles, so it may not work well on text produced by the `wikipedia` package; I don't know if or how those formats differ. Do you know?
- The `strip_markup()` function is incorrectly stripping out the full text of a page. Maybe it doesn't handle all edge cases, or some bit of wikimedia formatting has recently changed, or there's just a bug that certain combinations of markup trigger. Hard to say, but the wikimedia format is very complex, and it's basically impossible to correctly parse every possible expression of it.
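For reference, here's a minimal sketch of the dump-based workflow that `strip_markup()` is meant for. The `Wikipedia` dataset class and the `texts()` parameters below are from memory of the 0.6.x API, so double-check them against your installed version:

```python
import textacy.datasets

# stream plain texts from an official Wikipedia database dump;
# strip_markup() is applied to each page internally along the way
ds = textacy.datasets.Wikipedia(lang="en", version="latest")
ds.download()  # fetches the articles dump (several GB)

for text in ds.texts(min_len=300, limit=3):
    print(text[:200])
```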
I am trying to generate text and metadata streams from Wikipedia. I am using the `wikipedia` package to retrieve article content. I then want to remove the MediaWiki markup, so I've tried using `strip_markup()` from `textacy.datasets.wikipedia`. For some articles this produces empty strings instead of correctly stripped text.
Expected Behavior
I was expecting the function to clean markup from text retrieved via the `wikipedia` package the same way it cleans articles pulled from a database dump.
Current Behavior
Empty strings are produced for some articles but not others.
Steps to Reproduce (for bugs)
```python
import wikipedia
from textacy.datasets.wikipedia import strip_markup

def generate_streams(page_titles):
    text_list = []
    meta_list = []
```
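The body of the function got cut off when I pasted it; a minimal completion consistent with the description above, with illustrative page titles and metadata fields, looks roughly like this:

```python
import wikipedia
from textacy.datasets.wikipedia import strip_markup

def generate_streams(page_titles):
    text_list = []
    meta_list = []
    for title in page_titles:
        page = wikipedia.page(title)
        # for some pages strip_markup() returns an empty string here
        text_list.append(strip_markup(page.content))
        meta_list.append({"title": page.title, "url": page.url})
    return text_list, meta_list

# example call: one text may come back fine while another comes back empty
texts, metas = generate_streams(["Natural language processing", "Linguistics"])
print([len(t) for t in texts])
```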
Context
I'm only interested in a subset of Wikipedia as a corpus: the full thing is too large to work with efficiently on a single computer, and a lot of it is irrelevant to what I'm trying to accomplish. I wrote a script that selects related articles from Wikipedia in a manner similar to spreading activation, as sketched below. However, markup isn't stripped consistently across the selected articles.
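A simplified sketch of the selection step (the decay and threshold values here are made up, and my actual script differs in the details):

```python
import wikipedia

def select_related(seed_titles, decay=0.5, threshold=0.25):
    """Spreading-activation-style selection over Wikipedia's link graph."""
    activation = {title: 1.0 for title in seed_titles}
    frontier = list(seed_titles)
    while frontier:
        title = frontier.pop()
        spread = activation[title] * decay
        if spread < threshold:
            continue  # too little energy left to follow this page's links
        try:
            page = wikipedia.page(title)
        except wikipedia.exceptions.WikipediaException:
            continue  # skip disambiguation pages, missing pages, etc.
        for link in page.links:
            if spread > activation.get(link, 0.0):
                activation[link] = spread
                frontier.append(link)
    return [t for t, a in activation.items() if a >= threshold]
```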
Your Environment
- spacy version: 2.0.9
- spacy models: 'en_core_web_sm' (I think; I have all the en models downloaded)
- textacy version: 0.6.1