Text of old articles looks cleaned

Datafable / epu-index

EPU index

http://www.applieddatamining.com/cms/?q=content/economic-policy-uncertainty-index

Text of old articles looks cleaned #55

Closed bartaelterman closed 9 years ago

bartaelterman commented 9 years ago

The text of the articles I received for migration into the app looks cleaned: it contains no punctuation marks at all. Is this cleaning also required for newly scraped articles?

bartaelterman commented 9 years ago

I found separate Python scripts that use CLiPS' pattern.nl module. Maybe this library was used to map words to their lemmas.

bartaelterman commented 9 years ago

Another option: the stripText method in the AbstractCrawler class of the old scrapers uses a Jsoup.parse method. According to the documentation, that would parse input such as:

"<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"

into

An example link

Hence it drops the period at the end of the sentence.

bartaelterman commented 9 years ago

Text will be cleaned like this:

        import re
        cleaned_text = ' '.join(re.findall(r'\w+', intext, flags=re.UNICODE)).encode('utf-8')
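To illustrate what this cleaning does to a sentence, here is a minimal sketch in Python 3. The function name clean_text is hypothetical and not part of the project. Unlike the line above, it returns a str and leaves out the final UTF-8 encoding step.

```python
import re


def clean_text(intext):
    """Keep only runs of word characters, joined by single spaces."""
    # In Python 3, str patterns are Unicode-aware by default,
    # so accented letters such as ñ count as word characters.
    return ' '.join(re.findall(r'\w+', intext))


print(clean_text("An example link."))   # An example link
print(clean_text("Señor, it's 3.5%!"))  # Señor it s 3 5
```

Note that punctuation inside tokens is dropped too, so "it's" becomes "it s" and "3.5" becomes "3 5".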

bartaelterman commented 9 years ago

Certain characters will need to be replaced by an ASCII character (such as ñ by n).

bartaelterman commented 9 years ago

We will need a list from the customer with all characters and their replacements.
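Once such a list is available, applying it could look roughly like the sketch below. The REPLACEMENTS table and the replace_chars name are placeholders for illustration only; the actual characters and replacements would have to come from the customer.

```python
# Hypothetical replacement table; the real one must come from the customer.
REPLACEMENTS = {
    '\u00f1': 'n',  # ñ
    '\u00e9': 'e',  # é
    '\u00eb': 'e',  # ë
    '\u00fc': 'u',  # ü
}

# str.maketrans accepts a dict that maps single characters to strings.
TABLE = str.maketrans(REPLACEMENTS)


def replace_chars(text, table=TABLE):
    """Replace each character listed in the table; leave all others untouched."""
    return text.translate(table)


print(replace_chars('Espa\u00f1a'))  # Espana
```

Characters that are not in the table pass through unchanged, so any gaps in the customer's list would show up as leftover non-ASCII characters in the output.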