Closed by bartaelterman 9 years ago
I found separate Python scripts using CLiPS' pattern.nl module. Maybe this library was used to map words to their lemmas.
Other option: the stripText method in the AbstractCrawler class of the old scrapers uses the Jsoup.parse method. According to the documentation, that would parse input such as:
"<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"
into
An example link
and hence drop the period at the end of the sentence.
The text will be cleaned like this:
import re
cleaned_text = ' '.join(re.findall(r'\w+', intext, flags=re.UNICODE)).encode('utf-8')
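To make the effect of that one-liner concrete, here is a minimal sketch of the same regex-based cleaning written for Python 3. The function name `clean_text` and the sample sentence are assumptions for illustration only, and the final `.encode('utf-8')` step is left out so the result stays a readable string.

```python
import re


def clean_text(intext):
    """Keep only runs of word characters, joined by single spaces.

    Punctuation and other non-word characters act as separators and
    are dropped; accented letters count as word characters.
    """
    return ' '.join(re.findall(r'\w+', intext, flags=re.UNICODE))


# Hypothetical sample input, not taken from the real articles.
sample = "An example link. Ma\u00f1ana, it's done!"
cleaned = clean_text(sample)
print(cleaned)
```

Note that this approach also splits on apostrophes, so "it's" becomes the two tokens "it" and "s". Whether that is acceptable depends on how the cleaned text is used afterwards.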
Certain characters will need to be replaced by an ASCII character (such as ñ by n, etc.).
We will need a list from the customer with all characters and their replacements.
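Once that list is available, the replacement step could look roughly like the sketch below. The `REPLACEMENTS` table and the `replace_chars` name are placeholders I made up; the real mapping has to come from the customer.

```python
# Hypothetical replacement table; the real list would come from the customer.
REPLACEMENTS = {
    "\u00f1": "n",  # ñ
    "\u00e9": "e",  # é
    "\u00fc": "u",  # ü
}


def replace_chars(text, replacements=REPLACEMENTS):
    """Replace each mapped character with its ASCII substitute.

    Characters that are not in the table are left untouched.
    """
    table = str.maketrans(replacements)
    return text.translate(table)


print(replace_chars("Ma\u00f1ana caf\u00e9"))
```

An explicit table keeps the behaviour predictable: only the characters the customer listed are changed, and anything unexpected stays visible in the output instead of being silently dropped.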
The text of the articles I received for migration into the app looks cleaned already: it contains no punctuation marks at all. Is this also required for newly scraped articles?