Closed by AndyTheFactory 9 months ago
Comment by codelucas Wed Feb 4 11:42:03 2015
@monday0rsunday good catch!
I just ran:
>>> from newspaper import Article
>>> url = 'http://dantri.com.vn/xa-hoi/tphcm-thu-hoi-nha-cua-ong-tran-van-truyen-1002764.htm'
>>> a = Article(url, language='vi')
>>> a.download()
>>> a.parse()
>>> print a.text
Về lý do thu hồi, theo UBND TP, việc giải ...
>>> a.nlp()
>>> a.keywords
[u'nh', u'c', u'b', u'cn', u'truyn', u'bn', u'ca', u'trn', u'l', u'vn', u'cho', u's', u'hi', u'tp', u'v', u'tphcm']
So it appears that the nlp(..)
code is stripping all the diacritics from the Vietnamese text and reducing it to plain ASCII.
Comment by codelucas Wed Feb 4 11:46:12 2015
OK, after a bit more digging the exact line where the stripping happens is here: https://github.com/codelucas/newspaper/blob/master/newspaper/nlp.py#L95
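For illustration, here is a minimal sketch of the kind of ASCII-only filter that would produce exactly these symptoms (this is an assumption about what the linked line does, not a copy of it): any character outside A-Za-z and space is deleted, so accented letters vanish instead of being preserved.

```python
import re

# Hypothetical ASCII-only filter, similar in spirit to the linked line:
# everything outside A-Z, a-z, and spaces is simply deleted.
def ascii_only(text):
    return re.sub(r'[^A-Za-z ]', '', text)

# Diacritic-bearing characters are not in A-Za-z, so they disappear.
print(ascii_only('truyền'))    # Vietnamese 'truyền' -> 'truyn'
print(ascii_only('erzählen'))  # German 'erzählen' -> 'erzhlen'
```

Both results match the garbled keywords reported above, which is consistent with an ASCII filter being the culprit.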
Comment by codelucas Wed Feb 4 11:57:56 2015
After commenting out that line of code your problems are fixed, but I'm going to have to look into how it affects other examples because that line may be important to the NLP algorithm.
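Rather than removing the line outright, one possible direction (a sketch only, not the fix actually shipped in newspaper) is to make the word filter Unicode-aware, so letters with diacritics survive while punctuation is still dropped:

```python
import re

# Unicode-aware tokenization sketch: in Python 3, \w matches letters
# (including accented ones) and digits, so diacritics are preserved
# while punctuation and whitespace still act as separators.
def split_words(text):
    return re.findall(r'\w+', text, re.UNICODE)

print(split_words('truyền, erzählen!'))  # ['truyền', 'erzählen']
```

This keeps the filtering behavior the NLP code may rely on (discarding punctuation) without destroying non-ASCII text.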
Comment by Brandl Tue Mar 24 14:05:13 2015
This problem also appears in German text with umlauts:
print(artikel.keywords)
[..., u'erzhlen', ...]
The actual word is most likely "erzählen".
Issue by monday0rsunday Fri Dec 5 07:09:24 2014 Originally opened as https://github.com/codelucas/newspaper/issues/93
I tried to use newspaper for Vietnamese news; everything seems to work except keyword extraction. For example, with http://dantri.com.vn/xa-hoi/tphcm-thu-hoi-nha-cua-ong-tran-van-truyen-1002764.htm , the extracted keywords are: "s, nh, bn, v, vn, truyn, c, cho, tp, b, cn, trn, hi, tphcm, ca, l", but "s", "nh", "bn", "v", "vn", "truyn", etc. are meaningless words. I think HTML entities are the root of the problem: for example, the Vietnamese word 'truyền' is encoded in the page as an HTML entity, and the correct keyword is 'truyền', but the software extracts 'truyn'.
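The HTML-entity theory is easy to check in isolation: Python's standard-library `html.unescape` (not part of newspaper) decodes numeric character references back to accented characters. The entity `&#7873;` for 'ề' below is an illustrative example, not taken from the page source:

```python
import html

# &#7873; is the numeric character reference for U+1EC1 ('ề').
# Decoding it restores the full word instead of the stripped 'truyn'.
decoded = html.unescape('truy&#7873;n')
print(decoded)  # truyền
assert decoded == 'truy\u1ec1n'
```

If the downloaded article text still contains raw entities at keyword-extraction time, decoding them first would restore the diacritics; if the text is already decoded, the loss must happen later, in the nlp() filtering step discussed above.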