monday0rsunday opened this issue 9 years ago
@monday0rsunday good catch!
I just ran:
>>> from newspaper import Article
>>> url = 'http://dantri.com.vn/xa-hoi/tphcm-thu-hoi-nha-cua-ong-tran-van-truyen-1002764.htm'
>>> a = Article(url, language='vi')
>>> a.download()
>>> a.parse()
>>> print a.text
Về lý do thu hồi, theo UBND TP, việc giải ...
>>> a.nlp()
>>> a.keywords
[u'nh', u'c', u'b', u'cn', u'truyn', u'bn', u'ca', u'trn', u'l', u'vn', u'cho', u's', u'hi', u'tp', u'v', u'tphcm']
So it appears that the nlp() code is stripping all of the diacritics from the Vietnamese text and turning it into plain ASCII.
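For reference, here is a minimal sketch of the kind of ASCII-only whitelist that reproduces keywords like u'truyn' (an illustrative guess at the behaviour, not the actual code in nlp.py):

import re

def strip_to_ascii(text):
    # Removing every character outside a-z, 0-9 and space drops accented
    # letters entirely, so u'truyền' collapses to 'truyn'.
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

print(strip_to_ascii(u'truyền'))  # -> truyn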
OK, after a bit more digging, the exact line where the stripping happens is here: https://github.com/codelucas/newspaper/blob/master/newspaper/nlp.py#L95
After commenting out that line of code, your problem is fixed, but I'm going to have to look into how it affects other examples, because that line may be important to the NLP algorithm.
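If that line turns out to be a character whitelist like the sketch above, a Unicode-aware version along these lines would keep the accented letters while still removing punctuation (a sketch, not the actual patch applied to newspaper):

import re

def strip_keep_unicode(text):
    # \w matches accented letters such as 'ề' and 'ä' under re.UNICODE
    # (the default in Python 3), so diacritics survive the cleanup.
    return re.sub(r"[^\w ]", "", text.lower(), flags=re.UNICODE)

print(strip_keep_unicode(u'truyền'))  # -> truyền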
This problem also appears to happen in German text with umlauts:
print(artikel.keywords)
[..., u'erzhlen', ...]
The actual word is most likely "erzählen"...
I am trying to use newspaper for Vietnamese news. Everything seems to work well except keyword extraction. For example, with http://dantri.com.vn/xa-hoi/tphcm-thu-hoi-nha-cua-ong-tran-van-truyen-1002764.htm the extracted keywords are: "s, nh, bn, v, vn, truyn, c, cho, tp, b, cn, trn, hi, tphcm, ca, l", but "s", "nh", "bn", "v", "vn", "truyn", etc. are meaningless words. I think HTML entities are the root of the problem: for example, the Vietnamese word 'truyền' appears as an HTML entity in the page source, so the correct keyword should be 'truyền', but the software extracts 'truyn'.
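As a quick check of the entity hypothesis (a sketch: &#7873; is assumed here as the numeric entity for the letter 'ề', and html.unescape stands in for whatever decoding newspaper does), the entity decodes to the correct word, and the diacritic is only lost at the ASCII-stripping step discussed above:

import html
import re

decoded = html.unescape('truy&#7873;n')                # -> 'truyền'
stripped = re.sub(r"[^a-z0-9 ]", "", decoded.lower())  # -> 'truyn'
print(decoded, stripped)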