codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Keyword extraction problem with vietnamese #93

Open monday0rsunday opened 9 years ago

monday0rsunday commented 9 years ago

I am trying to use newspaper for Vietnamese news. Everything seems to work except keyword extraction. For example, with http://dantri.com.vn/xa-hoi/tphcm-thu-hoi-nha-cua-ong-tran-van-truyen-1002764.htm , the extracted keywords are: "s, nh, bn, v, vn, truyn, c, cho, tp, b, cn, trn, hi, tphcm, ca, l", but "s", "nh", "bn", "v", "vn", "truyn", etc. are meaningless words. I think HTML entities are the root of the problem. For example, the Vietnamese word 'truyền' appears in the HTML as an entity-encoded string (e.g. 'truy&#7873;n'), so the correct keyword is 'truyền', but the software extracts 'truyn'.
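One way to test the entity hypothesis in isolation is the quick check below, using only Python 3's standard library. The numeric entity `&#7873;` is an assumption about how the page encodes 'ề' (U+1EC1); the real markup may differ. The sketch suggests that decoding an entity keeps the diacritic, and that a result like 'truyn' appears only once non-ASCII characters are dropped afterwards.

```python
import html

# Assumed entity-encoded form of the word; the real page markup may differ.
encoded = 'truy&#7873;n'

# Decoding the numeric character reference restores the accented letter.
decoded = html.unescape(encoded)
print(decoded)

# Dropping every non-ASCII character afterwards produces the broken keyword.
ascii_only = decoded.encode('ascii', 'ignore').decode('ascii')
print(ascii_only)  # truyn
```

If this holds for the real page, the entity itself is harmless and the loss happens in a later ASCII-only step.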

codelucas commented 9 years ago

@monday0rsunday good catch!

I just ran:

>>> from newspaper import Article
>>> url = 'http://dantri.com.vn/xa-hoi/tphcm-thu-hoi-nha-cua-ong-tran-van-truyen-1002764.htm'
>>> a = Article(url, language='vi')
>>> a.download()

>>> a.parse()
>>> print a.text
Về lý do thu hồi, theo UBND TP, việc giải ...

>>> a.nlp()
>>> a.keywords
[u'nh', u'c', u'b', u'cn', u'truyn', u'bn', u'ca', u'trn', u'l', u'vn', u'cho', u's', u'hi', u'tp', u'v', u'tphcm']

So the text itself is parsed correctly, but the nlp() code appears to strip every accented character from the Vietnamese words, leaving only the ASCII letters.

codelucas commented 9 years ago

OK, after a bit more digging the exact line where the stripping happens is here: https://github.com/codelucas/newspaper/blob/master/newspaper/nlp.py#L95

codelucas commented 9 years ago

After commenting out that line of code your problems are fixed, but I'm going to have to look into how it affects other examples because that line may be important to the NLP algorithm.
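A minimal Python 3 sketch of the kind of stripping that would produce these keywords, plus one possible Unicode-aware alternative. The function names `split_words_ascii` and `split_words_unicode` are hypothetical, and the ASCII-only regex is an assumption about what the linked line does, not a copy of nlp.py.

```python
import re
import unicodedata

def split_words_ascii(text):
    # Hypothetical stand-in for the stripping step: with re.ASCII,
    # \w matches only [a-zA-Z0-9_], so accented letters are deleted.
    stripped = re.sub(r'[^\w ]', '', text, flags=re.ASCII)
    return [w.lower() for w in stripped.split()]

def split_words_unicode(text):
    # Compose base letters and combining marks first (NFC), so each
    # accented letter is a single code point that \w can match.
    text = unicodedata.normalize('NFC', text)
    # In Python 3, \w is Unicode-aware for str patterns by default;
    # the flag is spelled out here only to make the contrast explicit.
    stripped = re.sub(r'[^\w ]', '', text, flags=re.UNICODE)
    return [w.lower() for w in stripped.split()]

sample = 'ông Trần Văn Truyền'
print(split_words_ascii(sample))    # ['ng', 'trn', 'vn', 'truyn']
print(split_words_unicode(sample))
```

The ASCII-only version reproduces fragments such as 'trn', 'vn' and 'truyn' from the keyword list above, while the Unicode-aware version keeps the words intact. Whether punctuation stripping of this kind matters to the rest of the keyword scoring is not something the sketch addresses.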

Brandl commented 9 years ago

This problem also appears in German text with umlauts:

print(artikel.keywords)
[..., u'erzhlen', ...]

The actual word is most likely "erzählen"...