Keyword extraction problem with Vietnamese

AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.

MIT License


Keyword extraction problem with Vietnamese #7

Closed AndyTheFactory closed 9 months ago

AndyTheFactory commented 1 year ago

Issue by monday0rsunday Fri Dec 5 07:09:24 2014 Originally opened as https://github.com/codelucas/newspaper/issues/93

I am trying to use newspaper for Vietnamese news. Everything seems to work except keyword extraction. For example, with http://dantri.com.vn/xa-hoi/tphcm-thu-hoi-nha-cua-ong-tran-van-truyen-1002764.htm , the extracted keywords are: "s, nh, bn, v, vn, truyn, c, cho, tp, b, cn, trn, hi, tphcm, ca, l", but "s", "nh", "bn", "v", "vn", "truyn", etc. are meaningless words. I think HTML entities are the root of the problem. For example, the Vietnamese word 'truyền' appears in the HTML source with 'ề' written as an HTML entity, so the correct keyword is 'truyền', but the software extracts 'truyn'.
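To make the entity hypothesis concrete, here is a minimal sketch in Python 3. It assumes the page encodes 'ề' (U+1EC1) as the decimal reference `&#7873;`; the exact markup on the page is not shown in this thread. The `dropped` step is a hypothetical lossy cleanup, not newspaper's code. Decoding the reference restores the letter, while deleting it reproduces 'truyn'.

```python
import html
import re

# Assumption: the page encodes U+1EC1 (ề) as the decimal reference &#7873;.
raw = "truy&#7873;n"

# Decoding the reference restores the original character.
decoded = html.unescape(raw)

# Hypothetical lossy cleanup: deleting the reference instead of decoding it.
dropped = re.sub(r"&#?\w+;", "", raw)

print(decoded == "truy\u1ec1n")
print(dropped)
```

If decoding already happens correctly during parsing, the loss must occur in a later step.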

AndyTheFactory commented 1 year ago

Comment by codelucas Wed Feb 4 11:42:03 2015

@monday0rsunday good catch!

I just ran:

>>> from newspaper import Article
>>> url = 'http://dantri.com.vn/xa-hoi/tphcm-thu-hoi-nha-cua-ong-tran-van-truyen-1002764.htm'
>>> a = Article(url, language='vi')
>>> a.download()

>>> a.parse()
>>> print a.text
Về lý do thu hồi, theo UBND TP, việc giải ...

>>> a.nlp()
>>> a.keywords
[u'nh', u'c', u'b', u'cn', u'truyn', u'bn', u'ca', u'trn', u'l', u'vn', u'cho', u's', u'hi', u'tp', u'v', u'tphcm']

So it appears that the nlp(..) code is stripping all the diacritics from the Vietnamese text and reducing it to plain ASCII, even though a.text itself is parsed correctly.
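The symptom can be reproduced with a small Python 3 sketch. The function `ascii_only_words` is a hypothetical stand-in for an ASCII-only cleanup step and is not taken from newspaper's source. It assumes the text uses precomposed (NFC) characters, which is consistent with whole letters disappearing from the keywords above.

```python
import re


def ascii_only_words(text):
    """Hypothetical ASCII-only word splitter, for illustration only."""
    # With re.ASCII, \w matches only [a-zA-Z0-9_], so accented letters
    # are deleted together with punctuation.
    cleaned = re.sub(r"[^\w ]", "", text, flags=re.ASCII)
    return [word.lower() for word in cleaned.split()]


# "Trần Văn Truyền", written with escapes to avoid normalization surprises.
name = "Tr\u1ea7n V\u0103n Truy\u1ec1n"
print(ascii_only_words(name))
```

Each accented letter is dropped outright rather than replaced by its base letter, which matches fragments such as 'trn', 'vn' and 'truyn' in the keyword list.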

AndyTheFactory commented 1 year ago

Comment by codelucas Wed Feb 4 11:46:12 2015

OK, after a bit more digging the exact line where the stripping happens is here: https://github.com/codelucas/newspaper/blob/master/newspaper/nlp.py#L95

AndyTheFactory commented 1 year ago

Comment by codelucas Wed Feb 4 11:57:56 2015

After commenting out that line of code, your problem is fixed, but I still need to look into how this affects other examples, because that line may be important to the NLP algorithm.
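One possible middle ground, sketched here in Python 3 under stated assumptions, is to keep a cleanup step but make it Unicode-aware, so punctuation is still removed while accented letters survive. The function `unicode_words` is hypothetical and is not the fix that was applied to the library.

```python
import re
import unicodedata


def unicode_words(text):
    """Hypothetical Unicode-aware word splitter, for illustration only."""
    # Compose base letters and combining marks first, because standalone
    # combining marks are not matched by \w and would otherwise be deleted.
    text = unicodedata.normalize("NFC", text)
    # In Python 3, \w on str patterns matches Unicode letters and digits.
    cleaned = re.sub(r"[^\w ]", "", text)
    return [word.lower() for word in cleaned.split()]


# "Trần Văn Truyền, TP.HCM!" with the accented letters written as escapes.
sample = "Tr\u1ea7n V\u0103n Truy\u1ec1n, TP.HCM!"
print(unicode_words(sample))
```

Whether this preserves the behaviour the NLP step depends on for other languages would still need to be checked against existing examples.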

AndyTheFactory commented 1 year ago

Comment by Brandl Tue Mar 24 14:05:13 2015

This problem also appears in German text with umlauts:

print(artikel.keywords)
[..., u'erzhlen', ...]

The actual word is most likely "erzählen"...