codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
13.92k stars 2.11k forks

Any paper or algorithm description about text extraction? #665

Open whqwill opened 5 years ago

whqwill commented 5 years ago

Is there a paper or algorithm description covering how newspaper extracts text? I would like to understand the theory behind it. Thanks.

Ask149 commented 5 years ago

Hi @whqwill, can you please describe what you need in more detail? Thanks :)

whqwill commented 5 years ago

I mean how it decides which parts of a page are the 'main text' and, if possible, how that compares with other methods. @Ask149

bact commented 5 years ago

Not exactly for this newspaper lib, but the slides at this link are a very useful overview of the problem: Boilerplate Detection using Shallow Text Features http://www.l3s.de/%7Ekohlschuetter/boilerplate/
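For a rough idea of what "shallow text features" can mean in practice, here is a minimal sketch. It is not newspaper's implementation and not the trained classifier from the linked work; it only illustrates the general idea of scoring text blocks by word count and link density. The names `Block`, `BlockSplitter`, and `extract_main_text`, the tag list, and both thresholds are assumptions made up for this example.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Tags treated as block boundaries (an illustrative choice, not exhaustive).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "h4", "h5", "h6"}


@dataclass
class Block:
    words: int = 0
    link_words: int = 0
    text: str = ""

    @property
    def link_density(self) -> float:
        # Fraction of the block's words that sit inside <a> tags.
        return self.link_words / self.words if self.words else 0.0


class BlockSplitter(HTMLParser):
    """Splits HTML into text blocks and counts words and linked words."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = Block()
        self._in_link = 0

    def _flush(self):
        # Store the current block if it has any text, then start a new one.
        if self._current.words:
            self.blocks.append(self._current)
        self._current = Block()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        tokens = data.split()
        if not tokens:
            return
        self._current.words += len(tokens)
        if self._in_link:
            self._current.link_words += len(tokens)
        self._current.text = (self._current.text + " " + " ".join(tokens)).strip()

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by links.

    Both thresholds are hypothetical values picked for illustration.
    """
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return [
        b.text
        for b in splitter.blocks
        if b.words >= min_words and b.link_density <= max_link_density
    ]
```

With this heuristic, a navigation bar made only of links (link density 1.0) and a short "Share this" snippet would be dropped, while a long paragraph of unlinked prose would be kept. A fixed-threshold rule like this is far cruder than a learned classifier, so treat it as an illustration of the features only.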

whqwill commented 5 years ago

Oh, that is helpful. Thanks.

Haiqing

bact notifications@github.com wrote on Tue, Jan 22, 2019 at 11:28 AM:

Not exactly for this newspaper lib, but the slides at this link are a very useful overview of the problem: Boilerplate Detection using Shallow Text Features http://www.l3s.de/%7Ekohlschuetter/boilerplate/
