codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Chinese language for content #436

Open SamuelBWasserman opened 7 years ago

SamuelBWasserman commented 7 years ago

I just get Chinese characters for American websites, and even when I set language='en' it spits out gibberish.

codelucas commented 7 years ago

Please paste specific examples of the bugs so we can reproduce and help, thanks

SamuelBWasserman commented 7 years ago

Hello,

Thanks for responding! Since then I haven't been getting many foreign characters, although I still get some. I'll be sure to screenshot and show you my problem if it comes up again.

-Sam

-- Samuel Wasserman Rutgers University School of Arts and Sciences - Class of 2019 Computer Science & Mathematics

codelucas commented 7 years ago

👍 sounds good

SamuelBWasserman commented 7 years ago

Hello,

So I have attached a couple of screenshots. I am using newspaper to scrape news websites, and this is an example of listing all article titles. I am scraping sites such as TechCrunch, CNN, and Fox, and I randomly get foreign characters, as you can see.

I'm building the papers with this specification:

    tech_crunch = newspaper.build(
        'https://www.techcrunch.com/', memoize_articles=False, language='en')

I'm running this code:

    for news_article in source.articles:
        try:
            news_article.download()
            news_article.parse()
        except Exception:
            continue  # skip articles that fail to download or parse

Then I list the titles.

-Sam
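
One way to turn the screenshots into something reproducible is to flag the odd titles programmatically. The sketch below is not part of newspaper: `has_non_latin_letters` and `flag_titles` are hypothetical helper names, and the check is a rough heuristic that uses only the standard library's `unicodedata` to spot letters whose Unicode name does not start with "LATIN".

```python
import unicodedata


def has_non_latin_letters(text):
    """Return True if any alphabetic character has a Unicode name not starting with 'LATIN'."""
    for ch in text:
        if ch.isalpha():
            # unicodedata.name() raises ValueError for unnamed characters,
            # so pass a default and treat those as non-Latin.
            if not unicodedata.name(ch, "").startswith("LATIN"):
                return True
    return False


def flag_titles(titles):
    """Split titles into (latin_only, suspicious) lists, preserving order."""
    latin_only, suspicious = [], []
    for title in titles:
        if has_non_latin_letters(title):
            suspicious.append(title)
        else:
            latin_only.append(title)
    return latin_only, suspicious
```

After the download/parse loop, something like `flag_titles([a.title for a in source.articles])` would separate the titles worth pasting into the issue from the ones that look fine. Accented Latin letters (as in "café") are not flagged by this heuristic.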

harishaaram commented 7 years ago

Hi @codelucas

I'm facing the same issue too. This is the code I used:

    newspaper.build('http://www.huffingtonpost.com/', memoize_articles=False, language='en')

These are some of the non-English results that got parsed:

[Attached screenshots: lanerro2, langerro3, langerror]
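
Since the maintainer asked for specific examples, a text report may be easier to reproduce from than screenshots. The following is a minimal sketch under stated assumptions: `collect_reports` is a hypothetical helper, and it assumes each article object offers `download()`, `parse()`, `url`, and `title` as used earlier in this thread, plus a `meta_lang` attribute holding the language the page declares. The `getattr` fallback covers the case where that last attribute is absent.

```python
def collect_reports(articles, limit=20):
    """Download and parse up to `limit` articles; return one dict per success."""
    reports = []
    for article in articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose: this is a diagnostic sketch
            print(f"skipped {article.url}: {exc}")
            continue
        reports.append({
            "url": article.url,
            "title": article.title,
            # Assumed attribute; fall back to "" if the object lacks it.
            "meta_lang": getattr(article, "meta_lang", ""),
        })
    return reports
```

Passing the `articles` list from the `newspaper.build(...)` call above and printing the resulting dicts would give URLs and declared languages alongside each unexpected title, which should show whether the foreign-language pages come from the site itself or from the extraction step.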