Open SamuelBWasserman opened 7 years ago
Please paste specific examples of the bugs so we can reproduce and help, thanks
Hello,
Thanks for responding! I haven't been getting many foreign characters lately, although I still get some. I'll be sure to take screenshots and show you the problem if it comes up again.
-Sam
👍 sounds good
Hello,
So I have attached a couple of screenshots. I am using newspaper to scrape news websites, and this is an example of listing all article titles. I am scraping sites such as TechCrunch, CNN, Fox, etc., and I randomly get foreign characters, as you can see.
I'm building the papers with this call:

```python
tech_crunch = newspaper.build('https://www.techcrunch.com/',
                              memoize_articles=False, language='en')
```
Then I'm running this code:

```python
for news_article in tech_crunch.articles:
    try:
        news_article.download()
        news_article.parse()
    except Exception:
        continue  # skip articles that fail to download or parse
```

and then listing the titles (see the sketch below).
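For reference, a minimal sketch of that listing step, assuming the `tech_crunch` source from above. `parse()` fills in `meta_lang` from the page's `<meta>` tags, so filtering on it is one heuristic way to drop the stray non-English titles; the filter is a workaround, not built-in behavior:

```python
for news_article in tech_crunch.articles:
    # Skip articles that never parsed (empty title) and pages whose
    # declared language isn't English. meta_lang is set by parse()
    # from the page's meta tags, so this filter is only a heuristic.
    if not news_article.title:
        continue
    if news_article.meta_lang and news_article.meta_lang != 'en':
        continue
    print(news_article.title)
```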
-Sam
Hi @codelucas
I face the same issue too. This is the code I used:
```python
newspaper.build('http://www.huffingtonpost.com/',
                memoize_articles=False, language='en')
```
Some of the articles that got parsed are in languages other than English. I just get Chinese characters for American websites, and even when I set language='en' it spits out gibberish.
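In case it helps reproduce this: `build()` crawls every category page it discovers on the domain, and `language='en'` appears to affect parsing rather than which URLs get crawled, so foreign-language editions of a site can end up in the article list. A minimal sketch, assuming the huffingtonpost source above, to see what the crawler actually picked up (sampling only a few articles is just to keep the run short):

```python
import newspaper

paper = newspaper.build('http://www.huffingtonpost.com/',
                        memoize_articles=False, language='en')

# Category pages the crawler discovered; foreign-language editions
# of the site often appear here as separate subdomains or paths.
for url in paper.category_urls():
    print(url)

# Print the detected meta language next to each title for a sample
# of articles, to confirm where the non-English results come from.
for article in paper.articles[:20]:
    try:
        article.download()
        article.parse()
    except Exception:
        continue  # skip articles that fail to download or parse
    print(article.meta_lang, '|', article.title)
```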