codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
13.89k stars 2.1k forks

the API doesn't work #961

Open androidAppMe opened 1 year ago

androidAppMe commented 1 year ago

Hi all. I was using newspaper3k and it was working fine, but today it stopped working and returns empty text. Does anyone have any ideas?

cattydev commented 1 year ago

I'm having this problem too. Any fixes?

GalKaplun commented 1 year ago

same here

cattydev commented 1 year ago

I figured out that the API stopped working on Google RSS article links. It was working until 2nd February.

banagale commented 1 year ago

Would you please provide a sample implementation or a link to a Google RSS feed that is now broken, so the error can be reproduced?

On Feb 14, 2023, at 7:56 AM, Rıdvan wrote: i figured out that the api stopped working on google rss article links it was working until 2nd february

cattydev commented 1 year ago

Would you please provide a sample implementation or a link to a Google RSS feed that is now broken, so the error can be reproduced?

This Google URL, which redirects to the original article link, is taken from the Google News RSS feed. For example, when I scrape (getContent) that Google URL with newspaper, the top image I get is the og:image of the Google URL that redirects to the original article link: top image

androidAppMe commented 1 year ago

Now I'm having a problem with the "trafilatura" API as well. I can't get the body of the news with trafilatura either, which was working fine before!

cattydev commented 1 year ago

Now I'm having a problem with the "trafilatura" API as well. I can't get the body of the news with trafilatura either, which was working fine before!

are you using google rss too?

androidAppMe commented 1 year ago

Yes, I'm extracting the RSS with pygooglenews, but I can't parse it. Has anybody found a solution? I tried getting the news directly from the Google News page, but it keeps blocking my IP.

cattydev commented 1 year ago

I didn't have an issue with parsing Google News; the problem was Google's redirect to the original page. I solved it by adding this before calling newspaper's getcontent function:

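import time

import requests

# Fetch the Google News URL; requests follows the redirect automatically.
r = requests.get("google news url taken from google rss")
time.sleep(1)  # pause for one second
# r.url is now the redirected URL of the original article
article_url = r.url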
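
The redirect step above could be wrapped in a small helper and combined with newspaper, roughly as sketched below. This is only a sketch under assumptions: the names resolve_article_url and extract_article are hypothetical, the `get` parameter exists only so the helper can be exercised without network access, and it assumes the Google News link still answers with an HTTP redirect that requests can follow.

```python
import time


def resolve_article_url(google_url, get=None, delay=1.0, timeout=10):
    """Follow the redirect from a Google News RSS link and return the final URL.

    `get` defaults to requests.get; it is a parameter (an assumption of this
    sketch) so the function can be tested with a stand-in.
    """
    if get is None:
        import requests  # imported lazily so the helper can be tested offline
        get = requests.get
    response = get(google_url, timeout=timeout, allow_redirects=True)
    time.sleep(delay)  # same short pause as in the snippet above
    return response.url  # URL after redirects have been followed


def extract_article(google_url):
    """Resolve the redirect first, then let newspaper parse the real article."""
    from newspaper import Article  # newspaper3k

    article = Article(resolve_article_url(google_url))
    article.download()
    article.parse()
    return article.text, article.top_image


# Usage (needs network access and a real link taken from the RSS feed):
# text, top_image = extract_article("google news url taken from google rss")
```

Passing the resolved URL to newspaper, instead of the Google URL, is meant to avoid extracting text and og:image from Google's intermediate page.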

huksley commented 1 year ago

Here is a way to decode Google RSS URLs without making a round trip to Google (https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e)

Sorry, but it is in JavaScript.