codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Downloaded article from Google RSS News only return Google Images #1003

Open andisoer opened 3 months ago

andisoer commented 3 months ago

For the last few days, parsing a Google News RSS link with

from newspaper import Article
article = Article('https://news.google.com/rss/articles/CBMiTmh0dHBzOi8vd3d3Lm55dGltZXMuY29tLzIwMjQvMDcvMjEvdXMvcG9saXRpY3MvdmFuY2UtdHJ1bXAtY2FtcGFpZ24tcmFsbHkuaHRtbNIBAA?oc=5&hl=en-ID&gl=ID&ceid=ID:en')
article.download()
article.parse()
print(article.title)
print(article.top_image)

only returns the Google RSS image, which is

https://lh3.googleusercontent.com/J6_coFbogxhRI9iM864NL_liGXvsQp2AupsKei7z0cNNfDvGUmWUy20nuUhkREQyrpY4bEeIBuc=s0-w300

and the title

Google News

instead of the original article's image and title. Is this an issue with the parser, or has something changed in Google News RSS?

sunitab55 commented 2 months ago

I just tried something with LinkedIn newsletters and it doesn't capture anything :/

Ronkiro commented 1 month ago

There was an update on Google's side. The ID after /articles/ used to be a base64 string encoding the original website's URL. Since July that has changed, and the ID no longer contains the real URL (the community doesn't seem to know how to parse it, btw).
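For anyone who wants to check whether a given link still uses the old format, here is a minimal sketch. The function name `decode_legacy_google_news_url` is made up for illustration, and the approach (base64-decode the ID, then search the bytes for an embedded `http(s)://` URL) is an assumption based on how old-style IDs such as the one in the original report look, not a documented Google format.

```python
import base64
import re
from urllib.parse import urlparse


def decode_legacy_google_news_url(rss_url):
    """Hypothetical helper: return the publisher URL embedded in an
    old-style Google News article ID, or None if none is found."""
    # The article ID is the last path segment, before the query string.
    article_id = urlparse(rss_url).path.rsplit("/", 1)[-1]
    # The IDs come without base64 padding, so restore it before decoding.
    padded = article_id + "=" * (-len(article_id) % 4)
    try:
        raw = base64.urlsafe_b64decode(padded)
    except ValueError:
        return None
    # Old-style IDs wrap the URL in a few binary bytes; grab the first
    # run of printable ASCII that starts with http:// or https://.
    match = re.search(rb"https?://[\x21-\x7e]+", raw)
    return match.group(0).decode("ascii") if match else None
```

Called on the link from the original report, this should give back the nytimes.com article URL. A `None` result suggests the ID is in the newer format and has to be resolved some other way.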

Here's a reference: https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e

A community member has a Python implementation of this code -> https://github.com/SSujitX/google-news-url-decoder/blob/main/googlenewsdecoder/new_decoderv1.py

It asks Google for the URL, though, so it may hit some 429s (which are very annoying). But I found no solution other than doing that before sending the URL to newspaper3k.
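One way to soften the 429s is to retry with exponential backoff. The sketch below is only an illustration: `RateLimited`, `resolve_with_backoff`, and the `resolve` callable are hypothetical names, and it assumes you wrap whatever decoder you use so that it raises `RateLimited` when Google answers with HTTP 429.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by a resolver when it gets HTTP 429."""


def resolve_with_backoff(resolve, url, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call resolve(url), waiting base_delay * 2**attempt seconds after each
    RateLimited error; the last failure is re-raised to the caller."""
    for attempt in range(retries):
        try:
            return resolve(url)
        except RateLimited:
            if attempt == retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... (with the defaults) before retrying.
            sleep(base_delay * 2 ** attempt)
```

The URL this returns can then be passed to newspaper3k's Article as usual. Spacing out requests in the first place probably helps more than retrying, since the backoff only reacts after the limit has been hit.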