Iceloof / GoogleNews

Script for GoogleNews
https://pypi.org/project/GoogleNews/
MIT License

Update __init__.py #85

Closed kinther closed 2 years ago

kinther commented 2 years ago

Currently, every link returned by this package looks similar to:

news.google.com/./articles/CAIiEMOJAsGEHwx_2WzzLm2QVtQqFQgEKg0IACoGCAowrqkBMKBFMLKAAg?uo=CAUiZmh0dHBzOi8vd3d3LmZvcmJlcy5jb20vc2l0ZXMvam9ubWFya21hbi8yMDIyLzAxLzMxL2FwcGxlcy1ibG93b3V0LWVhcm5pbmdzLXByb3ZlLWl0cy1zaGFyZXMtYXJlLWNoZWFwL9IBAA&hl=en-US&gl=US&ceid=US%3Aen

That may not be ideal for many people who intend to scrape for links. By prepending https:// to the link so that it is only requested over HTTPS, and importing requests, we can follow the Google redirect and fetch the actual URL, which ends up being:

https://www.forbes.com/sites/jonmarkman/2022/01/31/apples-blowout-earnings-prove-its-shares-are-cheap/?sh=36000f192c73

Tested by installing the branch in a Python 3 virtual environment with:

pip install -e .
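
For readers who want to see the idea outside the diff, here is a minimal sketch of the approach described above. It is not the code from this PR: the names `to_https` and `resolve` are hypothetical, and it assumes Google answers the article link with an ordinary HTTP redirect that requests can follow, which may not hold for every link.

```python
def to_https(link: str) -> str:
    """Return the link with an explicit https:// scheme (hypothetical helper)."""
    if link.startswith(("http://", "https://")):
        return link
    # Links from the package look like 'news.google.com/./articles/...'
    return "https://" + link.lstrip("/")


def resolve(link: str, get=None, timeout: float = 10.0) -> str:
    """Follow redirects and return the final URL (hypothetical helper).

    `get` defaults to requests.get; it is injectable so the function can be
    exercised without network access.
    """
    if get is None:
        import requests  # third-party dependency: pip install requests
        get = requests.get
    response = get(to_https(link), allow_redirects=True, timeout=timeout)
    # After redirects are followed, response.url holds the last URL requested.
    return response.url
```

Calling `resolve("news.google.com/./articles/...")` issues one extra request to news.google.com per link, so the cost grows with the number of results.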

HurinHu commented 2 years ago

This is not a good thing to handle inside the library. It will raise another issue, such as ERROR 429: it creates more than ten requests to news.google.com, with a very high chance of getting blocked by Google. You can do it during post-processing of the data, with some delay interval between requests.
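
A rough sketch of what such post-processing could look like, for illustration only. The function name `resolve_with_delay`, its parameters, and the 2-second default are assumptions, not part of this library; `resolve_one` stands for whatever callable you use to follow a single link.

```python
import time
from typing import Callable, Dict, Iterable


def resolve_with_delay(
    links: Iterable[str],
    resolve_one: Callable[[str], str],
    delay_seconds: float = 2.0,  # assumed value; tune it to stay under rate limits
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Resolve links one at a time, pausing between consecutive requests."""
    resolved: Dict[str, str] = {}
    for index, link in enumerate(links):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        resolved[link] = resolve_one(link)
    return resolved
```

Keeping this step outside the library lets each caller pick a delay, or skip resolution entirely, without the package itself multiplying requests to news.google.com.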