Closed kinther closed 2 years ago
This is not a good way to process these links inside the library. It will raise other issues such as ERROR 429 (Too Many Requests): it creates more than ten requests to news.google.com, with a very high chance of getting blocked by Google. You can do this during post-processing of the data instead, with a delay interval between requests.
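One way to read the suggestion above is to resolve links outside the library, one at a time, pausing between requests. The following is only a minimal sketch of that idea: `resolve_with_delay`, its parameters, and the 2-second default are hypothetical names and values chosen for illustration and are not part of this package. The `resolve` argument is any callable that turns one Google News link into its final URL.

```python
import time
from typing import Callable, Iterable, List


def resolve_with_delay(
    links: Iterable[str],
    resolve: Callable[[str], str],
    delay_seconds: float = 2.0,  # assumed value; tune to stay under rate limits
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Resolve links sequentially, sleeping between consecutive requests."""
    resolved = []
    for index, link in enumerate(links):
        # Pause before every request except the first one.
        if index > 0:
            sleep(delay_seconds)
        resolved.append(resolve(link))
    return resolved
```

Passing `sleep` as a parameter keeps the pacing easy to test or swap out. A fixed delay does not guarantee that Google will not return 429; it only lowers the request rate.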
Currently, every link returned by this package looks similar to:
news.google.com/./articles/CAIiEMOJAsGEHwx_2WzzLm2QVtQqFQgEKg0IACoGCAowrqkBMKBFMLKAAg?uo=CAUiZmh0dHBzOi8vd3d3LmZvcmJlcy5jb20vc2l0ZXMvam9ubWFya21hbi8yMDIyLzAxLzMxL2FwcGxlcy1ibG93b3V0LWVhcm5pbmdzLXByb3ZlLWl0cy1zaGFyZXMtYXJlLWNoZWFwL9IBAA&hl=en-US&gl=US&ceid=US%3Aen
This may not be ideal for people who want to scrape the actual article links. By prepending https:// to the link so that it is only ever requested over HTTPS, and importing requests, we can follow the Google redirect and fetch the actual URL, which ends up being:
https://www.forbes.com/sites/jonmarkman/2022/01/31/apples-blowout-earnings-prove-its-shares-are-cheap/?sh=36000f192c73
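The approach described above can be sketched roughly as follows. This is an illustration rather than the code in this branch: `force_https` and `resolve_redirect` are hypothetical helper names, and the 10-second timeout is an assumed value. It relies on `requests.get` following redirects and exposing the final address as `response.url`; whether Google still redirects these links to the publisher is not something the sketch can guarantee.

```python
def force_https(link: str) -> str:
    """Return the link with an explicit https:// scheme."""
    if link.startswith("https://"):
        return link
    if link.startswith("http://"):
        # Upgrade plain HTTP to HTTPS.
        return "https://" + link[len("http://"):]
    # Scheme-less link such as "news.google.com/./articles/...".
    return "https://" + link.lstrip("/")


def resolve_redirect(link: str, timeout: float = 10.0) -> str:
    """Follow redirects from a Google News link and return the final URL."""
    import requests  # third-party dependency, imported only when needed

    response = requests.get(force_https(link), timeout=timeout, allow_redirects=True)
    return response.url
```

Each call to `resolve_redirect` makes at least one network request, so calling it for every result in a search is what leads to the rate-limiting concern raised in the review.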
Tested by installing the branch in a Python 3 virtual environment with:
pip install -e .