A Python package that scrapes Google News article data while remaining undetected by Google. Our scraper can scrape page data through the last results page without triggering a CAPTCHA (download stats: https://pepy.tech/project/GoogleNewsScraper).
For stability reasons, we want to stop using find_elements_by_class_name and div[contains(@class)]. Google changes its class names regularly, and every change breaks our script.
We should be able to select what we need using one of the following:
select by id (preferred method, as ids are unlikely to change)
select by tag name (for example, <img/> for the image_url and <a/> for the url can certainly be used)
select by tag position (for example, we know the text content we want is under a > div > div > [div, div, div], where the three divs contain the source, title, and description we need)
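The tag-name and tag-position options above can be sketched without any class names. This is a minimal illustration only: SAMPLE_CARD is a hypothetical, simplified stand-in for one result card (the real Google News markup differs and changes often), parse_card is an illustrative helper name, and the standard library's ElementTree stands in for the browser driver so the snippet runs on its own. In the scraper itself the same relative paths would presumably be passed to Selenium's XPath and tag-name locators.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a single result card (an assumption, not the real page).
# The class names are deliberately meaningless: nothing below depends on them.
SAMPLE_CARD = """
<div class="xYz12">
  <a href="https://example.com/article">
    <div class="qRs34">
      <img src="https://example.com/thumb.jpg" />
      <div class="tUv56">
        <div>Example Source</div>
        <div>Example Title</div>
        <div>Example description text.</div>
      </div>
    </div>
  </a>
</div>
"""


def parse_card(card_markup):
    """Extract article fields using only tag names and tag positions."""
    card = ET.fromstring(card_markup)

    # Select by tag name: the first <a/> holds the url, the first <img/> the image_url.
    link = card.find(".//a")
    image = card.find(".//img")

    # Select by tag position: a > div > div > [div, div, div].
    text_divs = link.findall("./div/div/div")
    source, title, description = (div.text.strip() for div in text_divs[:3])

    return {
        "url": link.get("href"),
        "image_url": image.get("src"),
        "source": source,
        "title": title,
        "description": description,
    }


article = parse_card(SAMPLE_CARD)
```

Positional paths like this survive class-name changes, but they would still break if Google adds or removes a wrapper element, so they are best kept in one place where they are easy to update.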
@karlgunst It looks like there are not many ids to go around for this one, so I will change all of the class names we are using to full XPaths and tag names.