Iceloof / GoogleNews

Script for GoogleNews
https://pypi.org/project/GoogleNews/
MIT License

Intermediate URL redirect #140

Open nmandic78 opened 4 months ago

nmandic78 commented 4 months ago

It looks like Google now only provides an intermediate URL that redirects to the real news site URL: 'news.google.com/articles/CBMiU2h0dHBzOi8vd3d3LnRoZXZlcmdlLmNvbS8yMDI0LzIvMTQvMjQwNzI3OTIvYXBwbGUtdmlzaW9uLXByby1lYXJseS1hZG9wdGVycy1yZXR1cm5z0gEA?hl=en-US&gl=US&ceid=US%3Aen'

I tried to get the redirected URL with requests, but it seems Google uses JavaScript, so this won't do; I just get to the consent page. I don't know how to tackle it without Selenium or similar, and that is overhead I don't want for my project. If someone has a solution or a pointer in the right direction, I will be grateful.
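For reference, a minimal sketch of the approach that fails (plain requests with redirects allowed; it ends up on the consent page rather than the article):

import requests

intermediate_url = "https://news.google.com/articles/CBMiU2h0dHBzOi8vd3d3LnRoZXZlcmdlLmNvbS8yMDI0LzIvMTQvMjQwNzI3OTIvYXBwbGUtdmlzaW9uLXByby1lYXJseS1hZG9wdGVycy1yZXR1cm5z0gEA?hl=en-US&gl=US&ceid=US%3Aen"

# requests follows HTTP redirects by default, but this page redirects via
# JavaScript, so we never reach the news site
resp = requests.get(intermediate_url)
print(resp.url)  # still a google.com URL (the consent page), not the article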

talhaanwarch commented 2 months ago

Try this:

urls = googlenews.get_links()

After getting the URLs of the news items, you have to resolve them one by one. Here is an example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def get_final_url(initial_url):
    # Configure Chrome options for headless mode
    chrome_options = Options()
    chrome_options.add_argument("--headless")

    # Set up Chrome WebDriver
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=chrome_options)

    try:
        # Open the initial URL
        driver.get(initial_url)

        # Wait until an <article> element is visible, i.e. the JavaScript
        # redirect has landed on the target news site
        wait = WebDriverWait(driver, 10)
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article")))
        return driver.current_url
    except TimeoutException:
        print("Timed out waiting for page to load")
        return None
    finally:
        # Close the WebDriver session
        driver.quit()

# Example usage:
initial_url = f"https://{urls[0]}"
final_url = get_final_url(initial_url)
if final_url:
    print("Final URL after content loaded:", final_url)

nmandic78 commented 2 months ago

@talhaanwarch, thank you. As I said, Selenium is overkill for my use case, so I dropped this lib and solved what I needed with the Bing Search API. Anyway, thank you, and maybe somebody will find your snippet useful. Regards.

deanm0000 commented 2 months ago

This is much simpler than it seems; you don't even need to BeautifulSoup it.

Use this:

def get_link_url(txt):
    # Find the fallback "Opening <a href=...>" markup in the intermediate page
    i = txt.find("Opening")
    # The real URL starts right after 'a href="' (8 characters past j)...
    j = txt.find("a href=", i)
    # ...and runs up to the closing quote
    k = txt.find('"', j + 8)
    return txt[j + 8 : k]

You still have to GET the intermediate URL, but if you do:

import requests

resp = requests.get(intermediate_url)
real_link = get_link_url(resp.text)

It relies on a bit of markup in the intermediate page that you're supposed to see if it doesn't redirect fast enough, which tells you it's "Opening" the article. You just use a normal Python find to look for that, then find where the URL begins immediately after it, then find where the URL ends, and extract it. Poof: no Selenium (or even bs4) required.
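To make the string search concrete, here is what get_link_url pulls out of a snippet shaped like that fallback markup (the URL is made up; the exact markup Google serves may differ):

sample = 'Opening <a href="https://example.com/some-article">https://example.com/some-article</a>'
print(get_link_url(sample))  # -> https://example.com/some-article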

HurinHu commented 2 months ago

Be aware that sending too many requests to Google may get you 429 errors. Each link is sent to Google first, and only then do you get the actual link.
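If you do start seeing 429s, one common mitigation is to throttle and retry with backoff. A minimal sketch (the delays are arbitrary; tune them to your volume):

import time
import requests

def get_with_backoff(url, retries=3, base_delay=5):
    for attempt in range(retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Honor Retry-After when it's a plain number of seconds,
        # otherwise fall back to exponential backoff
        retry_after = resp.headers.get("Retry-After", "")
        delay = int(retry_after) if retry_after.isdigit() else base_delay * 2 ** attempt
        time.sleep(delay)
    return resp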

deanm0000 commented 2 months ago

I ended up really wanting async support, so I wrote my own version, which skips the intermediate URL altogether. In doing this DIY, I'm not actually sure where the intermediate URL comes from, as the real URL is right there. It's not very pretty or typed, so it's not ready to be its own repo, but if somebody wants to clean it up and incorporate it here or publish it elsewhere, then please do:

import httpx
from bs4 import BeautifulSoup
from headers import HEADERS  # local headers.py exporting a HEADERS dict (see note below)

async def search_news(search_terms, date_range=None):
    # httpx URL-encodes query params itself, so don't quote_plus them here;
    # doing both double-encodes the search terms
    params = {"q": search_terms, "tbm": "nws"}
    if date_range is not None and isinstance(date_range, (list, tuple)):
        start_date = date_range[0].strftime("%m/%d/%Y")
        end_date = date_range[1].strftime("%m/%d/%Y")
        params["tbs"] = f"cdf:1,cd_min:{start_date},cd_max:{end_date}"
    # http2=True needs the h2 extra: pip install httpx[http2]
    async with httpx.AsyncClient(http2=True, headers=HEADERS) as dlclient:
        resp = await dlclient.get(
            "https://www.google.com/search",
            params=params,
        )
    # BeautifulSoup needs the body text, not the response object
    rbs = BeautifulSoup(resp.text, features="lxml")
    # Keep only external result links, dropping Google's own URLs
    links = [
        x
        for x in rbs.find_all("a")
        if "href" in x.attrs
        and "https" in x.attrs["href"]
        and "google" not in x.attrs["href"]
    ]

    pages = []
    for link in links:
        # hrefs look like /url?q=https://...&sa=...; slice out the real URL
        url = link.attrs["href"]
        url_begin = url.find("https")
        url = url[url_begin:].split("&")[0]
        # The first div-level string is the title; the rest are source/date/snippet
        misc = [x for x in link.find_all(string=True) if x.parent.name == "div"]
        if not misc:
            continue
        pages.append({"url": url, "title": misc[0], "misc": misc[1:]})
    return pages

It assumes you have a file headers.py with a dict of headers in a variable called HEADERS. Google doesn't actually seem to mind if you don't use browser headers, so it's probably superfluous.
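For anyone trying it out, the coroutine can be driven like this (the search terms and dates are just examples):

import asyncio
from datetime import date

pages = asyncio.run(
    search_news("apple vision pro", date_range=(date(2024, 2, 1), date(2024, 2, 29)))
)
for page in pages:
    print(page["url"], "-", page["title"])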