Casualtek / Cyberwatch

Building a consolidated RSS feed for articles about cyberattacks

[BUG] Google News URL decode not working anymore #5

Open moehmeni opened 1 month ago

moehmeni commented 1 month ago

When decoding Google News URLs into their real article URLs, I am getting an error:

import base64
import re
from typing import Optional

# Ref: https://github.com/Casualtek/Cyberwatch/blob/8648e9ad646e708dd1d801d6e2ebb3c40539ffde/rss.py#L111
_ENCODED_URL_PREFIX = "https://news.google.com/rss/articles/"
_ENCODED_URL_RE = re.compile(
    rf"^{re.escape(_ENCODED_URL_PREFIX)}(?P<encoded_url>[^?]+)"
)
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')

def decode_google_news_url(url: str) -> Optional[str]:
    match = _ENCODED_URL_RE.match(url)
    encoded_text = match.groupdict()["encoded_url"]  # type: ignore
    encoded_text += (
        "==="  # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
    )
    decoded_text = base64.urlsafe_b64decode(encoded_text)

    match = _DECODED_URL_RE.match(decoded_text)
    primary_url = match.groupdict()["primary_url"]  # type: ignore
    primary_url = primary_url.decode()
    return primary_url

# Test the function
url = "https://news.google.com/rss/articles/CBMi2AFBVV95cUxQOHZlbFBOSXZDQTVDNWhibW9nMlUzaWpfbVRZaTNKMXd4VFNtQ2YxQWt2UmtDbHdia2xvbHZDMU03eXVabzFscDdMcHV4aGFnNW1zdU9zakVyaEFmMm1FVDVBRVotdktTbkJBOUFrT3dwNTY5bVNzZWRJQk1RT3l5SnBBeWdXS1laeVpwejQzN3luZjgwVjN0bFB5NkZSM2oxRXJ6Q0ItbDNMUDZJRTdEZXhjbUV1Z3NYMHdXV1hKV3N3YndWOVZjVE9uZlBGNkk0SS1mbTZ3b0Q?oc=5"
result = decode_google_news_url(url)
print("Result:", result)

This fails with:

    primary_url = match.groupdict()["primary_url"]  # type: ignore
AttributeError: 'NoneType' object has no attribute 'groupdict'

I think Google changed something recently, because this was working just yesterday.

Casualtek commented 1 month ago

Thanks. I noticed it as well. I'm working on a fix. Any suggestions will be welcome though ;)