buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
https://github.com/buriy/python-readability
Apache License 2.0

Readability of MSN articles #181

Open rpdelaney opened 10 months ago

rpdelaney commented 10 months ago

I'm struggling to get this working with MSN news articles. Here's the approach I'm using:

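import requests
from bs4 import BeautifulSoup as bs
from readability import Document

# is_url() is a helper defined elsewhere (not shown here).

def fetch_url(url: str, timeout: int = 10) -> str: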
    """Get the content from a page at URL, if it is a URL."""
    if not is_url(url):
        return url

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = bs(response.content, "html.parser")

    return soup.get_text()

def summarize(content: str) -> str:
    """Take content and use readability to return a document summary."""
    doc = Document(content)

    title: str = doc.short_title()
    summary: str = bs(doc.summary(), "lxml").text

    return f"{title}\n{summary}"

This works well on every other news site I've tried, but not on MSN.

Example: with this URL, the title comes back as just "MSN" and the summary is empty.

Any suggestions?
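
One way to narrow this down is to check what the server actually returns before readability sees it. The sketch below uses only the standard library; `TextProbe` and `probe_html` are hypothetical names introduced for illustration, and the sample HTML string is made up. It reports the `<title>` and the amount of visible text in an HTML string. A possible explanation, not confirmed here, is that the response is a nearly empty shell that gets filled in by JavaScript, in which case the visible text count would be close to zero.

```python
from html.parser import HTMLParser


class TextProbe(HTMLParser):
    """Collect the <title> and the visible text of an HTML document."""

    # Elements whose contents are not visible article text.
    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.text_parts: list[str] = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def probe_html(html: str) -> tuple[str, int]:
    """Return (title, number of visible text characters) for an HTML string."""
    parser = TextProbe()
    parser.feed(html)
    parser.close()
    visible = " ".join(parser.text_parts)
    return parser.title.strip(), len(visible)


# Made-up example of a script-only page: a title, but no visible body text.
sample = (
    "<html><head><title>MSN</title>"
    "<script>window.boot = function () {};</script></head>"
    "<body><div id='root'></div></body></html>"
)
title, visible_chars = probe_html(sample)
print(title, visible_chars)
```

Running `probe_html(response.text)` on the real response would show whether the article text is present in the raw HTML at all. If it is not, the problem lies upstream of readability, and the page would need to be rendered or fetched from a different endpoint first.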