DocNow / diffengine

track changes to the news, where news is anything with an RSS feed
MIT License

strange diffs getting tweeted #7

Open edsu opened 7 years ago

edsu commented 7 years ago

@ruebot noticed a series of odd updates like this, which led to the discovery that readability sometimes returns very little content. For example:


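import requests
import readability

# Fetch the raw HTML of the article.
html = requests.get("https://www.thestar.com/news/world/2017/01/11/uk-teen-charged-with-murder-of-7-year-old-girl.html").content
# Let readability pick out what it considers the main content.
doc = readability.Document(html)

# Print the extracted content as cleaned-up HTML.
print(doc.summary())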

returns (at the moment):

<html><body><div><div class="article__subheadline" data-reactid="93"><p data-reactid="94">The 15-year-old was remanded into secure accommodation on Wednesday and was also charged with possession of an offensive weapon. </p></div></div></body></html>

Perhaps there should be a configurable threshold below which the content will be ignored or at least not tweeted? Could readability be tuned in this case to return content that is more appropriate like the text of the AP press release?
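
A minimal sketch of what such a threshold could look like, assuming the check runs on the HTML string returned by doc.summary(). The names visible_text and is_substantial and the default of 200 characters are hypothetical, chosen only for illustration; they are not part of diffengine or readability.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def is_substantial(summary_html, min_chars=200):
    """Hypothetical check: True if the extracted text has at least min_chars characters."""
    return len(visible_text(summary_html)) >= min_chars
```

With a check like this, a version whose summary fails is_substantial could be skipped or stored without tweeting. The right value for min_chars would need tuning per feed.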

edsu commented 7 years ago

Should be interesting to compare Python readability with the JavaScript version that it is based on.

ruebot commented 7 years ago

Weird. I just ran the above, and I get:

<html><body><div><div class="nav-news-content" data-reactid="411"><h3 class="nav-news-headline" data-reactid="412"><span class="nav-news-category" data-reactid="413"/>The Good Wife spinoff The Good Fight takes on Donald Trump</h3><p class="nav-news-abstract show-for-medium-up" data-reactid="415">The Good Fight will feature themes critical about the new presidency as well as satirize the Liberal reaction. Trump seems to take opposition by entertainers more seriously than traditional press coverage, creator Robert King says</p></div></div></body></html>

If I view the source of https://www.thestar.com/news/world/2017/01/11/uk-teen-charged-with-murder-of-7-year-old-girl.html, I see that text in a large chunk of JavaScript here... or just Ctrl+F "Good Wife".

ruebot commented 7 years ago

Run the same thing a few minutes later and I get:

<html><body><div><div class="nav-news-content" data-reactid="411"><h3 class="nav-news-headline" data-reactid="412"><span class="nav-news-category" data-reactid="413"/>Late-night hosts mock Trump for calling Meryl Streep ‘overrated’ on Twitter </h3><p class="nav-news-abstract show-for-medium-up" data-reactid="415">On Jimmy Kimmel Live, there was an exchange between Ben Affleck and Kimmel about Trump's tweet with Affleck saying 'if you look up in the encyclopedia ‘great actress,’ it’s a picture of Meryl Streep.’ </p></div></div></body></html>

edsu commented 7 years ago

Wild. I guess now we know why we're getting the diffs. What the heck is going on? Could they be serving advertisements randomly to people?
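
One way to catch this kind of instability before it produces a diff might be to run the extraction a few times and check that the result does not change. This is only a sketch: fingerprint and is_stable are hypothetical helpers, and extract stands for any zero-argument callable that fetches the page and returns the extracted text.

```python
import hashlib


def fingerprint(text):
    """Hash the text after collapsing whitespace, so trivial spacing changes don't count."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_stable(extract, attempts=3):
    """Hypothetical check: call extract() several times and report whether every result matched."""
    seen = {fingerprint(extract()) for _ in range(attempts)}
    return len(seen) == 1
```

Here extract could wrap the requests.get and readability.Document(...).summary() calls from the snippet above. A page that fails is_stable would be a candidate for skipping rather than diffing.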

ruebot commented 7 years ago

Looks like it isn't actually grabbing the body consistently. I'm not seeing a way to really tweak Readability either.

.content() is going to return too much, and lead to more false positives, right?

ruebot commented 7 years ago

I've put the torstar account on pause until we can figure this one out since it's putting out so many false positives.

edsu commented 7 years ago

That's a wise move. Definitely leaving it open because I bet we run up against this type of issue with other sites.

ruebot commented 7 years ago

@edsu I think things have mostly resolved themselves with the recent commits. What we were seeing before is now coming through like this tweet: https://twitter.com/torstar_diff/status/842119916958453762 -- so maybe resolving #28 would fully resolve this issue?