Open edsu opened 7 years ago
Should be interesting to compare Python readability with the JavaScript version that it is based on.
Weird. I just ran the above, and I get:
<html><body><div><div class="nav-news-content" data-reactid="411"><h3 class="nav-news-headline" data-reactid="412"><span class="nav-news-category" data-reactid="413"/>The Good Wife spinoff The Good Fight takes on Donald Trump</h3><p class="nav-news-abstract show-for-medium-up" data-reactid="415">The Good Fight will feature themes critical about the new presidency as well as satirize the Liberal reaction. Trump seems to take opposition by entertainers more seriously than traditional press coverage, creator Robert King says</p></div></div></body></html>
If I view the source of https://www.thestar.com/news/world/2017/01/11/uk-teen-charged-with-murder-of-7-year-old-girl.html, I see that text in a large chunk of JavaScript (or just Ctrl+F "Good Wife").
Run the same thing a few minutes later and I get:
<html><body><div><div class="nav-news-content" data-reactid="411"><h3 class="nav-news-headline" data-reactid="412"><span class="nav-news-category" data-reactid="413"/>Late-night hosts mock Trump for calling Meryl Streep ‘overrated’ on Twitter </h3><p class="nav-news-abstract show-for-medium-up" data-reactid="415">On Jimmy Kimmel Live, there was an exchange between Ben Affleck and Kimmel about Trump's tweet with Affleck saying 'if you look up in the encyclopedia ‘great actress,’ it’s a picture of Meryl Streep.’ </p></div></div></body></html>
Wild. I guess now we know why we're getting the diffs. What the heck is going on? Could they be serving advertisements randomly to people?
Looks like it isn't actually grabbing the body consistently. I'm not seeing a way to really tweak Readability either.
.content() is going to return too much and lead to more false positives, right?
I've put the torstar account on pause until we can figure this one out since it's putting out so many false positives.
That's a wise move. Definitely leaving it open because I bet we run up against this type of issue with other sites.
@edsu I think things have resolved themselves for the most part with the recent commits. What we were seeing before is now coming through like this tweet: https://twitter.com/torstar_diff/status/842119916958453762 -- So, maybe resolving #28 might fully resolve this issue?
@ruebot noticed a series of odd updates like this which led to the discovery that readability returns very little content sometimes. For example:
returns (at the moment):
Perhaps there should be a configurable threshold below which the content will be ignored or at least not tweeted? Could readability be tuned in this case to return content that is more appropriate like the text of the AP press release?
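A rough sketch of what such a threshold check might look like, using only the standard library. The names `MIN_CONTENT_CHARS`, `visible_text`, and `is_substantial` are hypothetical, as is the 500-character default; none of them come from readability or from this project, and the right cutoff would need tuning per site.

```python
from html.parser import HTMLParser

# Hypothetical default: extractions with less visible text than this are
# treated as too thin to diff or tweet. The value is a guess, not a tuned one.
MIN_CONTENT_CHARS = 500


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text content of `html` with whitespace collapsed."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def is_substantial(html, min_chars=MIN_CONTENT_CHARS):
    """True if the extracted HTML has at least `min_chars` of visible text."""
    return len(visible_text(html)) >= min_chars


# A teaser-sized extraction, similar in shape to the output quoted earlier.
teaser = "<html><body><div><h3>Headline</h3><p>Short teaser</p></div></body></html>"
if not is_substantial(teaser):
    print("skipping: extracted content is below the threshold")
```

A caller could run this on whatever readability returns and skip the diff (or at least the tweet) when it comes back false. Note that this simple collector also counts text inside script or style elements, so it assumes the extraction has already stripped those.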