alexdebril / feed-io

A PHP library to read and write feeds in JSONFeed, RSS or Atom format
https://alexdebril.github.io/feed-io/
MIT License

Washington post feeds with time diff don't return. #181

Closed rlewkowicz closed 6 years ago

rlewkowicz commented 6 years ago

With most RSS feeds, I fetch the feed and record a timestamp. On the next run, I fetch all articles published since the last run. With Washington Post feeds the result is just empty, although the same approach works with every other feed.

You could pick any of these: https://www.washingtonpost.com/rss-feeds/2014/08/04/ab6f109a-1bf7-11e4-ae54-0cfe1f974f8a_story.html?noredirect=on&utm_term=.785babb51f11

I tried "Dealing with missing timezones" from the readme, but I don't think that was it.
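For reference, the "everything since the last run" workflow described above can be sketched in plain PHP. This is only an illustration under assumptions: `itemsSince` and the array shape of the items are hypothetical and are not feed-io's API.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of feed-io): keep only the items
 * published strictly after the previous run.
 *
 * @param array<int, array{title: string, date: ?DateTimeImmutable}> $items
 * @return array<int, array{title: string, date: ?DateTimeImmutable}>
 */
function itemsSince(array $items, DateTimeImmutable $lastRun): array
{
    return array_values(array_filter(
        $items,
        // Undated items are dropped in this sketch; keeping them is an equally valid choice.
        static fn (array $item): bool => $item['date'] !== null && $item['date'] > $lastRun
    ));
}

// Made-up sample data, only there to exercise the helper.
$lastRun = new DateTimeImmutable('2018-06-18T00:00:00+00:00');
$items = [
    ['title' => 'Older article', 'date' => new DateTimeImmutable('2018-06-17T09:00:00+00:00')],
    ['title' => 'Newer article', 'date' => new DateTimeImmutable('2018-06-19T09:00:00+00:00')],
    ['title' => 'Undated article', 'date' => null],
];

foreach (itemsSince($items, $lastRun) as $item) {
    echo "new item : {$item['title']}\n";
}
```

An empty result from a filter like this only says that no dated item is newer than the last run. It cannot tell "no new articles" apart from "the feed could not be fetched at all".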

alexdebril commented 6 years ago

Maybe it's related to the user-agent. I don't get any feed using the CLI:

./bin/feedio read https://www.washingtonpost.com/blogs/monkey-cage/feed

outputs

malformed xml string. parsing error : DOMDocument::loadXML(): Specification mandate value for attribute itemscope in Entity, line: 1 (2) 

Maybe you could try using a custom user-agent like in this example : https://github.com/alexdebril/feed-io/blob/c22ab80b6ca9c9af08373f3e6343ded763edcf74/examples/change-user-agent.php
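The `itemscope` attribute named in the error is an HTML attribute, which suggests the response was an HTML page rather than a feed. A minimal sketch of a pre-check for that case follows. `looksLikeXmlFeed` is a hypothetical helper, not part of feed-io, and it relies only on PHP's DOM extension.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical pre-check (not part of feed-io): returns true when $body is
 * well-formed XML whose root element is rss, feed (Atom) or RDF.
 */
function looksLikeXmlFeed(string $body): bool
{
    if (trim($body) === '') {
        return false;
    }

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $loaded = $document->loadXML($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded || $document->documentElement === null) {
        return false;
    }

    return in_array($document->documentElement->localName, ['rss', 'feed', 'RDF'], true);
}

// Made-up sample bodies: a tiny RSS document and an HTML page with a valueless attribute.
$feed = '<rss version="2.0"><channel><title>Monkey Cage</title></channel></rss>';
$html = '<html itemscope><head><title>Consent</title></head><body></body></html>';

echo 'feed body : ', looksLikeXmlFeed($feed) ? 'feed' : 'not a feed', "\n";
echo 'html body : ', looksLikeXmlFeed($html) ? 'feed' : 'not a feed', "\n";
```

The HTML sample is rejected for the same reason as in the error message: a valueless attribute such as `itemscope` is not well-formed XML.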

alexdebril commented 6 years ago

Hi @rlewkowicz

I think I know what's going on. If I try to get https://www.washingtonpost.com/blogs/monkey-cage/feed/?noredirect=on for the first time in my browser, I get a subscription page. I click on the "free" offer, then I get a GDPR-related page. I accept the conditions (without reading them, of course) and only then do I get the feed. If I remove a cookie called "wp_gdpr", I get the subscription page again.

Then I tried the following script:


<?php
require __DIR__.DIRECTORY_SEPARATOR.'bootstrap.php';

use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\MessageFormatter;
use Monolog\Logger;

// Log each outgoing request so the URL actually requested is visible.
$logger = new Logger('Logger');
$stack = HandlerStack::create();
$stack->push(
    Middleware::log(
        $logger,
        new MessageFormatter('{request}')
    )
);

// Wrap Guzzle in feed-io's adapter and send a browser-like user-agent.
$client = new \FeedIo\Adapter\Guzzle\Client(
    new GuzzleHttp\Client([
        'handler' => $stack,
    ]),
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
);

// Fetch the raw response once; the request shows up in the log.
$response = $client->getResponse('https://www.washingtonpost.com/blogs/monkey-cage/feed/?noredirect=on', new \DateTime('@0'));

$feedIo = new \FeedIo\FeedIo($client, $logger);

// Read and parse the feed through feed-io.
$result = $feedIo->read('https://www.washingtonpost.com/blogs/monkey-cage/feed');

echo "feed title : {$result->getFeed()->getTitle()} \n ";

foreach ($result->getFeed() as $item) {
    echo "item title : {$item->getTitle()} \n ";
}

feed-io gets redirected to /gdpr-consent/?destination=%2fblogs%2fmonkey-cage%2ffeed%3f and fails to parse it.
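A redirect like this one can be recognised before the body reaches the parser. The sketch below uses only built-in PHP functions; `consentRedirectDestination` is a hypothetical helper, and the `/gdpr-consent` path and `destination` parameter are taken from the redirect reported above, so they may not hold for other sites.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of feed-io or Guzzle): if $url points at the
 * consent page, return the decoded "destination" parameter, otherwise null.
 */
function consentRedirectDestination(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['path'], $parts['query'])) {
        return null;
    }

    // The consent path is an assumption based on the redirect seen in this thread.
    if (strpos($parts['path'], '/gdpr-consent') !== 0) {
        return null;
    }

    // parse_str() URL-decodes the values it extracts.
    parse_str($parts['query'], $query);

    return isset($query['destination']) && is_string($query['destination'])
        ? $query['destination']
        : null;
}

$redirect = '/gdpr-consent/?destination=%2fblogs%2fmonkey-cage%2ffeed%3f';
$destination = consentRedirectDestination($redirect);

echo $destination === null
    ? "not a consent redirect\n"
    : "consent page in front of : {$destination}\n";
```

To obtain the final URL in the first place, Guzzle's `track_redirects` setting under `allow_redirects` should record the redirect chain in the `X-Guzzle-Redirect-History` response header. Whether feed-io's adapter exposes that header is not something this thread confirms.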

Now I send the cookie through Guzzle:

<?php
require __DIR__.DIRECTORY_SEPARATOR.'bootstrap.php';

use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\MessageFormatter;
use Monolog\Logger;
use GuzzleHttp\Cookie\CookieJar;

// Log each outgoing request so the URL actually requested is visible.
$logger = new Logger('Logger');
$stack = HandlerStack::create();
$stack->push(
    Middleware::log(
        $logger,
        new MessageFormatter('{request}')
    )
);

// Pre-set the consent cookie for the www.washingtonpost.com domain.
$cookieJar = CookieJar::fromArray([
    'wp_gdpr' => '1|1'
], 'www.washingtonpost.com');

// Wrap Guzzle in feed-io's adapter, passing the cookie jar and a browser-like user-agent.
$client = new \FeedIo\Adapter\Guzzle\Client(
    new GuzzleHttp\Client([
        'handler' => $stack,
        'cookies' => $cookieJar
    ]),
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
);

// Fetch the raw response once; the request shows up in the log.
$response = $client->getResponse('https://www.washingtonpost.com/blogs/monkey-cage/feed/?noredirect=on', new \DateTime('@0'));

$feedIo = new \FeedIo\FeedIo($client, $logger);

// Read and parse the feed through feed-io.
$result = $feedIo->read('https://www.washingtonpost.com/blogs/monkey-cage/feed');

echo "feed title : {$result->getFeed()->getTitle()} \n ";

foreach ($result->getFeed() as $item) {
    echo "item title : {$item->getTitle()} \n ";
}

I got this :

feed title : Monkey Cage
item title : Russia used to see itself as part of Europe. Here’s why that changed.
item title : Trump’s tariffs aren’t the biggest trade problem. Will China step up to protect the WTO?
item title : Last week’s IG report about the FBI made a big splash. Here’s what you need to know about inspectors general.
item title : What political science can tell us about mass shootings
item title : If more states start using Ohio’s system, how many voters will be purged?
item title : Four things you should know about mutinies
item title : Why Melania Trump isn’t as popular as Laura Bush or Michelle Obama
item title : Will Colombia’s next president be a former left-wing guerrilla?
item title : Armed peacekeepers really do protect civilians — with one big exception
item title : Russia is hosting this year’s World Cup. What could go wrong?

Seriously, I have no clue how to fix this. But at least you've got a workaround.

alexdebril commented 6 years ago

Hi @rlewkowicz

I delivered a workaround for this very special case and I don't think it's necessary to patch feed-io for this.