Closed by rlewkowicz 6 years ago
Maybe it's related to the user-agent. I don't get any feed using the CLI either:
./bin/feedio read https://www.washingtonpost.com/blogs/monkey-cage/feed
outputs
malformed xml string. parsing error : DOMDocument::loadXML(): Specification mandate value for attribute itemscope in Entity, line: 1 (2)
Maybe you could try using a custom user-agent like in this example : https://github.com/alexdebril/feed-io/blob/c22ab80b6ca9c9af08373f3e6343ded763edcf74/examples/change-user-agent.php
Hi @rlewkowicz
I think I know what's going on. If I try to get https://www.washingtonpost.com/blogs/monkey-cage/feed/?noredirect=on for the first time in my browser, I get a subscription page. I click on the "free" offer, then I get a GDPR-related page. I accept the conditions (without reading them, of course), and only then do I get the feed. If I remove a cookie called "wp_gdpr", I get the subscription page again.
Then I tried the following script:
require __DIR__.DIRECTORY_SEPARATOR.'bootstrap.php';

use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\MessageFormatter;
use Monolog\Logger;

$logger = new Logger('Logger');

// Log every request Guzzle sends
$stack = HandlerStack::create();
$stack->push(
    Middleware::log(
        $logger,
        new MessageFormatter('{request}')
    )
);

// Wrap Guzzle in feed-io's adapter with a browser-like user-agent
$client = new \FeedIo\Adapter\Guzzle\Client(
    new GuzzleHttp\Client([
        'handler' => $stack,
    ]),
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
);

// Fetch the raw response once, then read the feed through feed-io
$response = $client->getResponse('https://www.washingtonpost.com/blogs/monkey-cage/feed/?noredirect=on', new \DateTime('@0'));
$feedIo = new \FeedIo\FeedIo($client, $logger);
$result = $feedIo->read('https://www.washingtonpost.com/blogs/monkey-cage/feed');

echo "feed title : {$result->getFeed()->getTitle()} \n ";
foreach ($result->getFeed() as $item) {
    echo "item title : {$item->getTitle()} \n ";
}
With this script, feed-io gets redirected to /gdpr-consent/?destination=%2fblogs%2fmonkey-cage%2ffeed%3f and fails to parse the consent page it receives instead of the feed.
Now, if I send the cookie through Guzzle:
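To tell this situation apart from a genuinely malformed feed, one option is to check whether the response body is XML with a feed-like root element before handing it to the parser. The following is only a minimal sketch: `looksLikeFeed()` is a hypothetical helper, not part of feed-io, it relies solely on PHP's DOM extension, and the list of accepted root elements is an assumption about what should count as a feed.

```php
<?php
/**
 * Hypothetical helper (not part of feed-io): returns true when $body parses
 * as XML and its root element looks like an RSS, Atom or RDF feed.
 */
function looksLikeFeed(string $body): bool
{
    // An empty body can never be a feed (and loadXML() rejects empty input)
    if (trim($body) === '') {
        return false;
    }

    // Collect libxml errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $loaded = $document->loadXML($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded || $document->documentElement === null) {
        return false;
    }

    // RSS 2.0 uses <rss>, Atom uses <feed>, RSS 1.0 uses <rdf:RDF>
    $root = strtolower($document->documentElement->localName);

    return in_array($root, ['rss', 'feed', 'rdf'], true);
}

// An HTML page with a valueless attribute is not well-formed XML
var_dump(looksLikeFeed('<html itemscope><body>consent</body></html>'));
// A minimal RSS document is accepted
var_dump(looksLikeFeed('<rss version="2.0"><channel><title>t</title></channel></rss>'));
```

A check like this would let a script report "got an HTML page instead of a feed" rather than a raw libxml parsing error, which is usually a hint that a consent or login wall sits in front of the URL.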
require __DIR__.DIRECTORY_SEPARATOR.'bootstrap.php';

use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\MessageFormatter;
use Monolog\Logger;
use GuzzleHttp\Cookie\CookieJar;

$logger = new Logger('Logger');

// Log every request Guzzle sends
$stack = HandlerStack::create();
$stack->push(
    Middleware::log(
        $logger,
        new MessageFormatter('{request}')
    )
);

// Pre-set the consent cookie for the Washington Post domain
$cookieJar = CookieJar::fromArray([
    'wp_gdpr' => '1|1'
], 'www.washingtonpost.com');

// Same client as before, this time with the cookie jar attached
$client = new \FeedIo\Adapter\Guzzle\Client(
    new GuzzleHttp\Client([
        'handler' => $stack,
        'cookies' => $cookieJar
    ]),
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
);

$response = $client->getResponse('https://www.washingtonpost.com/blogs/monkey-cage/feed/?noredirect=on', new \DateTime('@0'));
$feedIo = new \FeedIo\FeedIo($client, $logger);
$result = $feedIo->read('https://www.washingtonpost.com/blogs/monkey-cage/feed');

echo "feed title : {$result->getFeed()->getTitle()} \n ";
foreach ($result->getFeed() as $item) {
    echo "item title : {$item->getTitle()} \n ";
}
I got this :
feed title : Monkey Cage
item title : Russia used to see itself as part of Europe. Here’s why that changed.
item title : Trump’s tariffs aren’t the biggest trade problem. Will China step up to protect the WTO?
item title : Last week’s IG report about the FBI made a big splash. Here’s what you need to know about inspectors general.
item title : What political science can tell us about mass shootings
item title : If more states start using Ohio’s system, how many voters will be purged?
item title : Four things you should know about mutinies
item title : Why Melania Trump isn’t as popular as Laura Bush or Michelle Obama
item title : Will Colombia’s next president be a former left-wing guerrilla?
item title : Armed peacekeepers really do protect civilians — with one big exception
item title : Russia is hosting this year’s World Cup. What could go wrong?
Seriously, I have no clue how to fix this in feed-io itself. But at least you've got a workaround.
Hi @rlewkowicz
I delivered a workaround for this very special case and I don't think it's necessary to patch feed-io for this.
With most RSS feeds, I grab them and store a timestamp. Then, on the next run, I grab all articles published since the last run. With the Washington Post the result is just empty, but it works with every other feed.
You could pick any of these: https://www.washingtonpost.com/rss-feeds/2014/08/04/ab6f109a-1bf7-11e4-ae54-0cfe1f974f8a_story.html?noredirect=on&utm_term=.785babb51f11
I tried "Dealing with missing timezones" from the readme, but I don't think that was it.
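For reference, the "everything since the last run" logic described above can be sketched in plain PHP as follows. This is an illustration only, not feed-io code and not a diagnosis of the Washington Post feed: `itemsSince()` and the array shape of the items are hypothetical. It separates items newer than the last run from items that carry no usable date, since a strict "since" filter would silently drop the latter, which is one thing worth checking when a feed comes back empty.

```php
<?php
/**
 * Hypothetical helper: splits items into those dated after $lastRun and
 * those without a usable date.
 * Each item is assumed to be ['title' => string, 'date' => ?DateTimeInterface].
 */
function itemsSince(array $items, DateTimeInterface $lastRun): array
{
    $fresh = [];
    $undated = [];

    foreach ($items as $item) {
        $date = $item['date'] ?? null;

        if (!$date instanceof DateTimeInterface) {
            // No date at all: keep it aside instead of dropping it silently
            $undated[] = $item;
        } elseif ($date > $lastRun) {
            $fresh[] = $item;
        }
    }

    return ['fresh' => $fresh, 'undated' => $undated];
}

$lastRun = new DateTimeImmutable('2018-06-18 00:00:00 UTC');
$result = itemsSince([
    ['title' => 'older', 'date' => new DateTimeImmutable('2018-06-17 12:00:00 UTC')],
    ['title' => 'newer', 'date' => new DateTimeImmutable('2018-06-19 09:30:00 UTC')],
    ['title' => 'no date', 'date' => null],
], $lastRun);

// prints "1 fresh, 1 undated"
echo count($result['fresh']), ' fresh, ', count($result['undated']), " undated\n";
```

If the "undated" bucket is where all the items of a feed end up, the empty result comes from the dates rather than from the timestamp bookkeeping.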