I have identified a set of DIV blocks on a newspaper front page. Each DIV block holds info about one article, and that info is what I am trying to extract.
The HTML structure for each story looks like the sample below.
It's easy to fetch each story because they all have the "story" class.
But once I have the "teaser block", I want to extract the info inside it:
Heading
Category
Story summary field ("Teaser for the story")
Article url
Photo url
With $node->html(), I am able to get the inside HTML (good!). But I need to take it a step further to get to the actual contents (heading, category, url and story summary).
Here is what I have tried so far:
<?php
require 'vendor/autoload.php';
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<h1>News Section</h1>
<p class="message">Hello World!</p>
<div id="ab_1234" class="story">
<div class="photo">
<img src="https://www.example.com/images/photo_1234.jpg">
</div>
<div>
<span class="heading">Foo</span>
<span class="category">News</span>
<span class="read_more"><a href="https://www.example.com/news/1234.html">Read more</a></span>
Teaser for the story
</div>
</div>
<div id="ab_1235" class="story">
<div class="photo">
<img src="https://www.example.com/images/photo_1235.jpg">
</div>
<div>
<span class="heading">Bar</span>
<span class="category">Sport</span>
<span class="read_more"><a href="https://www.example.com/sport/1235.html">Read more</a></span>
Teaser for the story
</div>
</div>
</body>
</html>
HTML;
use Goutte\Client;
$client = new Client();
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler($html);
$link = $crawler->filter('.story')->each(function ($node) {
return [
'url' => $node->attr('href'),
'title' => $node->attr('title'),
'text' => trim($node->text()),
'html' => trim($node->html()),
];
});
print_r($link);
As you can see, this does not work yet: url and title come back empty, and text is just the concatenated text of the whole block:
Array
(
[0] => Array
(
[url] =>
[title] =>
[text] => Foo
News
Read more
Teaser for the story
[html] => <div class="photo">
<img src="https://www.example.com/images/photo_1234.jpg">
</div>
<div>
<span class="heading">Foo</span>
<span class="category">News</span>
<span class="read_more"><a href="https://www.example.com/news/1234.html">Read more</a></span>
Teaser for the story
</div>
)
[1] => Array
(
[url] =>
[title] =>
[text] => Bar
Sport
Read more
Teaser for the story
[html] => <div class="photo">
<img src="https://www.example.com/images/photo_1235.jpg">
</div>
<div>
<span class="heading">Bar</span>
<span class="category">Sport</span>
<span class="read_more"><a href="https://www.example.com/sport/1235.html">Read more</a></span>
Teaser for the story
</div>
)
)
Any ideas?
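One direction that seems worth trying, as a minimal sketch rather than a definitive answer: inside the each() callback, $node is itself a Crawler scoped to a single story, so it can be filtered again with the inner class names. Calling attr('href') on the .story DIV returns nothing because the DIV itself has no such attribute; the link lives on the nested a element. The sketch assumes symfony/dom-crawler and symfony/css-selector are installed via Composer and that every story contains all of the elements shown in the sample. The array keys (heading, category, teaser, url, photo) are names I picked for illustration. The teaser is a bare text node, so it is read through the plain DOM API rather than through a selector.

```php
<?php
// Sketch only: assumes symfony/dom-crawler and symfony/css-selector
// are installed via Composer, as in the attempt above.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Trimmed copy of the sample markup (one story is enough to show the idea).
$html = <<<'HTML'
<html><body>
<div id="ab_1234" class="story">
  <div class="photo"><img src="https://www.example.com/images/photo_1234.jpg"></div>
  <div>
    <span class="heading">Foo</span>
    <span class="category">News</span>
    <span class="read_more"><a href="https://www.example.com/news/1234.html">Read more</a></span>
    Teaser for the story
  </div>
</div>
</body></html>
HTML;

$crawler = new Crawler($html);

$stories = $crawler->filter('.story')->each(function (Crawler $node) {
    // The teaser is not wrapped in its own element: it is a text node
    // sitting next to the spans. Collect the text-node children of the
    // heading's parent <div> using the underlying DOM nodes.
    $teaser = '';
    foreach ($node->filter('.heading')->getNode(0)->parentNode->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $teaser .= $child->nodeValue;
        }
    }

    return [
        // Array keys are illustrative names, not part of any API.
        'heading'  => trim($node->filter('.heading')->text()),
        'category' => trim($node->filter('.category')->text()),
        'teaser'   => trim($teaser),
        'url'      => $node->filter('.read_more a')->attr('href'),
        'photo'    => $node->filter('.photo img')->attr('src'),
    ];
});

print_r($stories);
```

If some stories lack a photo or a link, text() and attr() will fail on the empty match, so checking count() on the filtered crawler before reading from it would probably be needed on a real front page.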