FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

Get article title, url and more inside a DIV block #364

Open klor opened 5 years ago

klor commented 5 years ago

I have identified a set of DIV blocks on a newspaper front page. Each DIV block holds information about one article, and that is what I am trying to extract.

The HTML structure for each story looks like the example below. It's easy to fetch each story because they all have the "story" class. But once I have the "teaser block", I want to extract the info inside it.

With $node->html(), I am able to get the inner HTML (good!). But I need to take it a step further and get to the actual contents (heading, category, URL and story summary).

Here is what I have tried so far:

<?php
require 'vendor/autoload.php';

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <h1>News Section</h1>
        <p class="message">Hello World!</p>

        <div id="ab_1234" class="story">
            <div class="photo">
                <img src="https://www.example.com/images/photo_1234.jpg">
            </div>
            <div>
                <span class="heading">Foo</span>
                <span class="category">News</span>
                <span class="read_more"><a href="https://www.example.com/news/1234.html">Read more</a></span>
                Teaser for the story
            </div>
        </div>

        <div id="ab_1235" class="story">
            <div class="photo">
                <img src="https://www.example.com/images/photo_1235.jpg">
            </div>
            <div>
                <span class="heading">Bar</span>
                <span class="category">Sport</span>
                <span class="read_more"><a href="https://www.example.com/sport/1235.html">Read more</a></span>
                Teaser for the story
            </div>
        </div>
    </body>
</html>
HTML;

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$crawler = new Crawler($html);

$link = $crawler->filter('.story')->each(function ($node) {
    return [
        'url' => $node->attr('href'),
        'title' => $node->attr('title'),
        'text' => trim($node->text()),
        'html' => trim($node->html()),
    ];
});

print_r($link);

As you can see, this does not work yet. The url and title come back empty, and text is just all of the block's text run together:

Array
(
    [0] => Array
        (
            [url] => 
            [title] => 
            [text] => Foo
                News
                Read more
                Teaser for the story
            [html] => <div class="photo">
                <img src="https://www.example.com/images/photo_1234.jpg">
            </div>
            <div>
                <span class="heading">Foo</span>
                <span class="category">News</span>
                <span class="read_more"><a href="https://www.example.com/news/1234.html">Read more</a></span>
                Teaser for the story
            </div>
        )

    [1] => Array
        (
            [url] => 
            [title] => 
            [text] => Bar
                Sport
                Read more
                Teaser for the story
            [html] => <div class="photo">
                <img src="https://www.example.com/images/photo_1235.jpg">
            </div>
            <div>
                <span class="heading">Bar</span>
                <span class="category">Sport</span>
                <span class="read_more"><a href="https://www.example.com/sport/1235.html">Read more</a></span>
                Teaser for the story
            </div>
        )

)

Any ideas?
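
One direction that might work, offered as a sketch rather than a confirmed fix: inside the each() callback, $node is the div.story element itself, which has no href or title attribute, so the lookup probably has to continue with a nested filter() on $node. The example below assumes symfony/dom-crawler and symfony/css-selector are installed through Composer (Goutte depends on both) and that every story contains all four elements. The ownText() helper is a hypothetical name introduced here to collect the bare teaser text, which is not wrapped in any tag.

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Compact copy of the two story blocks from the question.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <div id="ab_1234" class="story">
            <div class="photo"><img src="https://www.example.com/images/photo_1234.jpg"></div>
            <div>
                <span class="heading">Foo</span>
                <span class="category">News</span>
                <span class="read_more"><a href="https://www.example.com/news/1234.html">Read more</a></span>
                Teaser for the story
            </div>
        </div>
        <div id="ab_1235" class="story">
            <div class="photo"><img src="https://www.example.com/images/photo_1235.jpg"></div>
            <div>
                <span class="heading">Bar</span>
                <span class="category">Sport</span>
                <span class="read_more"><a href="https://www.example.com/sport/1235.html">Read more</a></span>
                Teaser for the story
            </div>
        </div>
    </body>
</html>
HTML;

// Hypothetical helper (not part of DomCrawler): joins the non-blank text
// nodes that are direct children of $element, skipping nested tags.
function ownText(\DOMNode $element): string
{
    $parts = [];
    foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $parts[] = $text;
            }
        }
    }
    return implode(' ', $parts);
}

$crawler = new Crawler($html);

$stories = $crawler->filter('.story')->each(function (Crawler $node) {
    // Each nested filter() searches only inside the current story block.
    $heading = $node->filter('.heading');

    return [
        'id'       => $node->attr('id'),
        'title'    => trim($heading->text()),
        'category' => trim($node->filter('.category')->text()),
        'url'      => $node->filter('.read_more a')->attr('href'),
        'image'    => $node->filter('.photo img')->attr('src'),
        // The teaser sits next to the spans, so read it from their parent div.
        'teaser'   => ownText($heading->getNode(0)->parentNode),
    ];
});

print_r($stories);
```

With the sample markup, the first entry should carry the title Foo, the category News, the 1234.html link and the teaser text. Note that text() and attr() throw an InvalidArgumentException when the nested filter() matches nothing, so on a real page where some stories lack a category or link it would be safer to check count() on each sub-crawler first.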