FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

[Question] Issue scraping html table with Goutte #431

Closed Gabotron-ES closed 2 years ago

Gabotron-ES commented 3 years ago

Hi everybody, I'm trying to scrape an HTML table of cities by population with Goutte in Laravel. I want to return the HTML table as a PHP array, then turn it into JSON and save it to disk.

For some reason, when I crawl the table I get an array full of null values. This is my code:

public function crawlAustraliaHtmlTable(Request $request)
    {
        $html='';
        $client = new Client();
        $url = 'http://www.geoba.se/population.php?cc=AU&st=city_rank_country&asde=&page=1';
        $crawler = $client->request('GET', $url);
        //$crawler->addHTMLContent($html);

        $table = $crawler->filter('table')->filter('tr')->each(function ($tr, $i) {
            return $tr->filter('td')->each(function ($td, $i) {
                $td->filter('a')->each(function ($a, $i) {
                    return $a->attr('href');
                });
            });
        });

        //print_r($table);

        $json = json_encode($table);

        $filename = 'cities_in_australia.json';

        File::put(public_path('/uploads/'.$filename),$json);

        return response()->json([
            'json' => $json,
        ]);
    }

The result (notice that every value is null, for some reason):

[[null],[null,null,null,null,null],[null,null,null,null,null],[null,null,null,null,null],[null,null,null,null,null],[null,null,null,null,null],[null,null,null,null,null],[null,null,null,null,null],[null,null,null,null,null],[null,null,null,null,null],[null,null,null,null,null],...]]
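A possible explanation, judging only from the code above: the closure passed to the inner `each()` over `td` has no `return` statement, so the hrefs collected from the `a` elements are discarded and every cell comes back as null. The single `[null]` at the start would then be the header row, whose cells are `th` apart from one empty `td`. The following is a minimal sketch of that idea, not a confirmed fix. It assumes Composer autoloading with `symfony/dom-crawler` and `symfony/css-selector` available (Goutte depends on both), and the inline `$html` string is a hypothetical stand-in for the real page so that no network request is needed.

```php
<?php
// Assumption: run from a project where Composer has installed
// symfony/dom-crawler and symfony/css-selector (both pulled in by Goutte).
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical, simplified stand-in for the page being scraped.
$html = <<<'HTML'
<table>
  <tr><th>Rank</th><th>City</th><td></td></tr>
  <tr>
    <td>1.</td>
    <td><a href="/location.php?query=2158177&amp;geoid=Y">Melbourne</a></td>
    <td>3,730,206</td>
  </tr>
</table>
HTML;

$crawler = new Crawler($html);

// each() returns an array of whatever the closure returns,
// so every nesting level needs its own return statement.
$table = $crawler->filter('table tr')->each(function (Crawler $tr) {
    return $tr->filter('td')->each(function (Crawler $td) {
        $link = $td->filter('a');

        return [
            // Cell text, e.g. the city name or the population figure.
            'text' => trim($td->text()),
            // Not every cell contains a link, so check before reading href.
            'href' => $link->count() > 0 ? $link->attr('href') : null,
        ];
    });
});

echo json_encode($table, JSON_PRETTY_PRINT), PHP_EOL;
```

Returning both the text and the href per cell is a design choice for illustration; if only the links are needed, the inner closure could return the href alone. Filtering on `td` still yields a near-empty first row for the header, which can be skipped or handled by selecting `th` separately.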

The HTML table structure looks like this:

<table border=0 cellpadding=3 cellspacing=3 class="table table-condensed table-noline">

<tr style="font-size: 16px;">

<th class="bottom" valign=top width=50 align=left NOWRAP><b><a class=redglow style="color:#0000FF;" href="population.php?cc=AU&st=crcountry&asde=d&page=1" onClick="recordOutboundLink(this, 'Population - City - Australia', 'Sort By Rank'); return false;">Rank</a></b></td>
<th class="bottom" valign=top width=200 align=left NOWRAP><b><a class=redglow style="color:#0000FF;" href="population.php?cc=AU&st=city&asde=d&page=1" onClick="recordOutboundLink(this, 'Population - City - Australia', 'Sort By City'); return false;">City</a></b></td><th class="bottom" valign=top width=125 align=left><b><a class=redglow style="color:#0000FF;" href="population.php?cc=AU&st=state&asde=d&page=1" onClick="recordOutboundLink(this, 'Population - City - Australia', 'Sort By State'); return false;">State</a></b></td><th class="bottom" valign=top width=100 align=left><b>Country</b></td><th class="bottom" valign=top width=75 align=right NOWRAP><b><a class=redglow style="color:#0000FF;" href="population.php?cc=AU&st=pop&asde=d&page=1" onClick="recordOutboundLink(this, 'Population - City - Australia', 'Sort By Population'); return false;">Population</a></b></td>
<td></td>
</tr>

    <tr style="font-size:13px;" class="bb">
    <td valign=top><a name="1"></a>1.</td>
    <td valign=top><a class=redglow style="color:#0000FF;" href="/location.php?query=2158177&geoid=Y" onClick="recordOutboundLink(this, 'Population - City - Australia', 'Melbourne'); return false;">Melbourne</a></td>
    <td valign=top width=150><a class=redglow style="color:#0000FF;" href="population.php?sc=Victoria&state=Victoria" onClick="recordOutboundLink(this, 'Population - City - Australia', 'Victoria'); return false;">Victoria</a></td><td valign=top><a class=redglow style="color:#0000FF;" href="country.php?cc=AU&year=2020" onClick="recordOutboundLink(this, 'Population - City - Australia', 'Australia'); return false;">Australia</a></td>
    <td valign=top align=right>3,730,206</td>

    <tr style="font-size:13px;" class="bb">