FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

Unable to exclude outbound links from filter results #420

Closed michaelbarley closed 4 years ago

michaelbarley commented 4 years ago

My goal is to build a web scraper that receives a webpage URL and returns inbound links.

The particular page I'm scraping has both inbound and outbound links. None of the inbound links include http:// or https://. The code looks like:

    $client = new Client(HttpClient::create(['timeout' => 60]));
    $urls = [];
    $crawler = $client->request('GET', 'https://www.symfony.com');
    $crawler->filter('a')->each(function ($node) use (&$urls) {
        // Exclude outbound links
        if (strpos($node->attr('href'), '://') !== true) {
            array_push($urls,$node->attr('href'));
        }

    });
    var_dump($urls);

My expected result would be an array that does not contain any string with '://'. This is not the case in the actual output, which looks like:

array(114) {
  [0]=>
  string(1) "/"
  [1]=>
  string(22) "https://sensiolabs.com"
  [2]=>
  string(1) "/"
  [3]=>
  string(16) "/what-is-symfony"
  [4]=>
  string(23) "/doc/current/index.html"
  [5]=>
  string(25) "https://symfonycasts.com/"
}

I'm not sure how these outbound links are still passing my check to remove them.
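For what it's worth, a likely culprit is the `strpos()` comparison: `strpos()` returns the integer offset of the needle, or `false` when the needle is absent. It never returns `true`, so `strpos(...) !== true` is satisfied for every href and nothing gets filtered. A minimal sketch of the difference, using two hrefs from the output above:

```php
<?php
// strpos() returns an int offset on a match, or false on no match.
// It never returns true, so `!== true` is always satisfied.
$outbound = 'https://sensiolabs.com'; // '://' found at offset 5
$inbound  = '/what-is-symfony';       // '://' not found -> false

// The original check: passes for BOTH links, so outbound URLs slip through.
var_dump(strpos($outbound, '://') !== true); // bool(true)
var_dump(strpos($inbound,  '://') !== true); // bool(true)

// The intended check: strictly compare against false with ===.
var_dump(strpos($outbound, '://') === false); // bool(false) -> rejected
var_dump(strpos($inbound,  '://') === false); // bool(true)  -> kept
```

In the crawler callback that would read `if (strpos($node->attr('href'), '://') === false) { ... }`. The strict `===` matters here too: a loose `== false` would also reject an href whose match sits at offset 0, since `0 == false` in PHP.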