Closed michaelbarley closed 4 years ago
My goal is to build a web scraper that receives a webpage URL and returns inbound links.
The particular page I'm scraping has both inbound and outbound links. None of the inbound links include http:// or https://. The code looks like:
$client = new Client(HttpClient::create(['timeout' => 60]));
$urls = [];
$crawler = $client->request('GET', 'https://www.symfony.com');
$crawler->filter('a')->each(function ($node) use (&$urls) {
    // Exclude outbound links
    if (strpos($node->attr('href'), '://') !== true) {
        array_push($urls, $node->attr('href'));
    }
});
var_dump($urls);
My expected result is an array that contains no string with '://'. That is not the case; the actual output looks like:
array(114) {
  [0]=> string(1) "/"
  [1]=> string(22) "https://sensiolabs.com"
  [2]=> string(1) "/"
  [3]=> string(16) "/what-is-symfony"
  [4]=> string(23) "/doc/current/index.html"
  [5]=> string(25) "https://symfonycasts.com/"
}
I'm not sure how these outbound links are still passing my check to remove them.
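For reference, PHP's strpos() returns the integer offset of the first match, or false when the needle is absent; it never returns true. So the comparison `!== true` is always satisfied and nothing is excluded. A minimal sketch of a check that actually keeps only relative links, using a small hypothetical list of hrefs rather than the live crawl:

```php
<?php
// strpos() returns an int offset on a match and false on no match,
// so excluding absolute URLs means comparing strictly against false.
$hrefs = ['/', 'https://sensiolabs.com', '/what-is-symfony'];

$inbound = array_filter($hrefs, function ($href) {
    // Keep the link only when '://' does not occur anywhere in it
    return strpos($href, '://') === false;
});

var_dump(array_values($inbound)); // only '/' and '/what-is-symfony' remain
```

The strict `===` matters as well: a loose `== false` would also drop a link whose '://' sits at offset 0, since strpos() would return 0 there, and 0 is falsy.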