FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

Goutte can't find elements that are out of view or still haven't loaded #423

Open matveynikon opened 3 years ago

matveynikon commented 3 years ago

I am trying to make a simple YouTube SEO tool with Goutte. It is supposed to search for a keyword, find a certain video, and print that video's position in the results for the keyword. My problem is that my Goutte bot can't find videos that rank below the top 10 results. I suppose that is either because those videos haven't loaded yet (for them to load, a person has to actually scroll down, which I am unable to do with Goutte) or because the video is simply outside the viewport.

Does anyone know a solution? If anyone knows a way to scroll in Goutte, please tell me.

My code:

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
sleep(1);
$crawler = $client->request('GET', 'https://www.youtube.com/results?search_query=php+web+scraping');
sleep(5);
$crawler->selectLink('php web scraping tutorial(simple)')->link(); // this video is in the top 30
?>
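One way to test the "hasn't loaded yet" guess is to check whether the link text appears at all in the HTML that came back from the request. Goutte only parses the response body and does not run JavaScript, so anything added to the page later by scripts would not be there. The sketch below is only an illustration using PHP's built-in DOMDocument rather than Goutte itself. The helper name `findLinkPosition` and the sample markup are hypothetical, made up for this example.

```php
<?php
// Hypothetical helper (not part of Goutte): returns the 1-based position of
// the first <a> element whose text contains $needle, or null if no such
// link exists in the given HTML string.
function findLinkPosition(string $html, string $needle): ?int
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $position = 0;
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $position++;
        if (stripos($anchor->textContent, $needle) !== false) {
            return $position;
        }
    }

    return null;
}

// Made-up stand-in for the markup a plain HTTP fetch might return.
$sampleHtml = '<html><body>'
    . '<a href="/watch?v=1">First video</a>'
    . '<a href="/watch?v=2">php web scraping tutorial(simple)</a>'
    . '</body></html>';

var_dump(findLinkPosition($sampleHtml, 'php web scraping tutorial(simple)')); // int(2)
var_dump(findLinkPosition($sampleHtml, 'not in the page'));                    // NULL
```

With Goutte, the string to inspect would be the response body, for example `$client->getResponse()->getContent()`. If the helper returns null for that body, the link is not in the static HTML, and no amount of `sleep()` will make it appear. Note that this counts every anchor on the page, so it is a presence check rather than a true ranking.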

jeromegamez commented 3 years ago

I had the same issue with another site and, while debugging, stumbled upon a reference to an HTML5 class in the Crawler class of the DomCrawler component:

use Masterminds\HTML5;
// ...
$this->html5Parser = class_exists(HTML5::class) ? new HTML5(['disable_html_ns' => true]) : null;

A follow-up Google search then led me to https://github.com/Masterminds/html5-php and https://symfony.com/blog/new-in-symfony-4-3-better-html5-parser-for-domcrawler

Long story short: running composer require masterminds/html5 solved the issue for me 🥳
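To see which situation a project is in, the same `class_exists` check that the Crawler uses can be run by hand. This is a rough sketch, assuming a standard Composer layout with `vendor/autoload.php` next to the script. The variable names are made up for illustration.

```php
<?php
// Assumed Composer layout; adjust the path if the autoloader lives elsewhere.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require $autoload;
}

// Mirrors the check quoted above: the HTML5 parser is only used by the
// Crawler when this class can be autoloaded.
$hasHtml5Parser = class_exists('Masterminds\\HTML5');

echo $hasHtml5Parser
    ? "masterminds/html5 is installed\n"
    : "masterminds/html5 not found; try: composer require masterminds/html5\n";
```

If the second message is printed, the Crawler's `html5Parser` property ends up `null`, as the quoted line shows, and the HTML5 parser is not used.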