FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

Goutte not using httpclient headers #401

Open restucciaquito opened 4 years ago

restucciaquito commented 4 years ago

Hi!

I'm trying to change the User-Agent as described in the HttpClient component documentation, but the crawler always uses 'Symfony BrowserKit'. Am I doing something wrong, or is this a bug?

$client = new Client(HttpClient::create(['timeout' => 5000, 'headers' => ['User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36']]));

Thanks for your help.

jmichaelterenin commented 4 years ago

You can set the header in this manner: $client->setHeader('User-Agent', $userAgent);

I'm trying to set the CURL options, hoping someone can help me. All the links for doing this are old.

restucciaquito commented 4 years ago

Thanks for your answer. When I set the header the way you suggested, I get this error:

Call to undefined method Goutte\Client::setHeader()

kiang commented 4 years ago

Because BrowserKit overrides the User-Agent header when it executes a request, you have to use setServerParameter() to put your own User-Agent back.

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

$client = new Client(HttpClient::create(array(
    'headers' => array(
        'user-agent' => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0', // overridden with 'Symfony BrowserKit' when the request is executed
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.5',
        'Referer' => 'http://yourtarget.url/',
        'Upgrade-Insecure-Requests' => '1',
        'Save-Data' => 'on',
        'Pragma' => 'no-cache',
        'Cache-Control' => 'no-cache',
    ),
)));
$client->setServerParameter('HTTP_USER_AGENT', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0');

jmichaelterenin commented 4 years ago

Thank you for that important/critical piece of information, much appreciated!

restucciaquito commented 4 years ago

@kiang, that's the correct answer, just tested, thanks for your great help!
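
The override described in kiang's comment can also be checked offline. The following is a minimal sketch, assuming Goutte 4 (where Goutte\Client accepts an HttpClientInterface) and Symfony's MockHttpClient, installed via Composer; the URL and the 'MyScraper/1.0' string are placeholders, not values from this thread.

```php
<?php
// Sketch: record the User-Agent that BrowserKit actually sends, without touching the network.
// Assumes fabpot/goutte ^4 and symfony/http-client are installed via Composer.
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;

$seen = [];

// The mock captures each outgoing request's headers instead of sending the request.
$mock = new MockHttpClient(function (string $method, string $url, array $options) use (&$seen) {
    // $options['headers'] is a flat list of "name: value" strings.
    foreach ($options['headers'] ?? [] as $line) {
        if (stripos($line, 'user-agent:') === 0) {
            $seen[] = $line;
        }
    }

    return new MockResponse('<html><body>ok</body></html>', [
        'response_headers' => ['Content-Type' => 'text/html'],
    ]);
});

$client = new Client($mock);

// First request: no server parameter set, so BrowserKit's default applies.
$client->request('GET', 'http://example.test/');

// Second request: the server parameter replaces the default.
$client->setServerParameter('HTTP_USER_AGENT', 'MyScraper/1.0');
$client->request('GET', 'http://example.test/');

print_r($seen);
```

If the behaviour is as described above, the first recorded header is expected to carry 'Symfony BrowserKit' and the second the custom value.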

BorisAnthony commented 4 years ago

Is this new? I have multiple scripts in which

use Goutte\Client;
$client = new Client();
$client->setHeader('user-agent','whatever');

has worked flawlessly. Now, after starting a new project with the most recent version of Goutte, it suddenly fails with PHP Fatal error: Uncaught Error: Call to undefined method Goutte\Client::setHeader()

BorisAnthony commented 4 years ago

Yup. There it is: https://github.com/FriendsOfPHP/Goutte/commit/05f6994ec1d0d8368157de7fe45063e751857086

restucciaquito commented 4 years ago

Yes, this answer is correct (https://github.com/FriendsOfPHP/Goutte/issues/401#issuecomment-591760247); I tested it on an updated version of Goutte.

alexhackney commented 3 years ago

I didn't read the comments thoroughly enough the first time through, but the comment above about setServerParameter() is spot on. I only found it after digging through the code by trial and error.

This: $this->client->setServerParameter('HTTP_USER_AGENT', $userAgent);

Sets the user agent properly and solved my issue with a site.
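
For anyone who needs a different User-Agent on a single request only, here is a rough sketch, assuming Goutte 4 on top of Symfony BrowserKit, where the fifth argument of request() is an array of server parameters merged over the client's defaults for that call. The URLs and agent strings are placeholders.

```php
<?php
// Sketch: client-wide default User-Agent plus a one-off override for a single request.
// Assumes fabpot/goutte ^4 and symfony/http-client are installed via Composer.
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

$client = new Client(HttpClient::create(['timeout' => 10]));

// Default for every request made through this client.
$client->setServerParameter('HTTP_USER_AGENT', 'MyScraper/1.0 (+https://example.com/bot)');

// Per-request override: parameters, files, then server parameters.
$crawler = $client->request('GET', 'https://example.com/', [], [], [
    'HTTP_USER_AGENT' => 'MyScraper/1.0 (mobile profile)',
    'HTTP_ACCEPT_LANGUAGE' => 'en-US,en;q=0.5',
]);

// Print the page title, or an empty string if the page has none.
echo $crawler->filter('title')->text(''), PHP_EOL;
```

Server parameters prefixed with HTTP_ are turned into request headers, so HTTP_ACCEPT_LANGUAGE above should go out as Accept-Language.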

WeeSee commented 3 years ago

This works in the current implementation:

$client = new Client([
    'HTTP_USER_AGENT' => 'Mozilla/5.0 (X11; Linux i686; rv:78.0) Gecko/20100101 Firefox/78.0',
    'HTTP_ACCEPT' => '*/*',
]);

Have a look at the constructor of Symfony\Component\BrowserKit\AbstractBrowser, which calls the method setServerParameters(). The Goutte Client class is indirectly derived from Symfony\Component\BrowserKit\AbstractBrowser.

adgower commented 3 years ago

Can you explain how to do this? When I copy the code, it just shows errors and says the argument must be of type HttpClientInterface.
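
A possible explanation, offered as an assumption rather than a confirmed diagnosis: in Goutte 4 the Client's first constructor argument is an HttpClientInterface, so passing an array of server parameters there triggers this type error. A minimal sketch of setting the same values after construction (the URL is a placeholder):

```php
<?php
// Sketch: pass server parameters after construction instead of as the first constructor argument.
// Assumes fabpot/goutte ^4 and symfony/http-client are installed via Composer.
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

// The first argument is the HTTP client (it may also be omitted), not an array.
$client = new Client(HttpClient::create());

// Replaces the client's current set of server parameters with this array.
$client->setServerParameters([
    'HTTP_USER_AGENT' => 'Mozilla/5.0 (X11; Linux i686; rv:78.0) Gecko/20100101 Firefox/78.0',
    'HTTP_ACCEPT' => '*/*',
]);

$crawler = $client->request('GET', 'https://example.com/');
```

setServerParameter() can be used instead when only one value needs to change and the others should be kept.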