FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

Goutte is bypassing HttpClient base_uri option #427

ssnepenthe closed 2 years ago

ssnepenthe commented 3 years ago

Hopefully I'm not missing anything obvious here...

The last time I used Goutte was before the switch to the Symfony HttpClient. Back then I could configure the Guzzle client with a base URI and Goutte would honor it.

So for example I might do something like this before:

$goutte = new Goutte\Client;
$goutte->setClient(new GuzzleHttp\Client([
    'allow_redirects' => false,
    'base_uri' => 'https://duckduckgo.com',
    'cookies' => true,
]));

// request() returns a DomCrawler Crawler, not an HTTP response
$crawler = $goutte->request('GET', '/about');
$crawler->getUri(); // "https://duckduckgo.com/about"

With the latest version, I can still set a base URI on the HTTP client, and on its own it works as expected:

$client = Symfony\Component\HttpClient\HttpClient::create([
    'base_uri' => 'https://duckduckgo.com',
]);
$response = $client->request('GET', '/about');
$response->getInfo('url'); // "https://duckduckgo.com/about"

But when I give that client to Goutte, the base URI is no longer applied:

$goutte = new Goutte\Client($client);

$crawler = $goutte->request('GET', '/about');
// Throws Symfony\Component\HttpClient\Exception\TransportException with message
// 'Couldn't connect to server for "http://localhost/about".'

Is this intended behavior?

jeystaats commented 3 years ago

I ran into the same thing today and am quite curious whether this is indeed intended.

ssnepenthe commented 2 years ago

For anyone who stumbles on this in the future...

After revisiting this I noticed that the Goutte client extends the BrowserKit HttpBrowser, which in turn extends AbstractBrowser.

AbstractBrowser passes the URI through a method called getAbsoluteUri, which can build an absolute URI from the HTTPS and HTTP_HOST server parameters.
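To make that resolution step concrete, here is a minimal standalone sketch of the idea. It is not the BrowserKit source: the function name sketchAbsoluteUri is hypothetical, it only handles root-relative paths, and it ignores browser history. The "localhost" fallback is inferred from the error message earlier in this thread, and treating any set HTTPS value as "use https" is an assumption.

```php
<?php
// Hypothetical helper for illustration only; not part of Goutte or BrowserKit.
// Builds an absolute URI from HTTPS / HTTP_HOST style server parameters.
function sketchAbsoluteUri(string $uri, array $server): string
{
    // URIs that are already absolute are returned unchanged.
    if (preg_match('#^https?://#i', $uri) === 1) {
        return $uri;
    }

    // Assumption: the mere presence of 'HTTPS' selects the https scheme.
    $scheme = isset($server['HTTPS']) ? 'https' : 'http';

    // Assumption: with no HTTP_HOST set, the host falls back to "localhost",
    // which matches the "http://localhost/about" error reported above.
    $host = $server['HTTP_HOST'] ?? 'localhost';

    return sprintf('%s://%s/%s', $scheme, $host, ltrim($uri, '/'));
}

// No server parameters: resolves against http://localhost.
echo sketchAbsoluteUri('/about', []), "\n";

// With both parameters set: resolves against the intended host.
echo sketchAbsoluteUri('/about', [
    'HTTPS' => true,
    'HTTP_HOST' => 'duckduckgo.com',
]), "\n";
```

Under those assumptions, the relative path is turned into an absolute URL before it ever reaches the HTTP client, which would explain why the client's base_uri never gets a chance to apply.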

So you can effectively get the same result as configuring the base_uri setting with something like the following:

$goutte = new Goutte\Client;
$goutte->setServerParameter('HTTPS', true);
$goutte->setServerParameter('HTTP_HOST', 'duckduckgo.com');

// request() returns a DomCrawler Crawler, not an HTTP response
$crawler = $goutte->request('GET', '/about');
$crawler->getUri(); // "https://duckduckgo.com/about"
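If you would rather keep a single base URI string in your configuration, one option is to derive the two server parameters from it. The sketch below uses only PHP's built-in parse_url; the helper name serverParamsFromBaseUri is hypothetical and not part of Goutte or BrowserKit, and any path component in the base URI is deliberately dropped because the server parameters only carry scheme and host.

```php
<?php
// Hypothetical helper for illustration only; not part of Goutte or BrowserKit.
// Converts a base URI into the HTTPS / HTTP_HOST server parameters.
function serverParamsFromBaseUri(string $baseUri): array
{
    $parts = parse_url($baseUri);

    // Require an absolute URI with both scheme and host.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URI: {$baseUri}");
    }

    // Keep an explicit port as part of the host value.
    $host = $parts['host'];
    if (isset($parts['port'])) {
        $host .= ':' . $parts['port'];
    }

    $params = ['HTTP_HOST' => $host];

    // Only set HTTPS for https URIs; leave it unset for plain http.
    if (strtolower($parts['scheme']) === 'https') {
        $params['HTTPS'] = true;
    }

    return $params;
}

// Usage with a Goutte client (commented out so the sketch runs standalone):
// foreach (serverParamsFromBaseUri('https://duckduckgo.com') as $key => $value) {
//     $goutte->setServerParameter($key, $value);
// }
```

Leaving HTTPS unset for http URIs, rather than setting it to false, avoids depending on how the parameter's value is interpreted.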

Closing as it seems this is indeed intended as explained here.