FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

403 problem #342

Open kylong opened 6 years ago

kylong commented 6 years ago

My crawler fetches data from this link:

https://weban.jp/ljbtyp08/mjbtyp080814/

Since February this year, requests from many countries have been denied. Fortunately, the page itself can still be fetched.

But Goutte returns 403, even though the same request works with curl and file_get_contents.

I tried several Goutte settings (headers and proxy):

Proxy 1

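use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

// Route requests through an HTTP proxy by swapping in a custom Guzzle client.
$client = new Client();
$guzzleClient = new GuzzleClient([
    'proxy' => 'xxx.xxx.xxx.xxx:8080',
]);
$client->setClient($guzzleClient);
$resource = $client->request('GET', $url);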

Proxy 2

$client = new Client();
$resource = $client->request('GET', $url, ['proxy' => ['https' => 'xxx.xxx.xxx.xxx:8118'] ]);

Header

$client = new Client();
$resource = $client->request('GET', $url, [
    'referer' => true,
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding' => 'gzip, deflate, br',
    ],
]);

Is there any solution for this? Please help.

Thank you
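
A note on the "Proxy 2" and "Header" attempts, offered as a possible factor rather than a confirmed diagnosis: in Goutte 3.x, request() is inherited from Symfony BrowserKit, and its third argument is the list of request parameters, not a Guzzle options array. The 'headers' and 'proxy' keys passed there may therefore never reach the HTTP layer. BrowserKit takes request headers as HTTP_* server parameters in the fifth argument instead. Below is a minimal sketch under that assumption. toServerParameters() and fetchWithHeaders() are hypothetical helper names, not part of Goutte, and even with browser-like headers the site may still answer 403.

```php
<?php
// Hypothetical helper (not part of Goutte): converts plain header names
// such as "User-Agent" into the HTTP_* server-parameter keys that
// BrowserKit-based clients turn back into request headers.
function toServerParameters(array $headers): array
{
    $server = [];
    foreach ($headers as $name => $value) {
        $server['HTTP_' . strtoupper(str_replace('-', '_', $name))] = $value;
    }
    return $server;
}

// Hypothetical helper: $client is assumed to be a Goutte\Client instance.
// The headers go in the fifth argument ($server); the third and fourth
// arguments (request parameters and files) stay empty for a plain GET.
function fetchWithHeaders($client, string $url, array $headers)
{
    return $client->request('GET', $url, [], [], toServerParameters($headers));
}
```

With Goutte installed through Composer, a call could look like fetchWithHeaders(new Goutte\Client(), $url, ['User-Agent' => '...']); the return value is the usual Crawler.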

diro7 commented 6 years ago

Hi.

This works for me:

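// $node is assumed to be a Crawler positioned on an element that contains the <img>.
$image = $node->filter('img')->attr('src');
$image = file_get_contents($image); // fetch the raw bytes directly, bypassing Goutte
$image = base64_encode($image);
$image = 'data:image/png;base64,' . $image;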
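
The snippet above assumes the download succeeds and that the image is a PNG. A slightly more defensive variant could look like the sketch below. toDataUri() and fetchAsDataUri() are hypothetical helper names introduced only for illustration, and the MIME type stays a caller-supplied guess rather than something detected from the bytes.

```php
<?php
// Hypothetical helper: wraps raw bytes in a data URI.
// $mime defaults to PNG, as in the reply above; pass the real type if known.
function toDataUri(string $bytes, string $mime = 'image/png'): string
{
    return 'data:' . $mime . ';base64,' . base64_encode($bytes);
}

// Hypothetical helper: downloads $url with file_get_contents and returns
// a data URI, or null when the download fails.
function fetchAsDataUri(string $url, string $mime = 'image/png'): ?string
{
    $bytes = @file_get_contents($url); // false on failure, warning suppressed
    return $bytes === false ? null : toDataUri($bytes, $mime);
}
```

Returning null on failure lets the caller skip a missing image instead of embedding an empty data URI.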

ngoviethung commented 3 years ago

I get "Attention Required! | Cloudflare". How can I fix it?

stof commented 3 years ago

@ngoviethung Well, this looks like the kind of response you get when the website blocks you with Cloudflare's bot protection system. If the website forbids bots, there is nothing you can do to make it work with the Goutte crawler.