FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

->html() Not working on some websites #359

Closed jLynx closed 5 years ago

jLynx commented 6 years ago

Hi, when I use $crawler->html() I get nothing for one website, but it works fine on another. Here is my setup:

        use Goutte\Client;

        $client = new Client();
        // Send a browser-like User-Agent header with every request.
        $client->setHeader('User-Agent', "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36");
        $crawler = $client->request('GET', $url);
        print($crawler->html());

When I try http://www.symfony.com/blog/ it works fine, but if I use https://whatismyipaddress.com/ip/1.3.3.7 it returns an empty string.

Why is this happening and how can I fix it?

Thanks

jLynx commented 5 years ago

Is this project dead?

larowlan commented 5 years ago

Nope

whatismyipaddress.com doesn't work because the markup is borked: they have a self-closing html tag.

https://validator.w3.org/nu/?doc=https%3A%2F%2Fwhatismyipaddress.com%2Fip%2F1.3.3.7

jLynx commented 5 years ago

Is there any way to bypass the validation check?

larowlan commented 5 years ago

There is some HTML that \DOMDocument (built into PHP) just barfs on.

This is one of those cases.

Goutte uses symfony/dom-crawler, which is an abstraction on top of \DOMDocument.

Under the hood that uses (from memory) libxml2. If it can't parse it, nothing we can do.

Contact the site and tell them to fix their stuff?
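
To see what the parser makes of a given page, one option is to feed the HTML to \DOMDocument directly and collect the messages libxml reports. The following is a minimal sketch, not anything Goutte or DomCrawler provides: `inspectHtml()` is a hypothetical helper name, and the sample markup is made up rather than taken from the site in question.

```php
<?php
// Hypothetical helper (not part of Goutte or DomCrawler): parse an HTML
// string with DOMDocument and return the serialized document together
// with the messages libxml collected while parsing.
function inspectHtml(string $html): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $document = new DOMDocument();
    $loaded = $document->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode for the rest of the script.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [
        'loaded'   => $loaded,
        'html'     => $loaded ? (string) $document->saveHTML() : '',
        'messages' => $messages,
    ];
}

// Made-up sample input, not the actual markup of any real site.
$result = inspectHtml('<html><body><p>Hello</p></body></html>');
var_dump($result['loaded']);
print_r($result['messages']);
```

Passing the raw body of a problem page to the same helper should show whether the parser rejects it outright or only reports recoverable errors; what it reports will depend on the installed libxml2 version.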

jLynx commented 5 years ago

Thanks :+1: