FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper

Crawler doesn't contain html #301

Open kironet opened 7 years ago

kironet commented 7 years ago

Hey,

I fixed my previous code, but now I have a problem with the crawler: it doesn't contain any HTML.

However, when I dump $client->getResponse(), the HTML is there.

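        // Imports required at the top of the file:
        //   use Goutte\Client;
        //   use Symfony\Component\DomCrawler\Field\InputFormField;
        $client = new Client();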
        $client->setHeader('user-agent', "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36");
        $crawler = $client->request('GET', 'https://wyobiz.wy.gov/Business/FilingSearch.aspx');
        $form = $crawler->selectButton('Search')->form();
        $domDocument = new \DOMDocument;
        $input = $domDocument->createElement('input');
        $input->setAttribute('name', '__ASYNCPOST');
        $input->setAttribute('value', 'true');
        $formInput = new InputFormField($input);
        $form->set($formInput);
        $crawler = $client->submit($form, array(
            'ctl00$MainContent$txtFilingName' => 'Google',
        ));
        $response = $client->getResponse();
        var_dump($crawler);

crawler dump:

object(Symfony\Component\DomCrawler\Crawler)#196 (7) {
  ["uri":protected]=>
  string(48) "https://wyobiz.wy.gov/Business/FilingSearch.aspx"
  ["defaultNamespacePrefix":"Symfony\Component\DomCrawler\Crawler":private]=>
  string(7) "default"
  ["namespaces":"Symfony\Component\DomCrawler\Crawler":private]=>
  array(0) {
  }
  ["baseHref":"Symfony\Component\DomCrawler\Crawler":private]=>
  string(48) "https://wyobiz.wy.gov/Business/FilingSearch.aspx"
  ["document":"Symfony\Component\DomCrawler\Crawler":private]=>
  NULL
  ["nodes":"Symfony\Component\DomCrawler\Crawler":private]=>
  array(0) {
  }
  ["isHtml":"Symfony\Component\DomCrawler\Crawler":private]=>
  bool(true)
}

response dump:

object(Symfony\Component\BrowserKit\Response)#247 (3) {
  ["content":protected]=>
  string(47340) "1|#||4|25433|updatePanel|MainContent_UpdatePanel1|HTML_HERE......
  ["status":protected]=>
  int(200)
  ["headers":protected]=>
  array(6) {
    ["Cache-Control"]=>
    array(1) {
      [0]=>
      string(7) "private"
    }
    ["Content-Type"]=>
    array(1) {
      [0]=>
      string(25) "text/plain; charset=utf-8"
    }
    ["X-AspNet-Version"]=>
    array(1) {
      [0]=>
      string(9) "4.0.30319"
    }
    ["X-Powered-By"]=>
    array(1) {
      [0]=>
      string(7) "ASP.NET"
    }
    ["Date"]=>
    array(1) {
      [0]=>
      string(29) "Wed, 22 Mar 2017 10:55:17 GMT"
    }
    ["Content-Length"]=>
    array(1) {
      [0]=>
      string(5) "47340"
    }
  }
}

What's wrong?

stof commented 7 years ago

Well, the content in the Response is not valid HTML, so the HTML parsing fails.

kironet commented 7 years ago

@stof It's returning just <section></section> that's what I need. Can I somehow add doctype>html>head>/head>body> <section></section> >/body>/html

stof commented 7 years ago

Can you fix your comment to use a markdown code block around the code? I think the rendering stripped some content after "just" (I don't understand your comment otherwise).

kironet commented 7 years ago

@stof fixed...

You said it's not valid HTML. $response->getContent() is returning <section>some info here</section>. To be valid, it should be:

<!doctype html>
<html>
<head></head>
<body>
<section>some info here</section>
</body>
</html>

Am I right?

stof commented 7 years ago

No. In the dump above, the content is 1|#||4|25433|updatePanel|MainContent_UpdatePanel1|HTML_HERE, meaning there is extra data at the beginning of the response, which makes it invalid HTML. The response you receive is text/plain, not text/html.
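
If the goal is to crawl the HTML inside that payload, one option is to split the response into its length-prefixed fields and hand only the updatePanel content to a Crawler. Below is a minimal sketch under these assumptions: the body follows the ASP.NET AJAX partial-rendering layout of repeated length|type|id|content| blocks, the lengths count characters rather than bytes, and the mbstring extension is available. parseAspNetDelta is a hypothetical helper, not part of Goutte or Symfony, and the sample string is made up.

```php
<?php

/**
 * Hypothetical helper: splits an ASP.NET UpdatePanel "delta" body into blocks.
 * Assumes each block is laid out as length|type|id|content| and that length
 * is a character count (hence the mb_* functions).
 */
function parseAspNetDelta(string $body): array
{
    $blocks = [];
    $pos = 0;
    $total = mb_strlen($body, 'UTF-8');

    while ($pos < $total) {
        // Read the three header fields: content length, block type, block id.
        $header = [];
        for ($i = 0; $i < 3; $i++) {
            $sep = mb_strpos($body, '|', $pos, 'UTF-8');
            if ($sep === false) {
                return $blocks; // Trailing data that is not a complete block.
            }
            $header[] = mb_substr($body, $pos, $sep - $pos, 'UTF-8');
            $pos = $sep + 1;
        }

        [$length, $type, $id] = $header;
        if (!ctype_digit($length)) {
            throw new InvalidArgumentException('Unexpected block length: ' . $length);
        }

        // Take exactly $length characters, then skip the closing "|".
        $content = mb_substr($body, $pos, (int) $length, 'UTF-8');
        $pos += (int) $length + 1;

        $blocks[] = ['type' => $type, 'id' => $id, 'content' => $content];
    }

    return $blocks;
}

// Made-up sample shaped like the dump above; 28 is the length of the markup.
$sample = '1|#||4|28|updatePanel|Panel1|<section>some info</section>|';

foreach (parseAspNetDelta($sample) as $block) {
    if ($block['type'] === 'updatePanel') {
        echo $block['content'], PHP_EOL;
    }
}
```

With the real response, the extracted string could then be loaded with new Crawler($html) or $crawler->addHtmlContent($html) from symfony/dom-crawler, which should give a crawler that actually contains nodes.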