duzun / hQuery.php

An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
https://duzun.me/playground/hquery
MIT License
361 stars 74 forks

This page source works in your playground example but I can't get it to otherwise #57

Open chimmel opened 4 years ago

chimmel commented 4 years ago

$doc = hQuery::fromUrl('https://www.realtor.com/'); works, but any actual property URL throws: Notice: Undefined offset: 2 in /home/vendor/duzun/hquery/src/hQuery.php on line 1152. Yet the same property URLs work in your playground example runner. What am I missing?

duzun commented 4 years ago

You have a Notice, not an Error or even a Warning, so it is nothing serious.

Try adding error_reporting(E_ALL & ~E_NOTICE); at the beginning of the file.

I'll look into the cause of this Notice, nonetheless.

duzun commented 4 years ago

Looks like for some reason the response from the server starts with: HTTP/2 200

while the library expects it to start with something like: HTTP/1.1 200 OK

The notice is about the missing status message "OK" - nothing wrong actually.
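To illustrate why a status line without a reason phrase can trigger "Undefined offset: 2", here is a minimal sketch of a tolerant status-line parser. It is not taken from hQuery's source; the function name parse_status_line and the returned array keys are hypothetical and chosen only for this example. HTTP/2 does not carry a reason phrase, so the third part of the line has to be treated as optional.

```php
<?php
// Hypothetical helper for illustration only; not hQuery's actual code.
function parse_status_line($line)
{
    // Split into at most 3 parts: protocol version, status code, reason phrase.
    $parts = explode(' ', trim($line), 3);

    // Reject anything that does not look like an HTTP status line.
    if (count($parts) < 2 || strpos($parts[0], 'HTTP/') !== 0) {
        return false;
    }

    return array(
        'version' => substr($parts[0], 5),   // text after "HTTP/"
        'code'    => (int) $parts[1],
        // "HTTP/2 200" has no third part; reading $parts[2] without this
        // check is what raises "Undefined offset: 2".
        'message' => isset($parts[2]) ? $parts[2] : '',
    );
}

$http1 = parse_status_line('HTTP/1.1 200 OK'); // message is 'OK'
$http2 = parse_status_line('HTTP/2 200');      // message is ''
```

Both calls return code 200; only the message differs, which matches the point above that the missing "OK" is harmless.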

chimmel commented 4 years ago

Thanks for the input. How do you get that though? HTTP/2 vs. HTTP/1.1? When I try in your playground it says STATUS: 200 OK

chimmel commented 4 years ago

I added error_reporting(E_ALL & ~E_NOTICE); and var_dump($doc); returns /home/classes/PropertyPage.php:39:boolean false

duzun commented 4 years ago

hQuery::fromUrl uses my own HTTP implementation built on fsockopen(). I do not recommend it for advanced use cases. It was originally added for ease of use and to avoid extra dependencies. I recommend that everyone fetch the HTML by some other method and feed it to hQuery::fromHTML($html, $url=null). See #26 and the README for some details.

In any case, hQuery::fromUrl returns FALSE for any HTTP response but 200. Try var_dump(hQuery::$last_http_result) to get some details on the HTTP response. Or even better, use a PSR-7 library to fetch the document.
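As a rough sketch of the "fetch it yourself" approach, the example below uses PHP's cURL extension in place of a PSR-7 client and passes the result to hQuery::fromHTML($html, $url). The helpers fetch_html and load_document, the timeout, and the User-Agent string are assumptions made for this example and are not part of hQuery. It also assumes the cURL extension is enabled and hQuery is loaded (for example through Composer's autoloader) before load_document is called.

```php
<?php
// Hypothetical helper: download a page with cURL and return its body.
function fetch_html($url, $timeout = 15)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_ENCODING       => '',     // accept any encoding cURL supports
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-scraper)', // assumed value
    ));

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error  = curl_error($ch);

    if ($body === false) {
        throw new RuntimeException("Request failed: $error");
    }
    if ($status !== 200) {
        // Surface the status instead of silently returning false.
        throw new RuntimeException("Unexpected HTTP status $status for $url");
    }

    return $body;
}

// Hypothetical wrapper: fetch the page, then let hQuery parse it.
// Requires hQuery to be loaded, e.g. require 'vendor/autoload.php';
function load_document($url)
{
    return hQuery::fromHTML(fetch_html($url), $url);
}
```

With this in place, $doc = load_document($propertyUrl); would replace the hQuery::fromUrl call, and a non-200 response shows up as an exception with the status code rather than as a bare FALSE.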