duzun / hQuery.php

An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
https://duzun.me/playground/hquery
MIT License
361 stars 74 forks

Make faster when looping through many pages #37

Closed neverender24 closed 6 years ago

neverender24 commented 6 years ago

Hi, I have successfully used your package and it is really great. With only one URL it's kinda fast, but when I loop through multiple URLs, 50 URLs can take up to 10 minutes.

How do I make it work faster?

duzun commented 6 years ago

There must be one of two causes:

  1. network/transport time,
  2. processing time.

I think it is most likely the first one, which could be reduced by using compression (gzip = true) and caching of repeated requests. Maybe you know of other ways to optimize the transport of the HTML. It doesn't have to be done with hQuery; there are plenty of good libraries for making HTTP requests (cURL, for example).

To optimize the processing time, you could use a faster machine/CPU, add more RAM if there is not enough of it, or use parallel processing on a multicore machine, e.g. by processing 8 documents at a time on an 8-core processor.
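
For illustration, here is a rough sketch of the parallel idea using PHP's curl_multi extension. This is separate from hQuery itself; it assumes hQuery::fromHTML() can parse an already-downloaded HTML string, and the URL list and batch size are just placeholders:

$urls = [ /* ... a batch of up to 8 URLs ... */ ];

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $u) {
    $ch = curl_init($u);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING       => 'gzip', // ask for a compressed response
        CURLOPT_FOLLOWLOCATION => true,
    ));
    curl_multi_add_handle($mh, $ch);
    $handles[$u] = $ch;
}

// Run all transfers at once: total wall time is roughly the slowest response,
// not the sum of all response times.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) curl_multi_select($mh);
} while ($running && $status == CURLM_OK);

foreach ($handles as $u => $ch) {
    $html = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);

    if ($html) {
        $doc = hQuery::fromHTML($html, $u); // parse the already-fetched HTML
        // ... $doc->find($sel) etc.
    }
}
curl_multi_close($mh);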

The library could be optimized too, but that is not an easy task. I already did my best when I wrote it many years ago.

Here is an example of how to measure the time:

hQuery::$cache_path = '/path/to/cache/'; // this enables caching of repeated requests on the FS

$doc = hQuery::fromUrl(
    $url
  , [
        'Accept'     => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'User-Agent' => 'MyFancyBot v0.1',
    ]
  , NULL
  , ['decode' => 'gzip'] // use compression
);

$read_time   = round($doc->read_time); // milliseconds
$index_time  = round($doc->index_time); // milliseconds

$select_time = microtime(true);
$elements = $doc->find($sel);
$select_time = round((microtime(true) - $select_time) * 1e6); // microseconds

$doc_size = $doc->size;

Next, you should draw some conclusions as to where the bottleneck is.
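
For example, a rough way to compare the numbers from the snippet above (read_time and index_time are in milliseconds, select_time is in microseconds; the comparison is purely illustrative):

// Purely illustrative: decide which side dominates
$processing_ms = $index_time + $select_time / 1000; // µs -> ms
if ($read_time > $processing_ms) {
    echo "Bottleneck is the network/response time -> try compression, caching or parallel requests\n";
} else {
    echo "Bottleneck is parsing/selecting -> try a faster CPU or parallel processing\n";
}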

neverender24 commented 6 years ago

Thank you for the feedback.

Does the response time of the scraped website also matter for the speed?

neverender24 commented 6 years ago

I basically did the looping like below; I don't know if I'm doing the fastest thing though.

$config = [
    'user_agent'  => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'accept_html' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
];

// Enable cache
hQuery::$cache_path = sys_get_temp_dir() . '/hQuery/';

for ($x = $from; $x <= $to; $x++) {

    $url = "https://somewebsites.com/?id=999-".$x."search_results";

    $go = @$_POST['go'] ?: @$_GET['go'];
    $rm = strtoupper(getenv('REQUEST_METHOD') ?: $_SERVER['REQUEST_METHOD']);
    // var_export(compact('url', 'sel', 'go')+[$rm]+$_SERVER);
    if ( $rm == 'POST' ) {

        // Results accumulator
        $return = array();

        foreach ($sels as $sel) {

            // If we have a $url to parse and a $sel (selector) to fetch, we are good to go
            if ( $url && $sel ) {
                try {
                    $doc = hQuery::fromUrl(
                        $url,
                        [
                            'Accept'     => $config['accept_html'],
                            'User-Agent' => $config['user_agent'],
                        ],
                        NULL,
                        ['decode' => 'gzip'] // use compression
                    );
                    if ( $doc ) {
                        // Read some meta info from $doc
                        $t = $doc->find('head title') and $t = trim($t->text()) and $meta['title'] = $t;
                        $t = $doc->find('head meta');
                        if ( $t ) foreach ($t as $k => $v) {
                            switch ($v->attr('name')) {
                                case 'description': {
                                    $t = trim($v->attr('content')) and $meta['description'] = $t;
                                } break;
                                case 'keywords': {
                                    $t = trim($v->attr('content')) and $meta['keywords'] = $t;
                                } break;
                            }
                        }
                        if ( $t = $doc->headers ) {
                            $b = array();
                            foreach ($t as $k => $v) $b[$k] = "$k: " . (is_array($v) ? implode(PHP_EOL, $v) : $v);
                            $meta['headers'] = $b = implode(PHP_EOL, $b);
                        }

                        $select_time = microtime(true);
                        $elements = $doc->find($sel);
                        $select_time = microtime(true) - $select_time;
                        $return['select_time'] = $select_time;

                        if ( is_array($elements) || is_object($elements) ) {
                            $return['elements_count'] = count($elements);

                            foreach ($elements as $pos => $el) {
                                if ( $el->text() !== null ) {
                                    $str = preg_replace('/(\v|\s)+/', ' ', $el->text());
                                }
                            }
                        }
                    }
                    else {
                        $return['request'] = hQuery::$last_http_result;
                    }
                }
                catch (Exception $ex) {
                    $error = $ex;
                }
            }

        } // end foreach ($sels)

    } // end if POST

} // end for loop

duzun commented 6 years ago

Request/response time usually matters the most. For a fast (PHP) website the response is usually below 200 ms, but for a slow website it is above 1 sec, and sometimes it can be even more than 10 sec.

Here is a simple example I run on my server:

URL: https://www.keramspb.ru/italyanskaya_plitka.html
Size: 208.5 kb
Read/Response Time: 260 ms
Index Time: 67 ms
Select Time for "a > img:parent": 1 ms (found 65 items)
Select Time for "span": 157 μs (found 792 items)

You can notice that the response takes the longest time, and yet this is a fast website. Also, the simpler the selector, the faster it selects, but the difference is negligible (~1 ms here).

Unfortunately, you can't control the response time. The only thing I can think of is to cache the result, but that only saves you time the second time you make the same request.
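
For instance (just an illustration, with $url as a placeholder), once $cache_path is set, a second request for the same URL should be served from the file-system cache, which you can verify by comparing read_time:

hQuery::$cache_path = sys_get_temp_dir() . '/hQuery/';

$doc1 = hQuery::fromUrl($url); // first time: fetched over the network
$doc2 = hQuery::fromUrl($url); // same URL again: should come from the FS cache
echo round($doc1->read_time), ' ms vs ', round($doc2->read_time), " ms\n";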

Note (a small improvement):

// Bad
if ( $el->text() !== null ) { // ->text() always returns a string, never NULL!
    $str = process($el->text());
}

// Better
$t = $el->text();
if ( $t !== "" ) {
    $str = process($t);
}

->text() and many other methods usually do some computing and thus take some time, so it is better to avoid invoking the same method repeatedly when possible. This is true for most libraries.

neverender24 commented 6 years ago

Thank you for making it clear.