brenden / node-webshot

Easy website screenshots in Node.js

High CPU in loops #177


tony-schumacher commented 8 years ago

I am using webshot to create a lot of screenshots of different URLs in a loop, calling it one URL after another. The problem is that it only works for around 5-6 users at once before the Node server freezes.

I noticed that webshot is very CPU intensive and will easily consume 100% of a small EC2 instance.

Is there a way to make it more CPU friendly, or maybe reuse the same phantom process for each URL?

Thanks a lot!

oraneedwards commented 7 years ago

I'm having a similar problem. Have you found a solution?

tony-schumacher commented 7 years ago

I switched to PhantomJS and built the tool on my own.

KClough commented 7 years ago

@TonySchu are you able to share that solution?

tony-schumacher commented 7 years ago

@KClough I used the phantom npm package. But you will hit the same problem, so you need to build the loop on your own.

Here is what I am doing:

  1. Create a phantom instance
  2. Create a browser tab (page)
  3. Open your URL
  4. Wait
  5. Get the HTML content
  6. Take the screenshot
  7. Close the browser tab (not phantom), then go back to step 2 and repeat
     - if the loop count is > 10, close phantom and go back to step 1
     - if there is an error, try to kill the process with the pid from step 1 and start again from step 1

Problems: if you just use one instance, you can't use it for 500 URLs or so. Phantom is a little buggy, so I close the instance from step 1 after 10 loops and start fresh. Some other people are also doing it this way, and it works really well.

Also make sure to store the pid of each phantom instance in an array, so you can destroy the process if there is a bug. Otherwise you will get a memory leak pretty fast.
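
My killProcesses() cleanup helper is not included in the snippet below; a minimal sketch of it (illustrative, assuming the phantomChildren pid array that the code below pushes into) could look like this:

```js
// Illustrative sketch: kill every tracked phantom child process
// and reset the pid list. Uses Node's process.kill (SIGTERM by default).
var phantomChildren = [];

function killProcesses() {
    phantomChildren.forEach(function (pid) {
        try {
            process.kill(pid);
        } catch (err) {
            // process is already gone; ignore
        }
    });
    phantomChildren = [];
}
```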

A little ugly, but I hope it helps (don't mind 'webshot' in my code... I just did not refactor it):

```js
var phantom = require('phantom'); // the "phantom" npm package

// phantomChildren and killProcesses are sketched above;
// errorCounts, totalErrors and makeImageFromUrl are defined elsewhere in my app.

// api for frontend app
app.post('/api/webshot', function (req, res) {
    var crawlStatus = {index: 0, max: req.body.length};
    initPhantom(req.body, crawlStatus);
    res.send("image crawler is running");
});

// create browser instance
function initPhantom(todos, crawlStatus) {
    phantom.create(['--ignore-ssl-errors=no', '--load-images=true'], {logLevel: 'error'})
        .then(function (instance) {
            console.log("===================> instance: ", instance.process.pid);
            phantomChildren.push(instance.process.pid);
            webshot(0, todos, instance.process.pid, instance, crawlStatus);
        }).catch(function (e) {
            console.log('Error in initPhantom', e);
            errorCounts.push(e);
            totalErrors.push(e);
            killProcesses();
        });
}

// create a tab in the browser and take a screenshot
function webshot(id, shots, processId, phInstance, crawlStatus) {
    // recycle the phantom process every 10 pages to limit RAM use and avoid phantom's memory leaks
    if (id >= 10) {
        phInstance.exit();
        restartIfError(id, shots, null, crawlStatus)
    } else {
        phInstance.createPage().then(function (page) {
            page.property('viewportSize', {width: 1024, height: 768});
            page.setting("resourceTimeout", 7000);
            return page.open(shots[id].url)
                .then(function (status) {
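                    // wait a few seconds so slow resources can finish loading before reading the page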
                    setTimeout(function () {
                        //get content html
                        var content = page.property('content');
                        return content
                            .then(function (content) {
                                // take the screenshot
                                console.log("render %s / %s", id + 1, shots.length, "processId:", processId);
                                crawlStatus.index += 1;
                                var image = 'temp_img/' + shots[id]._id + '.png';
                                page.render(image, {format: 'png', quality: '30'})
                                    .then(function () {
                                        page.close();
                                        makeImageFromUrl(shots[id], image, content, crawlStatus);
                                        if (id < shots.length - 1) {
                                            id += 1;
                                            webshot(id, shots, processId, phInstance, crawlStatus);
                                        } else {
                                            console.log("===================> all done: %s files has been written", shots.length, "processId:", processId, "user:", shots[id].user);
                                            phInstance.exit();
                                        }
                                    }).catch(function (e) {
                                        console.log("render failed - processID: ", processId, e);
                                        restartIfError(id, shots, processId, crawlStatus);
                                    });
                            });
                    }, 5000);
                }).catch(function (e) {
                    console.log("page.open failed", e);
                    restartIfError(id, shots, processId, crawlStatus);
                });
        }).catch(function (e) {
            // also recover if createPage itself fails
            console.log("createPage failed", e);
            restartIfError(id, shots, processId, crawlStatus);
        });
    }
}

function restartIfError(id, shots, p_id, crawlStatus) {
    if (p_id) {
        try {
            console.log("try to kill: ", p_id);
            process.kill(p_id);
        } catch (err) {
            // process is already gone; nothing to do
        }
    }
    console.log("Restart webshot");
    shots = shots.slice(id);
    initPhantom(shots, crawlStatus);
}
```

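For completeness, here is a hypothetical call to the endpoint above from a frontend (the _id, url and user fields are the ones the server code reads; the values are made up):

```js
// Hypothetical client call: POST an array of shots to /api/webshot.
fetch('/api/webshot', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify([
        {_id: 'shot-1', url: 'https://example.com', user: 'demo'},
        {_id: 'shot-2', url: 'https://example.org', user: 'demo'}
    ])
});
```
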
KClough commented 7 years ago

Interesting, I've had this problem on another web-scraper-based project. I'm glad to hear it's not just me seeing memory leaks.

Thanks for this code. This is very helpful.