tony-schumacher opened 8 years ago
I'm having a similar problem. Have you found a solution?
I switched to PhantomJS and built the tool on my own.
@TonySchu are you able to share that solution?
@KClough I used the npm phantom package. But you will get the same problem, so you need to build the loop on your own.
Here is what I am doing:
Problems: If you just use one instance, you can't use it for 500 URLs or so... Phantom is a little buggy, so I close the first instance after 10 loops and start a fresh one. Some other people are also doing it this way and it works really well.
Also make sure to store the PID of each phantom process in an array, so you can destroy the process if there is a bug. Otherwise you will get a memory leak pretty fast.
A little ugly, but I hope it helps (don't mind 'webshot' in my code... I just never refactored it):
```js
var phantom = require('phantom'); // npm "phantom" package

var phantomChildren = []; // PIDs of spawned phantom processes
var errorCounts = [];
var totalErrors = [];

// api for frontend app
app.post('/api/webshot', function (req, res) {
    var crawlStatus = {index: 0, max: req.body.length};
    initPhantom(req.body, crawlStatus);
    res.send("image crawler is running");
});

// create browser instance
function initPhantom(todos, crawlStatus) {
    phantom.create(['--ignore-ssl-errors=no', '--load-images=true'], {logLevel: 'error'})
        .then(function (instance) {
            console.log("===================> instance: ", instance.process.pid);
            phantomChildren.push(instance.process.pid);
            webshot(0, todos, instance.process.pid, instance, crawlStatus);
        })
        .catch(function (e) {
            console.log('Error in initPhantom', e);
            errorCounts.push(e);
            totalErrors.push(e);
            killProcesses();
        });
}

// create tab in browser and take a screenshot
function webshot(id, shots, processId, phInstance, crawlStatus) {
    // avoid excessive RAM use and memory leaks in phantom:
    // throw the instance away after 10 pages and start a fresh one
    if (id >= 10) {
        phInstance.exit();
        restartIfError(id, shots, null, crawlStatus);
    } else {
        phInstance.createPage().then(function (page) {
            page.property('viewportSize', {width: 1024, height: 768});
            page.setting('resourceTimeout', 7000);
            return page.open(shots[id].url)
                .then(function (status) {
                    // give the page some time to finish rendering
                    setTimeout(function () {
                        // get the page HTML
                        page.property('content')
                            .then(function (content) {
                                console.log("render %s / %s", id + 1, shots.length, "processId:", processId);
                                crawlStatus.index += 1;
                                var image = 'temp_img/' + shots[id]._id + '.png';
                                page.render(image, {format: 'png', quality: '30'})
                                    .then(function () {
                                        page.close();
                                        makeImageFromUrl(shots[id], image, content, crawlStatus);
                                        if (id < shots.length - 1) {
                                            webshot(id + 1, shots, processId, phInstance, crawlStatus);
                                        } else {
                                            console.log("===================> all done: %s files have been written", shots.length, "processId:", processId, "user:", shots[id].user);
                                            phInstance.exit();
                                        }
                                    })
                                    .catch(function (e) {
                                        console.log("render failed - processId: ", processId, e);
                                        restartIfError(id, shots, processId, crawlStatus);
                                    });
                            });
                    }, 5000);
                })
                .catch(function (e) {
                    console.log("page.open failed", e);
                    restartIfError(id, shots, processId, crawlStatus);
                });
        });
    }
}

// kill the broken phantom process and restart the crawl at the failed URL
function restartIfError(id, shots, p_id, crawlStatus) {
    if (p_id) {
        try {
            console.log("try to kill: ", p_id);
            process.kill(p_id);
        } catch (err) {
            // process is already gone
        }
    }
    console.log("Restart webshot");
    shots = shots.slice(id);
    initPhantom(shots, crawlStatus);
}
```
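The `killProcesses()` helper called in `initPhantom` isn't shown above. A minimal sketch (an assumption, not the author's actual code) that walks the stored PID array and force-kills each leftover phantom process might look like:

```javascript
// Hypothetical killProcesses() helper: kill every phantom process whose
// PID was pushed into phantomChildren, then reset the list.
var phantomChildren = []; // PIDs pushed by initPhantom()

function killProcesses() {
    phantomChildren.forEach(function (pid) {
        try {
            process.kill(pid); // sends SIGTERM by default
        } catch (err) {
            // ESRCH: process already exited, nothing to do
        }
    });
    phantomChildren.length = 0; // clear the list for the next run
}
```

This is exactly the leak-prevention idea from the comment above: if anything goes wrong, every child phantom can be destroyed from the tracked PIDs instead of lingering.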
Interesting, I've had this problem on another web-scraper-based project. I'm glad to hear it's not just me seeing memory leaks.
Thanks for this code. This is very helpful.
I am using webshot to create a lot of screenshots from different URLs in a loop. To do this, I am calling it only one after another. The problem is, it will only work for around 5-6 users at once, or the node server will freeze.
I noticed that webshot is very CPU intensive and will easily consume 100% of a small EC2 instance.
Is there a way to make it more CPU friendly, or maybe reuse the same phantom process for each URL?
Thanks a lot!
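One way to keep CPU bounded when several users hit the server at once (a sketch of the general idea, not something proposed in this thread) is to serialize crawl jobs through a simple in-memory queue, so only one phantom instance runs at a time instead of one per request:

```javascript
// Minimal FIFO job queue: only one crawl job runs at a time.
// Each job is a function that receives a done() callback and must
// call it when the crawl finishes, so the next job can start.
var queue = [];
var running = false;

function enqueueCrawl(job) {
    queue.push(job);
    runNext();
}

function runNext() {
    if (running || queue.length === 0) return;
    running = true;
    var job = queue.shift();
    job(function done() {
        running = false;
        runNext(); // start the next queued crawl, if any
    });
}
```

Inside the `/api/webshot` handler you would call `enqueueCrawl` with a job that runs `initPhantom` and invokes `done()` when the crawl completes, instead of starting the crawl immediately.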