binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0
16.49k stars 3.69k forks

phantomjs doesn't release memory #207

Open xesina opened 9 years ago

xesina commented 9 years ago

I'm using a VPS with 2 GB of RAM and a single pyspider project. After a few minutes of crawling, phantomjs fills up the memory and won't release it, even after I stop the project.

binux commented 9 years ago

What do you mean by "stop the project": stopping pyspider itself, or stopping the project via the web UI? Which version of phantomjs are you using?

xesina commented 9 years ago

I stop the project from the web UI. `phantomjs --version` reports 1.9.8.

binux commented 9 years ago

Stopping a project will not make phantomjs release memory. I already do things to release resources after each request finishes, but there is no reliable way to make sure phantomjs actually frees them (google "phantomjs memory").

I have a phantomjs instance with a crawl rate of 5 pages per minute; it costs ~500 MB. My solution is to restart it every hour.
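The hourly-restart workaround can be sketched as a small wrapper process. This is a minimal sketch, not pyspider's own code; the function name, the `["pyspider", "phantomjs"]` command, and the interval are illustrative:

```python
import subprocess
import time

def run_with_restart(cmd, interval_s, cycles):
    """Run `cmd`, terminate it after `interval_s` seconds, then start a
    fresh instance; repeat `cycles` times so leaked memory is reclaimed."""
    for _ in range(cycles):
        proc = subprocess.Popen(cmd)
        time.sleep(interval_s)
        proc.terminate()  # SIGTERM; phantomjs exits and frees its memory
        proc.wait()       # reap the child so no zombie is left behind

# e.g. restart the fetcher every hour, 24 times:
# run_with_restart(["pyspider", "phantomjs"], 3600, 24)
```

The restart is lossy (in-flight requests at the moment of the kill are dropped), which is why it is only a workaround for the leak rather than a fix.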

xesina commented 9 years ago

How can I restart phantomjs? Thanks, I will check phantomjs_fetcher.js.

binux commented 9 years ago

If you are running pyspider with the plain `pyspider` command: use `ps` to find the phantomjs process and `kill` it.
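The `ps` + `kill` suggestion amounts to finding the phantomjs PID and signalling it. A hedged sketch of the same thing in Python (the helper name is mine, and it relies on `pgrep -f` from procps being available):

```python
import os
import signal
import subprocess

def kill_by_pattern(pattern):
    """SIGTERM every process whose command line matches `pattern`,
    i.e. the scripted equivalent of `ps | grep pattern` then `kill`."""
    out = subprocess.run(["pgrep", "-f", pattern],
                         capture_output=True, text=True)
    pids = [int(p) for p in out.stdout.split()]
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
    return pids

# kill_by_pattern("phantomjs")  # then start a fresh instance
```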

xesina commented 9 years ago

As you noted in the JS file, there is a memory leak that is resolved by `page.settings.loadImages = true;`.

binux commented 9 years ago

Does that actually work? I can't believe that an ordinary setting could cause a memory leak that was never fixed!

xesina commented 9 years ago

Yes, after changing that parameter it's working just fine!

binux commented 9 years ago

I will make it true by default.

laoyuan commented 9 years ago

But loading images takes too much time.

laoyuan commented 9 years ago

@binux, what's the proper way to restart PhantomJS every hour? I run it with: `nohup pyspider -c /usr/local/etc/pyspider.json phantomjs &`

binux commented 9 years ago

Manage the instance with some sort of process control system like supervisord (supervisord.org), then just kill it every hour and let supervisord restart it.
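A minimal sketch of that setup, assuming supervisord is installed; the program label and file path are assumptions, while the command is taken from laoyuan's invocation above:

```ini
; /etc/supervisord.d/phantomjs.ini (file path is an assumption)
[program:phantomjs]
command=pyspider -c /usr/local/etc/pyspider.json phantomjs
; respawn automatically whenever the process exits or is killed
autorestart=true
; send the stop signal to the whole process group, so child
; phantomjs processes die along with the pyspider wrapper
stopasgroup=true
```

An hourly cron entry such as `0 * * * * supervisorctl restart phantomjs` then performs the periodic restart, and `autorestart` covers crashes in between.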