binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

JSON export should be a valid JSON file #23

Closed: halflings closed this issue 9 years ago

halflings commented 9 years ago

Currently, doing a JSON dump gives a list of newline-separated JSON maps. I think it should be a single list of JSON maps, like [{"key": "value", ...}, {"key": "value", ...}, ...], to make it easier to reuse.
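For illustration, a minimal sketch of converting the current newline-delimited dump into one valid JSON array (the file names results.jsonl and results.json are hypothetical):

import json

# Read the newline-delimited dump: one JSON object per line.
with open('results.jsonl') as f:
    records = [json.loads(line) for line in f if line.strip()]

# Re-serialize as a single, valid JSON array.
with open('results.json', 'w') as f:
    json.dump(records, f)

(Note this builds the whole list in memory, which is exactly the concern raised in the next comment.)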

binux commented 9 years ago

It's hard to build a big list of JSON objects in memory. It needs a streaming JSON library (like https://github.com/dominictarr/JSONStream) to make that possible.
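A streaming writer essentially does the following (a rough sketch of the idea, not pyspider's code; dump_json_array and its arguments are illustrative):

import json

def dump_json_array(fp, results):
    # Stream out a valid JSON array by writing the delimiters by hand,
    # so only one result is serialized in memory at a time.
    fp.write('[')
    for i, result in enumerate(results):
        if i:
            fp.write(',')
        fp.write(json.dumps(result))
    fp.write(']')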

halflings commented 9 years ago

True. I guess producing the list is easy to do "manually", but it would still be problematic to consume.
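On the consumption side, a streaming parser such as ijson can walk a large array without loading it all at once (a sketch; handle is a hypothetical per-result function):

import ijson  # third-party streaming JSON parser

with open('results.json', 'rb') as f:
    # Iterate over the items of the top-level array one at a time,
    # without holding the whole document in memory.
    for result in ijson.items(f, 'item'):
        handle(result)  # hypothetical per-result handler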

binux commented 9 years ago

The best practice is to write a result worker. Results will be sent to it in real time. The export from the web UI is just a preview. (OK, to be honest, I haven't thought this through carefully.)
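A minimal sketch of handling results in real time from a project script, via the on_result hook (save_result is a hypothetical helper):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_result(self, result):
        if result:
            save_result(result)  # hypothetical: persist/forward each result
        # keep the default behavior (sending to the result worker)
        super(Handler, self).on_result(result)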

halflings commented 9 years ago

Sure, but I think what you did so far is actually good (better than my suggestion, at least).

mavencode01 commented 9 years ago

Is it possible to write the result to a message queue in the on_result callback?

binux commented 9 years ago

@pkadetiloye Yes, of course! It's the default behavior. Override result_worker to handle the data.
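For illustration, a sketch of a custom result worker that publishes each result to RabbitMQ with pika (the queue name and connection details are illustrative, and the ResultWorker signatures here assume pyspider's source at the time):

import json
import pika  # third-party RabbitMQ client
from pyspider.result import ResultWorker

class MQResultWorker(ResultWorker):
    def __init__(self, resultdb, inqueue):
        super(MQResultWorker, self).__init__(resultdb, inqueue)
        connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
        self.channel = connection.channel()
        self.channel.queue_declare(queue='pyspider_results')

    def on_result(self, task, result):
        # publish each crawl result to the message queue as JSON
        self.channel.basic_publish(
            exchange='',
            routing_key='pyspider_results',
            body=json.dumps(result))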

mavencode01 commented 9 years ago

Great! I just stumbled on this awesome crawler yesterday... and got it working using Vagrant.

If I'm not using Docker, how do I specify the configuration for RabbitMQ, PhantomJS, or an external MySQL?

Thank you, Binux

binux commented 9 years ago

@pkadetiloye Refer to the Dockerfile, e.g. https://registry.hub.docker.com/u/dockerfile/mysql/. You may change the Docker image or build your own if needed.

Glad to help.

mavencode01 commented 9 years ago

@binux, I'm not using Docker/LXC containers; I just want to set up the crawler on my box. run.py reads some environment variables, but is there any documentation on how to set them up for the different parameters?

e.g.:

scheduler_xmlrpc_port = int(os.environ.get('SCHEDULER_XMLRPC_PORT', 23333))
fetcher_xmlrpc_port = int(os.environ.get('FETCHER_XMLRPC_PORT', 24444))
phantomjs_proxy_port = int(os.environ.get('PHANTOMJS_PROXY_PORT', 25555))
webui_host = os.environ.get('WEBUI_HOST', '0.0.0.0')
webui_port = int(os.environ.get('WEBUI_PORT', 5000))
debug = bool(os.environ.get('DEBUG', False))
queue_maxsize = int(os.environ.get('QUEUE_MAXSIZE', 100))
demo_mode = bool(os.environ.get('DEMO_MODE'))
...

Thanks

binux commented 9 years ago

@pkadetiloye I haven't decided how to do this yet. I want to make a single run.py that works locally, in Docker, and in your case. For now, you can set the environment variables you need based on the code, or just change the source code.
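For example, a hypothetical wrapper script that sets a few of those variables and then launches run.py (the values here are illustrative):

import os
import subprocess

# Copy the current environment and override the parameters run.py reads.
env = dict(os.environ)
env.update({
    'WEBUI_PORT': '8080',
    'SCHEDULER_XMLRPC_PORT': '23333',
    'QUEUE_MAXSIZE': '200',
})
subprocess.call(['python', 'run.py'], env=env)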

binux commented 9 years ago

If you have any further ideas, please reopen this issue.