binux / pyspider

A Powerful Spider (Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

Stats, resource utilization and monitoring #430

Closed: virtuman closed this issue 8 years ago

virtuman commented 8 years ago

Do you have any plans for adding any sort of resource utilization visualization tools?

This is a similar project, and they have Kafka and Grafana integrated very nicely:

https://github.com/estin/pomp-craigslist-example
https://drive.google.com/file/d/0BzRf6g_VWuIjZDUxMGc1Q1ZScFk/view?usp=sharing
https://bitbucket.org/estin/pomp/src/tip/examples/

Is this something you have planned, or is it not planned for any time soon?

binux commented 8 years ago

No, I don't have such a plan, as I think it can be done out of the box, without support from pyspider itself. It's more of an operations step at deployment time.

virtuman commented 8 years ago

The concern is that we don't know pyspider's capacity and possible throughput, because we don't know which step consumes the most resources and therefore what to optimize.

For example: with approximately 70 projects, we are able to process approximately 1,300,000 pages per day on a server with 32 cores, 128 GB RAM, SSD disks, a couple of LAN cards, and 300 proxy servers.

What we don't know is (see the timing sketch after this list):

  1. Is one of the projects poorly implemented (or optimizable), so that it uses far more resources than all the other projects?
  2. Which part of each project (on a per-project basis) is the slowest? We might be able to optimize it if we knew where the problem is.
  3. If we add more servers, which part do we need to offload (processors, result_workers, etc.)? We don't know which one needs more help than the others.
  4. Time it takes to retrieve a page: possibly the site itself is slow and we can't pull any faster, or we should check whether we are now being throttled when up until today we retrieved pages 100 times faster.
  5. Time it takes the result_worker to store data, e.g. whether the connection to the DB server is overloaded.
  6. Time it takes to pull a job from the queue: maybe Redis is misconfigured or over-utilized and takes too long to serve the queue.
  7. Time it takes to update the tasks and results collections, to see whether some collections have grown too large or indexes got corrupted without us knowing about the changes.
  8. How many times each process fired today, so we can compare it to previous days, weeks, etc. and monitor fluctuations: we may have been blocked and be receiving far more HTTP errors than ever before.
  9. How many errors happened (with error content / stack trace), all of them logged.
  10. And much, much more actionable info that could be acquired and used to turn pyspider into a next-gen platform.
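
To make points 4 through 7 concrete, here is a minimal sketch of how such per-stage timings could be captured. Nothing here is part of pyspider: `emit_metric`, `timed`, and the stage names are hypothetical stand-ins for whatever sink and naming scheme would actually be used.

```python
import time
from contextlib import contextmanager

def emit_metric(name, value):
    # Hypothetical sink; a StatsD/Graphite/Elasticsearch client could stand in here.
    print("%s %.3f" % (name, value))

@contextmanager
def timed(stage):
    """Measure wall-clock time of one pipeline stage and emit it as a metric."""
    start = time.time()
    try:
        yield
    finally:
        emit_metric("pyspider.%s.seconds" % stage, time.time() - start)

# Usage, wrapping the stages listed above on a per-project basis:
# with timed("myproject.fetch"):
#     page = fetch(url)          # fetch() is a stand-in for the real fetch step
# with timed("myproject.result_store"):
#     save_result(result)        # likewise for the result_worker's store step
```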

Really, resource utilization logging is the only missing piece in pyspider as it stands today. It's a great platform, very fast and very extensible, but without transparency into what happens behind the scenes it is unlikely to be picked up by many large organizations as their platform of choice.

As for what it takes to implement: it can be done easily and with backwards compatibility. Add config options for log_driver and log_options, ship with just one push driver initially, and we can contribute more plugins for it as soon as the hook exists. If the logging params are specified, push data to the endpoint as a JSON POST. The endpoint listener would already be a separate (micro)service, e.g. Elasticsearch, a file, Fluentd, Sysdig, etc. All of these tools have front-ends like Kibana, Graylog2/Grafana, and many more.
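
As a rough illustration of that contract, a push driver could look something like the sketch below. `HTTPPushLogDriver`, its `push` method, and the endpoint URL are all made up for illustration; pyspider has no such hook today.

```python
import json
import requests

class HTTPPushLogDriver(object):
    """Hypothetical log_driver: ships each stats message as JSON to an HTTP
    endpoint (Elasticsearch, Fluentd, a custom collector, ...)."""

    def __init__(self, endpoint, timeout=2):
        self.endpoint = endpoint   # would come from the proposed log_options
        self.timeout = timeout

    def push(self, message):
        try:
            requests.post(self.endpoint,
                          data=json.dumps(message),
                          headers={'Content-Type': 'application/json'},
                          timeout=self.timeout)
        except requests.RequestException:
            pass  # shipping metrics must never break the crawl itself

# driver = HTTPPushLogDriver('http://localhost:9200/pyspider-stats/_doc')  # made-up endpoint
# driver.push({'project': 'demo', 'stage': 'fetch', 'seconds': 0.42})
```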

As an improved alternative, all these "messages" could be pushed to Redis, or aggregated in memory and offloaded in chunks whenever a chunk reaches nn entries or every yy seconds.
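
A minimal sketch of that buffering idea, assuming redis-py and a Redis list as the offload target (the class name, key, and thresholds are hypothetical):

```python
import json
import time
import redis  # assumes the redis-py package

class ChunkedStatsBuffer(object):
    """Hypothetical buffer: collects messages in memory and offloads them to
    a Redis list once max_entries accumulate or max_seconds have elapsed."""

    def __init__(self, key='pyspider:stats', max_entries=500, max_seconds=10):
        self.client = redis.StrictRedis()
        self.key = key
        self.max_entries = max_entries
        self.max_seconds = max_seconds
        self.buf = []
        self.last_flush = time.time()

    def push(self, message):
        self.buf.append(json.dumps(message))
        if (len(self.buf) >= self.max_entries
                or time.time() - self.last_flush >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.buf:
            self.client.rpush(self.key, *self.buf)  # one round trip per chunk
            self.buf = []
        self.last_flush = time.time()
```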

I could work with you on making the initial plugin, but I don't have enough Python skills to build it myself with support for all the modes you currently have (i.e. standalone vs. threaded vs. Docker containers), since in threaded mode you need to be well versed in Python to know how to measure resources from multiple threads originating from the same PID.

Alex

binux commented 8 years ago

Hi Alex,

Thanks for the comment.

At the current stage of pyspider, you can find the following information on the dashboard:

With the design of pyspider, you can run as many fetchers/processors/result_workers as you need.

For other resources, e.g. the message queue: from my experience, apart from latency, pyspider should not be able to push it to its limits. And the latency can be solved with more instances.
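
For what it's worth, one way to check that queue latency directly is to stamp each message on enqueue and measure the delay on dequeue. The sketch below assumes a Redis-backed queue; `enqueue`/`dequeue` are illustrative helpers, not pyspider APIs.

```python
import json
import time
import redis  # assuming a Redis-backed queue, as in the setup described above

r = redis.StrictRedis()

def enqueue(queue, payload):
    # Stamp each message on the way in...
    r.rpush(queue, json.dumps({'t': time.time(), 'payload': payload}))

def dequeue(queue):
    # ...and report how long it sat in the queue on the way out.
    _, raw = r.blpop(queue)
    msg = json.loads(raw)
    print('queue latency: %.3fs' % (time.time() - msg['t']))
    return msg['payload']
```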

The database could be the bottleneck. I'm not talking about connection time, which can be solved with more result_workers or more threads in the scheduler; I'm talking about querying/inserting/updating the database. In the current implementation of the scheduler, every task operation (including restart checks, inserts, task starts, setting the finish status, etc.) is executed directly on the database.


So, my suggestion is:

If you think there is anything important that should be included in the log, don't hesitate to open a request. And if you need any help with your running internal system, you can contact me personally as well.