binux / pyspider

A Powerful Spider (Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

Stats, resource utilization and monitoring #430

Closed: virtuman closed this issue 8 years ago

virtuman commented 8 years ago

Do you have any plans for adding any sort of resource utilization visualization tools?

This is a similar project, and they have Kafka and Grafana integrated very nicely:

https://github.com/estin/pomp-craigslist-example
https://drive.google.com/file/d/0BzRf6g_VWuIjZDUxMGc1Q1ZScFk/view?usp=sharing
https://bitbucket.org/estin/pomp/src/tip/examples/

Is this something you have planned, or is it not planned for any time soon?

binux commented 8 years ago

No, I don't have such a plan, as I think it can be done out of the box, without support from pyspider itself. It's more of an operations step at deployment time.

virtuman commented 8 years ago

The concern is that we don't know pyspider's capacity and possible throughput, because we don't know which step consumes the most resources and therefore what to optimize.

For example: with approximately 70 projects, we are able to process approximately 1,300,000 pages per day on a server with 32 cores, 128 GB RAM, SSD disks, a couple of LAN cards, and 300 proxy servers.

What we don't know is (see the timing sketch after this list):

  1. Is one of the projects poorly implemented (or optimizable), so that it uses far more resources than all the other projects?
  2. Which part of each project (on a per-project basis) is the slowest? We might be able to optimize it if we knew where the problem is.
  3. If we add more servers, which part do we need to offload (processors, result_workers, etc.)? We don't know which one needs more help than the others.
  4. Time it takes to retrieve a page: possibly the site itself is slow and we can't pull any faster, or we should check whether we are now being throttled when up until today we retrieved pages 100 times faster.
  5. Time it takes the result_worker to store data, e.g. whether the connection to the DB server is overloaded.
  6. Time it takes to pull a job from the queue: maybe Redis is misconfigured or over-utilized and takes too long to serve the queue.
  7. Time it takes to update the tasks and results collections, to see whether some collections have grown too large or indexes got corrupted without us knowing about the changes.
  8. How many times each process fired today, so we can compare it to previous days, weeks, etc. and monitor fluctuations: we may have been blocked and be receiving far more HTTP errors than ever before.
  9. How many errors happened (with error content / stack trace), all of them logged.
  10. And much, much more actionable info that could be acquired and used to turn pyspider into a next-gen platform.
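
To make points 4 through 7 concrete, here is a minimal sketch of how such per-stage timings could be captured. Nothing here is part of pyspider: `emit_metric`, `timed`, and the stage names are hypothetical stand-ins for whatever sink and naming scheme would actually be used.

```python
import time
from contextlib import contextmanager

def emit_metric(name, value):
    # Hypothetical sink; a StatsD/Graphite/Elasticsearch client could stand in here.
    print("%s %.3f" % (name, value))

@contextmanager
def timed(stage):
    """Measure wall-clock time of one pipeline stage and emit it as a metric."""
    start = time.time()
    try:
        yield
    finally:
        emit_metric("pyspider.%s.seconds" % stage, time.time() - start)

# Usage, wrapping the stages listed above on a per-project basis:
# with timed("myproject.fetch"):
#     page = fetch(url)          # fetch() is a stand-in for the real fetch step
# with timed("myproject.result_store"):
#     save_result(result)        # likewise for the result_worker's store step
```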

Really, resource utilization logging is the only missing piece in pyspider as it stands today. It's a great platform, very fast and very extensible, but without transparency into what happens behind the scenes it is unlikely to be picked up by many large organizations as their platform of choice.

As for what it takes to implement: it can be done easily and with backwards compatibility. Add config options for log_driver and log_options, ship with just one push driver initially, and we can contribute more plugins for it as soon as the hook exists. If the logging params are specified, push data to the endpoint as a JSON POST. The endpoint listener would already be a separate (micro)service, e.g. Elasticsearch, a file, Fluentd, Sysdig, etc. All of these tools have front-ends like Kibana, Graylog2/Grafana, and many more.
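
As a rough illustration of that contract, a push driver could look something like the sketch below. `HTTPPushLogDriver`, its `push` method, and the endpoint URL are all made up for illustration; pyspider has no such hook today.

```python
import json
import requests

class HTTPPushLogDriver(object):
    """Hypothetical log_driver: ships each stats message as JSON to an HTTP
    endpoint (Elasticsearch, Fluentd, a custom collector, ...)."""

    def __init__(self, endpoint, timeout=2):
        self.endpoint = endpoint   # would come from the proposed log_options
        self.timeout = timeout

    def push(self, message):
        try:
            requests.post(self.endpoint,
                          data=json.dumps(message),
                          headers={'Content-Type': 'application/json'},
                          timeout=self.timeout)
        except requests.RequestException:
            pass  # shipping metrics must never break the crawl itself

# driver = HTTPPushLogDriver('http://localhost:9200/pyspider-stats/_doc')  # made-up endpoint
# driver.push({'project': 'demo', 'stage': 'fetch', 'seconds': 0.42})
```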

As an improved alternative, all these "messages" could be pushed to Redis, or aggregated in memory and offloaded in chunks whenever a chunk reaches nn entries or every yy seconds.
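
A minimal sketch of that buffering idea, assuming redis-py and a Redis list as the offload target (the class name, key, and thresholds are hypothetical):

```python
import json
import time
import redis  # assumes the redis-py package

class ChunkedStatsBuffer(object):
    """Hypothetical buffer: collects messages in memory and offloads them to
    a Redis list once max_entries accumulate or max_seconds have elapsed."""

    def __init__(self, key='pyspider:stats', max_entries=500, max_seconds=10):
        self.client = redis.StrictRedis()
        self.key = key
        self.max_entries = max_entries
        self.max_seconds = max_seconds
        self.buf = []
        self.last_flush = time.time()

    def push(self, message):
        self.buf.append(json.dumps(message))
        if (len(self.buf) >= self.max_entries
                or time.time() - self.last_flush >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.buf:
            self.client.rpush(self.key, *self.buf)  # one round trip per chunk
            self.buf = []
        self.last_flush = time.time()
```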

I could work with you on making the initial plugin, but I don't have enough Python skills to build it myself with support for all the modes you currently have (i.e. standalone vs. threaded vs. Docker containers), since in threaded mode you need to be well versed in Python to know how to measure resources from multiple threads originating from the same PID.

Alex

binux commented 8 years ago

Hi Alex,

Thanks for the comment.

At the current stage of pyspider, you can find the following information on the dashboard:

With the design of pyspider, you can run as many fetchers/processors/result_workers as you need.

For other resources, e.g. the message queue: from my experience, apart from latency, pyspider should not be able to push it to its limits. And the latency can be solved with more instances.
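
For what it's worth, one way to check that queue latency directly is to stamp each message on enqueue and measure the delay on dequeue. The sketch below assumes a Redis-backed queue; `enqueue`/`dequeue` are illustrative helpers, not pyspider APIs.

```python
import json
import time
import redis  # assuming a Redis-backed queue, as in the setup described above

r = redis.StrictRedis()

def enqueue(queue, payload):
    # Stamp each message on the way in...
    r.rpush(queue, json.dumps({'t': time.time(), 'payload': payload}))

def dequeue(queue):
    # ...and report how long it sat in the queue on the way out.
    _, raw = r.blpop(queue)
    msg = json.loads(raw)
    print('queue latency: %.3fs' % (time.time() - msg['t']))
    return msg['payload']
```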

The database could be the bottleneck. I'm not talking about connection time, which can be solved with more result_workers or more threads in the scheduler; I'm talking about querying/inserting/updating the database. In the current implementation of the scheduler, every task operation (including restart checks, inserts, task starts, setting the finish status, etc.) is executed directly on the database.


So, my suggestion is:

If you think there is anything important that should be included in the log, don't hesitate to open a request. And if you need any help with your running internal system, you can contact me personally as well.