Closed virtuman closed 8 years ago
No, I don't have such a plan; I think it should be doable out of the box, without support from pyspider itself. It's more of an operations step at deployment time.
The concern is that we don't know the capacity and possible throughput of pyspider, since we don't know which step consumes the most resources and therefore what to optimize.
For example: with approximately 70 projects we are able to process approximately 1,300,000 pages per day on a single server (32 cores, 128 GB RAM, SSD disks, a couple of LAN cards, and 300 proxy servers).
What we don't know is:
Really, resource utilization logging is the only missing part of pyspider as it stands today. It's a great platform, very fast and very extensible, but without transparency into what happens behind the scenes it is unlikely to be picked up by many large organizations as their platform of choice.
As for what it takes to implement: it can easily be done with backwards compatibility. Add config options for log_driver and log_options. Ship only one push driver initially; we can contribute more plugins for it as soon as you have it. If the logging params are specified, push data to the endpoint in POST/JSON format. The endpoint listener should already be a separate (micro)service, e.g. Elasticsearch, a file, fluentd, sysdig, etc. All these tools have interfaces like Kibana, Graylog2/Grafana, and many more.
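To make the proposal concrete, here is a minimal sketch of what such a push driver could look like. All names here (HttpPushDriver, log_driver, log_options, the record layout) are illustrative assumptions for this proposal, not existing pyspider APIs:

```python
import json
import urllib.request


class HttpPushDriver:
    """Hypothetical push driver: POSTs metric records as JSON to an endpoint.

    Sketch only; log_driver/log_options are *proposed* config options,
    not part of pyspider today.
    """

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def format_record(self, component, metrics):
        # One flat JSON object per record, easy to index in
        # Elasticsearch or route through fluentd.
        record = {"component": component}
        record.update(metrics)
        return json.dumps(record, sort_keys=True)

    def push(self, component, metrics):
        body = self.format_record(component, metrics).encode("utf-8")
        req = urllib.request.Request(
            self.endpoint,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # fire-and-forget POST
```

A second driver (file, Redis, Kafka) would only need to implement the same format_record/push pair, which is what makes the plugin approach cheap to extend.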
As an improved alternative, all these "messages" could be pushed to Redis, or aggregated in memory and offloaded in chunks when a chunk reaches nn entries or every yy seconds.
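The in-memory aggregation idea can be sketched as a small buffer that flushes on either trigger. The class name and the size/age thresholds are hypothetical; the nn/yy values from the proposal are left as configurable parameters:

```python
import time


class ChunkedBuffer:
    """Aggregate records in memory; flush to a sink when the chunk
    reaches max_entries or is older than max_age seconds.

    Illustrative sketch of the chunked-offload idea, not pyspider code.
    """

    def __init__(self, sink, max_entries=100, max_age=5.0,
                 clock=time.monotonic):
        self.sink = sink              # callable receiving a list of records
        self.max_entries = max_entries
        self.max_age = max_age
        self.clock = clock
        self.buf = []
        self.started = clock()

    def add(self, record):
        self.buf.append(record)
        too_big = len(self.buf) >= self.max_entries
        too_old = self.clock() - self.started >= self.max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.buf:
            self.sink(self.buf)       # one POST/RPUSH per chunk, not per record
            self.buf = []
        self.started = self.clock()
```

The point of the design is that the hot path (add) is a list append; network I/O happens once per chunk, so instrumenting the scheduler or fetcher stays cheap.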
I could work with you on making the initial plugin, but I don't have enough Python skills to build it myself with support for all the modes you currently have (i.e. standalone vs. threaded vs. Docker containers), since in threaded mode you need to be well versed in Python to know how to measure resources across multiple threads originating from the same PID.
Alex
Hi Alex,
Thanks for the comment.
For the current stage of pyspider, you can find the following information on the dashboard:
Use assert in your script to detect template changes.
With the design of pyspider, you can run as many fetcher/processor/result_worker instances as you need.
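The assert-based check works because pyspider marks a task as failed when the parse callback raises, so template drift shows up on the dashboard instead of producing silently empty results. A minimal sketch, where the plain `doc` dict stands in for a parsed response (not the real response.doc object):

```python
def detail_page(doc):
    """Hypothetical parse callback; `doc` stands in for a parsed page."""
    title = doc.get("title")
    # If the site template changed and the selector no longer matches,
    # fail loudly so the task is flagged as failed, rather than
    # quietly saving empty results.
    assert title, "template changed: title selector matched nothing"
    return {"title": title}
```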
For other resources, such as the message queue: from my experience, aside from latency, pyspider should not be able to push it to its limit, and the latency can be solved with more instances.
The database could be the bottleneck. I'm not talking about connection time, which can be solved with more result_workers or more threads in the scheduler; I'm talking about queries/inserts/updates against the database. In the current implementation of the scheduler, every task operation (including restart checks, inserts, task start, setting finish status, etc.) is executed directly against the database.
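To illustrate why per-operation writes hurt, compare them with batching updates into a single transaction. This is only a sketch of the general technique, not how pyspider's scheduler is actually written, and the tasks schema here is invented for the example:

```python
import sqlite3


def apply_batched(conn, updates):
    """Apply many (status, taskid) updates in one transaction,
    instead of one round trip and one commit per task operation."""
    with conn:  # single transaction for the whole batch
        conn.executemany(
            "UPDATE tasks SET status = ? WHERE taskid = ?", updates)


# Illustrative schema, not pyspider's real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (taskid TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?)",
    [("t1", "active"), ("t2", "active"), ("t3", "active")])
apply_batched(conn, [("success", "t1"), ("failed", "t2")])
```

With thousands of task-status changes per second, collapsing them into periodic batches trades a little status latency for far fewer database round trips, which is exactly where a scheduler at 1.3M pages/day feels the pressure.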
So, my suggestion is:
self.crawl
If you think anything important should be included in the log, don't hesitate to open a request. And if you need any help with your running internal system, you can contact me personally as well.
Do you have any plans for adding any sort of resource utilization visualization tools?
This is a similar project, and they have Kafka and Grafana integrated very nicely: https://github.com/estin/pomp-craigslist-example https://drive.google.com/file/d/0BzRf6g_VWuIjZDUxMGc1Q1ZScFk/view?usp=sharing https://bitbucket.org/estin/pomp/src/tip/examples/
Is this something that you have planned, or is it not planned for any time soon?