laapsaap opened this issue 9 years ago
What is the status of the message queues (check the message count of each queue via the RabbitMQ web interface) once the crawl rate becomes stable?
Not sure what you mean.
If I check overview statistics
Queued Message (chart last minute) avg 150msg Ready
Message Rates (chart last minute) Publish 232/s Get (noack) 190/s
All 5 queues show state running. My crawl rate is 200.0/200.0, but my new tasks for 5m are 16800. So 16800/300 = 56/s.
Please change --queue-maxsize to a higher value (the default is 100; change it to 1000 or 10000, but don't change it to 0):
http://docs.pyspider.org/en/latest/Command-Line/#-queue-maxsize
--queue-maxsize limits the size of the message queues between components. When a queue is full, the upstream component will wait a few seconds for the downstream component.
I added "queue-maxsize": 1000 to config.json, but performance is the same; not much changed.
What about the RabbitMQ statistics? Is it still 150 msg ready?
You need to change the config on every component instance on all of the servers.
All my pyspider processes access the same config.json (a shared file).
Can you show me this panel:
200.0/200.0 rate 5m: 17327 all: 34.9% pending 21mil
Try changing "queue-maxsize" to 0 to find out which queue is backing up. I will run some tests to find the bottleneck too.
After some tests, I found that the maximum publish (put) rate to RabbitMQ with the current message queue implementation, which uses pika.BlockingConnection, is about 290/s (btw, amqp is much faster than pika: 9000/s; it's the default under Python 3). Adding the cost of the database (the scheduler queries the database for the details of each task and puts it on the message queue), I think 80/s is explainable.
I will do more tests and look for a solution.
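The put-rate comparison can be reproduced with a small timing harness. Below is a sketch that measures puts per second against Python's in-process queue as a baseline; swapping in an instance of pyspider's PikaQueue or AmqpQueue (which requires a running RabbitMQ broker) is what reproduces the ~290/s vs ~9000/s gap:

```python
import time
try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2

def measure_put_rate(q, n=10000, payload=b'x' * 100):
    """Push n messages into any object with .put() and return puts/second."""
    start = time.time()
    for _ in range(n):
        q.put(payload)
    elapsed = time.time() - start
    return n / elapsed if elapsed > 0 else float('inf')

# Baseline: the in-process queue. Replace queue.Queue() with a connected
# PikaQueue/AmqpQueue instance to measure the broker-backed rates.
print('%.0f puts/s' % measure_put_rate(queue.Queue()))
```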
Yeah, I have been using queue-maxsize 0 for a few days and it's still 15500/5m. It's running quite stably at about 80 million crawled pages. But this is too slow; I will spend some time today to see if I can improve it.
If you are using Python 2, try changing Queue = PikaQueue to Queue = AmqpQueue to force pyspider to use libamqp when connecting to RabbitMQ; it's faster in my tests.
NOTE: delete all the queues via the web UI before restarting; the default parameters of amqp differ from pika's.
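Schematically, the switch is a one-line alias change in pyspider/libs/rabbitmq.py. The stub classes below only stand in for the real pika- and amqp-backed implementations, to show the shape of the edit:

```python
# Sketch of the alias switch in pyspider/libs/rabbitmq.py; the real
# classes wrap pika.BlockingConnection and the amqp (libamqp) client.
class PikaQueue(object):
    """Queue backed by pika (stub for illustration)."""

class AmqpQueue(PikaQueue):
    """Queue backed by libamqp (stub for illustration)."""

# Before: Queue = PikaQueue
Queue = AmqpQueue  # the rest of pyspider imports `Queue` from this module
```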
I am using Python 2, so after switching the queue to AmqpQueue it went from 15500/5m to 21500/5m. A small bump in performance.
But right now I am going through a lot of ignored tasks, so I can't measure the real performance increase.
I do see this happening sometimes now:
[E 150303 13:35:18 scheduler:405] Socket closed
Traceback (most recent call last):
File "/opt/pyspider/pyspider/scheduler/scheduler.py", line 400, in run
self.run_once()
File "/opt/pyspider/pyspider/scheduler/scheduler.py", line 385, in run_once
self._check_task_done()
File "/opt/pyspider/pyspider/scheduler/scheduler.py", line 206, in _check_task_done
task = self.status_queue.get_nowait()
File "/opt/pyspider/pyspider/libs/rabbitmq.py", line 31, in wrap
return func(self, *args, **kwargs)
File "/opt/pyspider/pyspider/libs/rabbitmq.py", line 246, in get_nowait
message = self.channel.basic_get(self.name, not ack)
File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1954, in basic_get
(60, 72), # Channel.basic_get_empty
File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait
self.channel_id, allowed_methods)
File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 240, in _wait_method
self.method_reader.read_method()
File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
raise m
IOError: Socket closed
[E 150303 13:35:19 rabbitmq:39] RabbitMQ error: error(32, 'Broken pipe'), reconnect.
[E 150303 13:35:19 rabbitmq:39] RabbitMQ error: error(32, 'Broken pipe'), reconnect.
I ran more taskdb tests; all tests were run in a Vagrant CentOS box.
mysql+taskdb://127.0.0.1:3306/taskdb
[I 150304 11:22:47 bench:60] taskdb insert 1000
[I 150304 11:22:48 bench:70] cost 1.63s, 614.96/s 1.63ms
[I 150304 11:22:48 bench:73] taskdb update 1000
[I 150304 11:22:50 bench:83] cost 1.77s, 564.67/s 1.77ms
[I 150304 11:22:50 bench:97] taskdb get 1000 randomly
[I 150304 11:22:51 bench:111] cost 1.07s, 936.03/s 1.07ms
[I 150304 11:22:51 bench:60] taskdb insert 10000
[I 150304 11:23:08 bench:70] cost 16.62s, 601.67/s 1.66ms
[I 150304 11:23:08 bench:73] taskdb update 10000
[I 150304 11:23:26 bench:83] cost 18.11s, 552.24/s 1.81ms
[I 150304 11:23:26 bench:97] taskdb get 10000 randomly
[I 150304 11:23:37 bench:111] cost 10.78s, 927.44/s 1.08ms
sqlalchemy+mysql+mysqlconnector+taskdb://root@127.0.0.1:3306/taskdb
[I 150304 11:42:07 bench:60] taskdb insert 1000
[I 150304 11:42:08 bench:70] cost 1.16s, 859.96/s 1.16ms
[I 150304 11:42:08 bench:73] taskdb update 1000
[I 150304 11:42:09 bench:83] cost 1.58s, 633.41/s 1.58ms
[I 150304 11:42:09 bench:97] taskdb get 1000 randomly
[I 150304 11:42:11 bench:111] cost 1.96s, 510.25/s 1.96ms
[I 150304 11:42:11 bench:60] taskdb insert 10000
[I 150304 11:42:22 bench:70] cost 11.14s, 898.01/s 1.11ms
[I 150304 11:42:22 bench:73] taskdb update 10000
[I 150304 11:42:37 bench:83] cost 14.94s, 669.31/s 1.49ms
[I 150304 11:42:37 bench:97] taskdb get 10000 randomly
[I 150304 11:42:57 bench:111] cost 19.35s, 516.71/s 1.94ms
sqlalchemy+mysql+mysqldb+taskdb://root@127.0.0.1:3306/taskdb
[I 150304 11:40:38 bench:60] taskdb insert 1000
[I 150304 11:40:38 bench:70] cost 0.92s, 1086.25/s 0.92ms
[I 150304 11:40:38 bench:73] taskdb update 1000
[I 150304 11:40:40 bench:83] cost 1.22s, 817.67/s 1.22ms
[I 150304 11:40:40 bench:97] taskdb get 1000 randomly
[I 150304 11:40:41 bench:111] cost 1.50s, 666.52/s 1.50ms
[I 150304 11:40:41 bench:60] taskdb insert 10000
[I 150304 11:40:50 bench:70] cost 9.14s, 1094.53/s 0.91ms
[I 150304 11:40:50 bench:73] taskdb update 10000
[I 150304 11:41:03 bench:83] cost 12.54s, 797.42/s 1.25ms
[I 150304 11:41:03 bench:97] taskdb get 10000 randomly
[I 150304 11:41:19 bench:111] cost 15.77s, 634.21/s 1.58ms
mongodb+taskdb://127.0.0.1:27017/taskdb
[I 150304 11:25:52 bench:60] taskdb insert 1000
[I 150304 11:25:52 bench:70] cost 0.60s, 1668.70/s 0.60ms
[I 150304 11:25:52 bench:73] taskdb update 1000
[I 150304 11:25:53 bench:83] cost 0.49s, 2049.47/s 0.49ms
[I 150304 11:25:53 bench:97] taskdb get 1000 randomly
[I 150304 11:25:54 bench:111] cost 0.95s, 1050.10/s 0.95ms
[I 150304 11:25:54 bench:60] taskdb insert 10000
[I 150304 11:26:24 bench:70] cost 30.46s, 328.33/s 3.05ms
[I 150304 11:26:24 bench:73] taskdb update 10000
[I 150304 11:26:36 bench:83] cost 11.47s, 872.05/s 1.15ms
[I 150304 11:26:36 bench:97] taskdb get 10000 randomly
[I 150304 11:27:13 bench:111] cost 37.16s, 269.13/s 3.72ms
sqlalchemy+postgresql+taskdb://postgres@127.0.0.1:5432/taskdb
[I 150304 11:51:07 bench:60] taskdb insert 1000
[I 150304 11:51:09 bench:70] cost 1.40s, 716.79/s 1.40ms
[I 150304 11:51:09 bench:73] taskdb update 1000
[I 150304 11:51:10 bench:83] cost 1.72s, 579.79/s 1.72ms
[I 150304 11:51:10 bench:97] taskdb get 1000 randomly
[I 150304 11:51:12 bench:111] cost 1.58s, 632.50/s 1.58ms
[I 150304 11:51:12 bench:60] taskdb insert 10000
[I 150304 11:51:26 bench:70] cost 14.38s, 695.59/s 1.44ms
[I 150304 11:51:26 bench:73] taskdb update 10000
[I 150304 11:51:44 bench:83] cost 17.97s, 556.41/s 1.80ms
[I 150304 11:51:44 bench:97] taskdb get 10000 randomly
[I 150304 11:52:00 bench:111] cost 16.11s, 620.56/s 1.61ms
sqlite+taskdb:///data/taskdb
[I 150304 11:28:20 bench:60] taskdb insert 1000
[I 150304 11:28:25 bench:70] cost 4.55s, 219.62/s 4.55ms
[I 150304 11:28:25 bench:73] taskdb update 1000
[I 150304 11:28:30 bench:83] cost 4.57s, 218.83/s 4.57ms
[I 150304 11:28:30 bench:97] taskdb get 1000 randomly
[I 150304 11:28:31 bench:111] cost 0.95s, 1050.34/s 0.95ms
[I 150304 11:28:31 bench:60] taskdb insert 10000
[I 150304 11:29:16 bench:70] cost 45.24s, 221.05/s 4.52ms
[I 150304 11:29:16 bench:73] taskdb update 10000
[I 150304 11:30:04 bench:83] cost 47.96s, 208.52/s 4.80ms
[I 150304 11:30:04 bench:97] taskdb get 10000 randomly
[I 150304 11:30:13 bench:111] cost 9.64s, 1037.19/s 0.96ms
sqlite+taskdb://
[I 150304 11:31:07 bench:60] taskdb insert 1000
[I 150304 11:31:07 bench:70] cost 0.12s, 8385.94/s 0.12ms
[I 150304 11:31:07 bench:73] taskdb update 1000
[I 150304 11:31:08 bench:83] cost 0.22s, 4640.29/s 0.22ms
[I 150304 11:31:08 bench:97] taskdb get 1000 randomly
[I 150304 11:31:08 bench:111] cost 0.44s, 2258.38/s 0.44ms
[I 150304 11:31:08 bench:60] taskdb insert 10000
[I 150304 11:31:09 bench:70] cost 1.20s, 8316.93/s 0.12ms
[I 150304 11:31:09 bench:73] taskdb update 10000
[I 150304 11:31:11 bench:83] cost 2.17s, 4606.31/s 0.22ms
[I 150304 11:31:11 bench:97] taskdb get 10000 randomly
[I 150304 11:31:16 bench:111] cost 4.66s, 2143.95/s 0.47ms
"python run.py -c config.json bench"
[I 150306 10:09:24 bench:60] taskdb insert 1000
[I 150306 10:09:31 bench:70] cost 6.72s, 148.82/s 6.72ms
[I 150306 10:09:31 bench:73] taskdb update 1000
[I 150306 10:09:38 bench:83] cost 6.83s, 146.48/s 6.83ms
[I 150306 10:09:38 bench:97] taskdb get 1000 randomly
[I 150306 10:09:39 bench:111] cost 1.16s, 858.78/s 1.16ms
[I 150306 10:09:39 bench:60] taskdb insert 10000
[I 150306 10:10:41 bench:70] cost 62.45s, 160.13/s 6.24ms
[I 150306 10:10:41 bench:73] taskdb update 10000
[I 150306 10:11:45 bench:83] cost 63.50s, 157.47/s 6.35ms
[I 150306 10:11:45 bench:97] taskdb get 10000 randomly
[I 150306 10:11:57 bench:111] cost 11.81s, 846.97/s 1.18ms
[I 150306 10:11:57 bench:143] message queue put 1000
[I 150306 10:11:57 bench:152] cost 0.31s, 3187.43/s 0.31ms
[I 150306 10:11:57 bench:155] message queue get 1000
Definitely a bottleneck with inserts. I assume MySQL is the bottleneck; I can't imagine RabbitMQ being this slow.
So the bottleneck was InnoDB inserts/updates. I changed to TokuDB as the backend engine and turned off the commit log's sync. Performance increased a lot. The problem with InnoDB is that at some point the results table for a project grows beyond XX gb and the server keeps wanting more memory. At some point, for big crawls, there is just no memory left.
[I 150306 15:31:00 bench:60] taskdb insert 1000
[I 150306 15:31:01 bench:70] cost 0.94s, 1060.44/s 0.94ms
[I 150306 15:31:01 bench:73] taskdb update 1000
[I 150306 15:31:02 bench:83] cost 0.85s, 1179.29/s 0.85ms
[I 150306 15:31:02 bench:97] taskdb get 1000 randomly
[I 150306 15:31:03 bench:111] cost 1.11s, 897.00/s 1.11ms
[I 150306 15:31:03 bench:60] taskdb insert 10000
[I 150306 15:31:11 bench:70] cost 7.58s, 1319.97/s 0.76ms
[I 150306 15:31:11 bench:73] taskdb update 10000
[I 150306 15:31:19 bench:83] cost 8.40s, 1190.91/s 0.84ms
[I 150306 15:31:19 bench:97] taskdb get 10000 randomly
[I 150306 15:31:30 bench:111] cost 11.40s, 876.87/s 1.14ms
[I 150306 15:31:31 bench:143] message queue put 1000
[I 150306 15:31:31 bench:152] cost 0.36s, 2777.79/s 0.36ms
[I 150306 15:31:31 bench:155] message queue get 1000
For the database: the current scheduler implementation gets every task by key, one by one, and feeds it to the message queue. Performance could be better if we selected multiple items at once, cached some hot items in memory, or used multiple threads. In your case you added the tasks from the seed page all at once, so the get should be the bottleneck.
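The batch-and-cache idea could look like the sketch below. Note that get_batch is a hypothetical bulk API: pyspider's taskdb only exposes per-key lookups, which is part of the bottleneck being described.

```python
def feed_tasks(taskdb, out_queue, taskids, batch_size=100, cache=None):
    """Fetch task details in batches instead of one key per round trip,
    keeping a hot-item cache, then feed them to the message queue."""
    cache = cache if cache is not None else {}
    for i in range(0, len(taskids), batch_size):
        batch = taskids[i:i + batch_size]
        missing = [t for t in batch if t not in cache]
        if missing:
            # one database round trip per batch (hypothetical bulk API)
            cache.update(taskdb.get_batch(missing))
        for taskid in batch:
            out_queue.put(cache[taskid])

# Demo with stubs; a real taskdb would hit MySQL once per batch.
class _StubDB(object):
    def get_batch(self, ids):
        return dict((i, {'taskid': i}) for i in ids)

class _ListQueue(list):
    put = list.append

demo_out = _ListQueue()
feed_tasks(_StubDB(), demo_out, ['a', 'b', 'c'], batch_size=2)
```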
For RabbitMQ: the implementation uses RabbitMQ's basic_publish and basic_get APIs to imitate the built-in Queue. Performance should be better with the publisher/consumer pattern.
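The gap can be modeled in-process: polling basic_get pays a round trip per message, while a consumer (pika's basic_consume is the real-world analogue) has messages pushed to a callback. A minimal sketch of the callback pattern using Python threads and the built-in queue:

```python
import threading
import time
try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2

def consume(q, on_message, stop):
    """Push-style consumer: block on the queue and hand each message to a
    callback, instead of the caller polling get_nowait() per message."""
    while not stop.is_set():
        try:
            msg = q.get(timeout=0.1)
        except queue.Empty:
            continue
        on_message(msg)

received = []
stop = threading.Event()
q = queue.Queue()
worker = threading.Thread(target=consume, args=(q, received.append, stop))
worker.start()
for i in range(5):
    q.put(i)
while len(received) < 5:  # wait for the consumer to drain the queue
    time.sleep(0.01)
stop.set()
worker.join()
```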
Very sick performance right now.
Committing to the database on every insert is another weak point.
Yeah, transactions/batched commits would definitely improve performance. Right now this is an empty database, so performance is still quite high. In my tests performance decreases over time as the database grows, due to what you just said about the current implementation. I tried different tunings with InnoDB and TokuDB; ~300/s is my maximum, while the benchmark does 1000+/s.
The scheduler needs a get from the message queue plus a get from the database to check whether the task has already been crawled, and then an insert into the database. So it's slower than the bare inserts in the benchmark.
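Batching commits is easy to demonstrate with any SQL backend; below is a sketch using sqlite3 as a stand-in for MySQL (the table layout is illustrative, not pyspider's actual schema):

```python
import sqlite3

def insert_tasks(conn, rows, batch_commit=True):
    """Insert rows, committing once per batch instead of once per row."""
    cur = conn.cursor()
    if batch_commit:
        cur.executemany('INSERT INTO taskdb (taskid, url) VALUES (?, ?)', rows)
        conn.commit()  # one commit (and one log sync) for the whole batch
    else:
        for row in rows:
            cur.execute('INSERT INTO taskdb (taskid, url) VALUES (?, ?)', row)
            conn.commit()  # one commit per row: the weak point above

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE taskdb (taskid TEXT, url TEXT)')
insert_tasks(conn, [('t%d' % i, 'http://example.com/%d' % i)
                    for i in range(1000)])
print(conn.execute('SELECT COUNT(*) FROM taskdb').fetchone()[0])  # 1000
```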
How is it possible that the 1d count is higher than the all count? I am getting good results replacing InnoDB with TokuDB: lower memory and disk space usage.
The `1d` counter is the number of events, while `all` is the number of tasks.
Mouse over the progress bar and you will see the different wording in the details.
I am getting a steady ~75000/5m nowadays. I tested Redis today; for me the performance was the same as with RabbitMQ.
We all know the scheduler is the bottleneck, but I am not entirely sure why.
My database can deal with the qps easily; it's about 700 qps at low load. My MySQL database is heavily tweaked for pyspider.
I run the processors, fetchers, and RabbitMQ on different servers connected over a 1 Gbit LAN. All of them show low load and low I/O.
Then there's the one scheduler, consistently at about 60% CPU. So what is the bottleneck here? CPU? I find it weird that the scheduler process is not maxing out a CPU thread.
I have tried everything. I'm thinking about running the scheduler on an even more powerful CPU to see how much it differs. I also wonder whether MongoDB would increase performance at all.
For a new request operation (the blue part), it needs 1 queue.get + 1 database.get + 1 database.insert/update. With your benchmark numbers that is approximately 0.36ms + 1.14ms + 0.76ms = 2.26ms = 442 qps. (NOTE: requests still within their age, which would not be recrawled, are not counted.)
For a successful task operation (the green part), it needs 1 database.get + 1 queue.put + 1 queue.get + 1 database.update: 1.14ms + 0.31ms + 0.36ms + 0.76ms = 2.57ms = 389 qps. (And it's limited to 200?)
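The arithmetic above, spelled out: with all operations serialized in one thread, the summed latency caps the rate.

```python
def qps(*latencies_ms):
    """Serial operations: the total per-task latency caps the
    single-threaded rate at 1000 / total_ms requests per second."""
    return 1000.0 / sum(latencies_ms)

# new request: queue.get + db.get + db.insert/update
print(round(qps(0.36, 1.14, 0.76)))        # -> 442
# successful task: db.get + queue.put + queue.get + db.update
print(round(qps(1.14, 0.31, 0.36, 0.76)))  # -> 389
```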
As I replied in https://github.com/binux/pyspider/issues/181#issuecomment-93629437, the scheduler is single-threaded; it has to handle tasks one by one, and most of the time it's waiting for the database or the message queue to respond. I have de-duplicated identical tasks within one loop to reduce the number of tasks, and will try multiple threads or multiple processes to improve performance.
What is limited to 200? I can improve cps (crawls per second) considerably by using multiple schedulers. It's not that hard: you only need to figure out, in the project code, which links to save into a different database or table, and feed those links to a different scheduler. It does make the project code a bit more complicated because you take control of the flow.
Would it be an idea to be able to spawn multiple schedulers and assign them to a project instead?
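The link partitioning described can be a stable hash on the host, so the same site always lands on the same scheduler (a sketch; how the partition index maps to a queue or table is up to your project code):

```python
import zlib
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

def scheduler_for(url, n_schedulers):
    """Stable host-based partition: one site never spans two schedulers,
    so per-site rate limiting stays inside a single scheduler."""
    host = urlparse(url).netloc
    return zlib.crc32(host.encode('utf-8')) % n_schedulers
```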
It is limited to 200 here: https://github.com/binux/pyspider/issues/138#issuecomment-78235424
It's not that easy to use multiple schedulers. Even once you have figured out which links go to which database, a dispatcher is needed to distribute the links to the corresponding scheduler.
One more thing: the priority task queue and the traffic control (token bucket) are not easy to split across multiple schedulers, especially splitting a single project across multiple schedulers. So multi-threading is much easier to control, with only the database work done in parallel.
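For reference, the traffic control mentioned is a classic token bucket; a minimal sketch below (pyspider's actual implementation lives in its scheduler and differs in detail):

```python
import time

class TokenBucket(object):
    """Allow `rate` requests/second with bursts of up to `burst` tokens."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)
        self.last = time.time()

    def consume(self, n=1):
        now = time.time()
        # refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True   # request may proceed
        return False      # over the rate limit: the task stays queued
```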
Oh yeah, I increased the crawl/burst rate to 500/500 nowadays, but only at the start of a new project do I reach 100K+/5m; afterwards it stabilizes at ~75K.
With a single thread, I think it's hard to improve... BTW: what's your expected QPS?
I just like pushing limits, but practically it would be nice to have a steady 90K/5m, so about 300 qps. Theoretically, though, there shouldn't be a limit :)
Indeed, it's hard to increase performance in its current state. Right now the only way for me is to run multiple pyspider instances. But this is not for everyone; many sites can't handle 300+ hits/sec (DoS) and a proxy list will get saturated very quickly.
I will spend some time in the code (the scheduler) soon.
I am trying to increase the crawl rate of pyspider, but I just can't get it above 80/sec.
My setup is: 1 server with scheduler, result_worker, webui, and processor (low CPU load, 70% memory usage); 3 servers with fetchers (low load); 3 RabbitMQ servers in a cluster, load-balanced using HAProxy (low load); 1 MySQL server (high spec; low CPU load, 70% memory usage).
I checked all the servers, but I can't find the bottleneck. If I set the crawl and burst rate to 150/s, I don't see any improvement in the crawl rate.
Any idea how I can improve the crawl rate? 80/s is just too slow for the number of servers I have deployed, and scaling the fetchers horizontally doesn't affect performance.