binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

Crawl rate bottleneck #138

Open · laapsaap opened this issue 9 years ago

laapsaap commented 9 years ago

I am trying to increase the crawl rate of pyspider, but I just can't get it above 80/sec.

My setup:

- 1 server with the scheduler, result_worker, webui and processor (low CPU load, 70% memory usage)
- 3 servers with fetchers (low load)
- 3x RabbitMQ servers in a cluster, load-balanced using HAProxy (low load)
- 1 MySQL server (high spec, low CPU load, 70% memory usage)

I checked all the servers, but I can't find the bottleneck. If I set the crawl and burst rate to 150/s, I don't see any improvement in the crawl rate.

Any idea how I can improve the crawl rate? 80/s is just too slow for the number of servers I have deployed, and scaling the fetchers horizontally doesn't affect performance.

binux commented 9 years ago

What is the status of the message queues (check the message count of each queue via the RabbitMQ web interface) when the crawl rate becomes stable?

laapsaap commented 9 years ago

Not sure what you mean.

If I check the overview statistics:

Queued messages (chart, last minute): avg 150 msg ready

Message rates (chart, last minute): publish 232/s, get (noack) 190/s

All 5 queues' state is "running". My crawl rate is set to 200.0/200.0, but new tasks over the last 5m is 16800, so 16800/300 = 56/s.

binux commented 9 years ago

Please change --queue-maxsize to a higher value (the default is 100; change it to 1000 or 10000, but don't change it to 0): http://docs.pyspider.org/en/latest/Command-Line/#-queue-maxsize

--queue-maxsize limits the size of the message queues between components. When a queue is full, the upstream component will wait a few seconds for the downstream component.
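For example, in config.json (a minimal sketch; the taskdb/resultdb/message_queue URLs below are placeholders for your own setup, and the key applies to every component that reads this file):

    {
        "taskdb": "mysql+taskdb://user:password@mysql-host:3306/taskdb",
        "resultdb": "mysql+resultdb://user:password@mysql-host:3306/resultdb",
        "message_queue": "amqp://guest:guest@rabbitmq-host:5672/%2F",
        "queue-maxsize": 10000
    }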

laapsaap commented 9 years ago

I added "queue-maxsize": 1000 to config.json. But performance is the same, not much happened.

binux commented 9 years ago

How about the RabbitMQ statistics? Is it still 150 messages ready?

You need to change the config for every component instance on all of the servers.

laapsaap commented 9 years ago

image

All my pyspider processes access the same config.json (a shared file).

binux commented 9 years ago

Can you show me this panel:

screenshot 2015-02-24 00 16 15

laapsaap commented 9 years ago

image

200.0/200.0 rate 5m: 17327 all: 34.9% pending 21mil

binux commented 9 years ago

Try changing "queue-maxsize" to 0 to find out which queue is backed up. I will do some tests to find the bottleneck too.

binux commented 9 years ago

After some tests, I found that the maximum publish (put) rate of the current RabbitMQ message queue implementation, which uses pika.BlockingConnection, is about 290/s (btw, amqp is much faster than pika: 9000/s; it's the default under Python 3). Add the cost of the database (the scheduler queries the database for the details of each task and puts it into the message queue), and I think 80/s is explainable.
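If you want to reproduce that kind of number, here is a rough sketch of the test (it assumes a local RabbitMQ and the pika package; the queue name is arbitrary):

    # Rough micro-benchmark of publish throughput via pika.BlockingConnection.
    # Results depend on broker settings, network and pika version.
    import time
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='bench_queue')

    n = 1000
    start = time.time()
    for i in range(n):
        channel.basic_publish(exchange='', routing_key='bench_queue', body='task %d' % i)
    elapsed = time.time() - start
    print('%d publishes in %.2fs => %.0f msg/s' % (n, elapsed, n / elapsed))

    connection.close()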

screenshot 2015-02-27 10 59 43

I will do more tests and look for a solution.

laapsaap commented 9 years ago

Yeah, I have been running with queue-maxsize 0 for a few days and it's still about 15500/5m. It's running quite stably at about 80 million crawled pages, but this is too slow; I will spend some time today to see if I can improve it.

binux commented 9 years ago

If you are using Python 2, try changing Queue = PikaQueue to Queue = AmqpQueue to force pyspider to use the amqp library when connecting to RabbitMQ; it's faster in my tests.

NOTE: delete all the queues via the RabbitMQ web UI before restarting; the default queue parameters of amqp are different from pika's.
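Roughly, it's a one-line change near the bottom of pyspider/libs/rabbitmq.py (the exact surrounding code may differ between versions):

    # pyspider/libs/rabbitmq.py (end of the module)
    # Before: the pika-based queue is exported by default.
    # Queue = PikaQueue

    # After: export the py-amqp based queue instead (Python 2 only).
    Queue = AmqpQueue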

laapsaap commented 9 years ago

I am using Python 2, so after changing the queue to AmqpQueue it went from 15500/5m to 21500/5m. A small bump in performance.

But now I am going through a lot of ignored tasks, so I can't gauge the real performance increase.

I do see this happening sometimes now:

[E 150303 13:35:18 scheduler:405] Socket closed
    Traceback (most recent call last):
      File "/opt/pyspider/pyspider/scheduler/scheduler.py", line 400, in run
        self.run_once()
      File "/opt/pyspider/pyspider/scheduler/scheduler.py", line 385, in run_once
        self._check_task_done()
      File "/opt/pyspider/pyspider/scheduler/scheduler.py", line 206, in _check_task_done
        task = self.status_queue.get_nowait()
      File "/opt/pyspider/pyspider/libs/rabbitmq.py", line 31, in wrap
        return func(self, *args, **kwargs)
      File "/opt/pyspider/pyspider/libs/rabbitmq.py", line 246, in get_nowait
        message = self.channel.basic_get(self.name, not ack)
      File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1954, in basic_get
        (60, 72),  # Channel.basic_get_empty
      File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait
        self.channel_id, allowed_methods)
      File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 240, in _wait_method
        self.method_reader.read_method()
      File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
        raise m
    IOError: Socket closed
[E 150303 13:35:19 rabbitmq:39] RabbitMQ error: error(32, 'Broken pipe'), reconnect.
[E 150303 13:35:19 rabbitmq:39] RabbitMQ error: error(32, 'Broken pipe'), reconnect.
binux commented 9 years ago

I made more taskdb tests; all tests were run in a Vagrant CentOS box.

mysql 5.1.73 + mysql-connector: mysql+taskdb://127.0.0.1:3306/taskdb

    [I 150304 11:22:47 bench:60] taskdb insert 1000
    [I 150304 11:22:48 bench:70] cost 1.63s, 614.96/s 1.63ms
    [I 150304 11:22:48 bench:73] taskdb update 1000
    [I 150304 11:22:50 bench:83] cost 1.77s, 564.67/s 1.77ms
    [I 150304 11:22:50 bench:97] taskdb get 1000 randomly
    [I 150304 11:22:51 bench:111] cost 1.07s, 936.03/s 1.07ms
    [I 150304 11:22:51 bench:60] taskdb insert 10000
    [I 150304 11:23:08 bench:70] cost 16.62s, 601.67/s 1.66ms
    [I 150304 11:23:08 bench:73] taskdb update 10000
    [I 150304 11:23:26 bench:83] cost 18.11s, 552.24/s 1.81ms
    [I 150304 11:23:26 bench:97] taskdb get 10000 randomly
    [I 150304 11:23:37 bench:111] cost 10.78s, 927.44/s 1.08ms

mysql 5.1.73 + sqlalchemy + mysql-connector: sqlalchemy+mysql+mysqlconnector+taskdb://root@127.0.0.1:3306/taskdb

    [I 150304 11:42:07 bench:60] taskdb insert 1000
    [I 150304 11:42:08 bench:70] cost 1.16s, 859.96/s 1.16ms
    [I 150304 11:42:08 bench:73] taskdb update 1000
    [I 150304 11:42:09 bench:83] cost 1.58s, 633.41/s 1.58ms
    [I 150304 11:42:09 bench:97] taskdb get 1000 randomly
    [I 150304 11:42:11 bench:111] cost 1.96s, 510.25/s 1.96ms
    [I 150304 11:42:11 bench:60] taskdb insert 10000
    [I 150304 11:42:22 bench:70] cost 11.14s, 898.01/s 1.11ms
    [I 150304 11:42:22 bench:73] taskdb update 10000
    [I 150304 11:42:37 bench:83] cost 14.94s, 669.31/s 1.49ms
    [I 150304 11:42:37 bench:97] taskdb get 10000 randomly
    [I 150304 11:42:57 bench:111] cost 19.35s, 516.71/s 1.94ms

mysql 5.1.73 + sqlalchemy + mysqldb: sqlalchemy+mysql+mysqldb+taskdb://root@127.0.0.1:3306/taskdb

    [I 150304 11:40:38 bench:60] taskdb insert 1000
    [I 150304 11:40:38 bench:70] cost 0.92s, 1086.25/s 0.92ms
    [I 150304 11:40:38 bench:73] taskdb update 1000
    [I 150304 11:40:40 bench:83] cost 1.22s, 817.67/s 1.22ms
    [I 150304 11:40:40 bench:97] taskdb get 1000 randomly
    [I 150304 11:40:41 bench:111] cost 1.50s, 666.52/s 1.50ms
    [I 150304 11:40:41 bench:60] taskdb insert 10000
    [I 150304 11:40:50 bench:70] cost 9.14s, 1094.53/s 0.91ms
    [I 150304 11:40:50 bench:73] taskdb update 10000
    [I 150304 11:41:03 bench:83] cost 12.54s, 797.42/s 1.25ms
    [I 150304 11:41:03 bench:97] taskdb get 10000 randomly
    [I 150304 11:41:19 bench:111] cost 15.77s, 634.21/s 1.58ms

mongodb v2.4.6 + pymongo: mongodb+taskdb://127.0.0.1:27017/taskdb

    [I 150304 11:25:52 bench:60] taskdb insert 1000
    [I 150304 11:25:52 bench:70] cost 0.60s, 1668.70/s 0.60ms
    [I 150304 11:25:52 bench:73] taskdb update 1000
    [I 150304 11:25:53 bench:83] cost 0.49s, 2049.47/s 0.49ms
    [I 150304 11:25:53 bench:97] taskdb get 1000 randomly
    [I 150304 11:25:54 bench:111] cost 0.95s, 1050.10/s 0.95ms
    [I 150304 11:25:54 bench:60] taskdb insert 10000
    [I 150304 11:26:24 bench:70] cost 30.46s, 328.33/s 3.05ms
    [I 150304 11:26:24 bench:73] taskdb update 10000
    [I 150304 11:26:36 bench:83] cost 11.47s, 872.05/s 1.15ms
    [I 150304 11:26:36 bench:97] taskdb get 10000 randomly
    [I 150304 11:27:13 bench:111] cost 37.16s, 269.13/s 3.72ms

postgresql 9.4 + psycopg2: sqlalchemy+postgresql+taskdb://postgres@127.0.0.1:5432/taskdb

    [I 150304 11:51:07 bench:60] taskdb insert 1000
    [I 150304 11:51:09 bench:70] cost 1.40s, 716.79/s 1.40ms
    [I 150304 11:51:09 bench:73] taskdb update 1000
    [I 150304 11:51:10 bench:83] cost 1.72s, 579.79/s 1.72ms
    [I 150304 11:51:10 bench:97] taskdb get 1000 randomly
    [I 150304 11:51:12 bench:111] cost 1.58s, 632.50/s 1.58ms
    [I 150304 11:51:12 bench:60] taskdb insert 10000
    [I 150304 11:51:26 bench:70] cost 14.38s, 695.59/s 1.44ms
    [I 150304 11:51:26 bench:73] taskdb update 10000
    [I 150304 11:51:44 bench:83] cost 17.97s, 556.41/s 1.80ms
    [I 150304 11:51:44 bench:97] taskdb get 10000 randomly
    [I 150304 11:52:00 bench:111] cost 16.11s, 620.56/s 1.61ms

sqlite on disk: sqlite+taskdb:///data/taskdb

    [I 150304 11:28:20 bench:60] taskdb insert 1000
    [I 150304 11:28:25 bench:70] cost 4.55s, 219.62/s 4.55ms
    [I 150304 11:28:25 bench:73] taskdb update 1000
    [I 150304 11:28:30 bench:83] cost 4.57s, 218.83/s 4.57ms
    [I 150304 11:28:30 bench:97] taskdb get 1000 randomly
    [I 150304 11:28:31 bench:111] cost 0.95s, 1050.34/s 0.95ms
    [I 150304 11:28:31 bench:60] taskdb insert 10000
    [I 150304 11:29:16 bench:70] cost 45.24s, 221.05/s 4.52ms
    [I 150304 11:29:16 bench:73] taskdb update 10000
    [I 150304 11:30:04 bench:83] cost 47.96s, 208.52/s 4.80ms
    [I 150304 11:30:04 bench:97] taskdb get 10000 randomly
    [I 150304 11:30:13 bench:111] cost 9.64s, 1037.19/s 0.96ms

sqlite on memory: sqlite+taskdb://

    [I 150304 11:31:07 bench:60] taskdb insert 1000
    [I 150304 11:31:07 bench:70] cost 0.12s, 8385.94/s 0.12ms
    [I 150304 11:31:07 bench:73] taskdb update 1000
    [I 150304 11:31:08 bench:83] cost 0.22s, 4640.29/s 0.22ms
    [I 150304 11:31:08 bench:97] taskdb get 1000 randomly
    [I 150304 11:31:08 bench:111] cost 0.44s, 2258.38/s 0.44ms
    [I 150304 11:31:08 bench:60] taskdb insert 10000
    [I 150304 11:31:09 bench:70] cost 1.20s, 8316.93/s 0.12ms
    [I 150304 11:31:09 bench:73] taskdb update 10000
    [I 150304 11:31:11 bench:83] cost 2.17s, 4606.31/s 0.22ms
    [I 150304 11:31:11 bench:97] taskdb get 10000 randomly
    [I 150304 11:31:16 bench:111] cost 4.66s, 2143.95/s 0.47ms

laapsaap commented 9 years ago

"python run.py -c config.json bench"

    [I 150306 10:09:24 bench:60] taskdb insert 1000
    [I 150306 10:09:31 bench:70] cost 6.72s, 148.82/s 6.72ms
    [I 150306 10:09:31 bench:73] taskdb update 1000
    [I 150306 10:09:38 bench:83] cost 6.83s, 146.48/s 6.83ms
    [I 150306 10:09:38 bench:97] taskdb get 1000 randomly
    [I 150306 10:09:39 bench:111] cost 1.16s, 858.78/s 1.16ms
    [I 150306 10:09:39 bench:60] taskdb insert 10000
    [I 150306 10:10:41 bench:70] cost 62.45s, 160.13/s 6.24ms
    [I 150306 10:10:41 bench:73] taskdb update 10000
    [I 150306 10:11:45 bench:83] cost 63.50s, 157.47/s 6.35ms
    [I 150306 10:11:45 bench:97] taskdb get 10000 randomly
    [I 150306 10:11:57 bench:111] cost 11.81s, 846.97/s 1.18ms
    [I 150306 10:11:57 bench:143] message queue put 1000
    [I 150306 10:11:57 bench:152] cost 0.31s, 3187.43/s 0.31ms
    [I 150306 10:11:57 bench:155] message queue get 1000

Definitely a bottleneck with inserts. I assume MySQL is the bottleneck; I can't imagine RabbitMQ being that slow.

laapsaap commented 9 years ago

So the bottleneck was InnoDB inserts/updates. I changed the backend engine to TokuDB and turned off the commit log's sync, and performance increased a lot. The problem with InnoDB is that at some point the project's results table grows over XX GB and the server keeps wanting more memory; at some point, for big crawls, there is just no memory left.

    [I 150306 15:31:00 bench:60] taskdb insert 1000
    [I 150306 15:31:01 bench:70] cost 0.94s, 1060.44/s 0.94ms
    [I 150306 15:31:01 bench:73] taskdb update 1000
    [I 150306 15:31:02 bench:83] cost 0.85s, 1179.29/s 0.85ms
    [I 150306 15:31:02 bench:97] taskdb get 1000 randomly
    [I 150306 15:31:03 bench:111] cost 1.11s, 897.00/s 1.11ms
    [I 150306 15:31:03 bench:60] taskdb insert 10000
    [I 150306 15:31:11 bench:70] cost 7.58s, 1319.97/s 0.76ms
    [I 150306 15:31:11 bench:73] taskdb update 10000
    [I 150306 15:31:19 bench:83] cost 8.40s, 1190.91/s 0.84ms
    [I 150306 15:31:19 bench:97] taskdb get 10000 randomly
    [I 150306 15:31:30 bench:111] cost 11.40s, 876.87/s 1.14ms
    [I 150306 15:31:31 bench:143] message queue put 1000
    [I 150306 15:31:31 bench:152] cost 0.36s, 2777.79/s 0.36ms
    [I 150306 15:31:31 bench:155] message queue get 1000
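For reference, the kind of change involved looks roughly like this (a hypothetical sketch: the table name is a placeholder, TokuDB must already be installed as a storage engine on the server, and disabling commit sync trades durability for throughput):

    # Hypothetical sketch: convert a pyspider table to TokuDB and relax commit sync.
    import MySQLdb

    conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='', db='taskdb')
    cur = conn.cursor()

    # Convert an existing InnoDB table to the TokuDB engine (per-project table name).
    cur.execute("ALTER TABLE `myproject` ENGINE=TokuDB")

    # Don't fsync the TokuDB log on every commit (roughly TokuDB's counterpart to
    # relaxing innodb_flush_log_at_trx_commit).
    cur.execute("SET GLOBAL tokudb_commit_sync = 0")

    conn.close()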

binux commented 9 years ago

For the database: the current implementation of the scheduler gets every task by key, one by one, and feeds it to the message queue. Performance could be better if we selected multiple items at once, cached some hot items in memory, or used multiple threads. As for your case, you added the tasks from the seed page all at once, so the get should be the bottleneck.
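Something like this, compared with one query per taskid (the table and column names below are illustrative, not pyspider's actual taskdb schema):

    # Fetch a batch of task records in one round trip instead of one query each.
    def get_many(cursor, taskids):
        placeholders = ', '.join(['%s'] * len(taskids))
        cursor.execute(
            "SELECT taskid, url, status FROM task WHERE taskid IN (%s)" % placeholders,
            taskids)
        return dict((row[0], row) for row in cursor.fetchall())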

For RabbitMQ: the implementation uses RabbitMQ's basic_publish and basic_get APIs to imitate the built-in Queue. Performance should be better using the publisher/consumer pattern.
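That is, something along these lines with pika (queue name and handler are illustrative; note that pika >= 1.0 uses basic_consume(queue=..., on_message_callback=...) instead of the older argument order shown here):

    # Consumer pattern: let RabbitMQ push messages to the client instead of
    # polling with basic_get, avoiding one synchronous round trip per message.
    import pika

    def on_task(ch, method, properties, body):
        handle_task(body)  # hypothetical handler for the received message
        ch.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='scheduler2fetcher')
    channel.basic_consume(on_task, queue='scheduler2fetcher')  # pika < 1.0 signature
    channel.start_consuming()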

laapsaap commented 9 years ago

image

Very sick performance right now.

binux commented 9 years ago

Committing to the database on every insert command is another weak point.
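For example, batching many inserts into a single transaction (table and columns are illustrative):

    # One commit (and one fsync on the database side) for the whole batch
    # instead of one commit per insert.
    import MySQLdb

    conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='', db='taskdb')
    cur = conn.cursor()
    rows = [('taskid-%d' % i, 'http://example.com/%d' % i, 1) for i in range(1000)]
    cur.executemany(
        "INSERT INTO task (taskid, url, status) VALUES (%s, %s, %s)",
        rows)
    conn.commit()  # single commit for all 1000 rows
    conn.close()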

laapsaap commented 9 years ago

Yeah, transactions/batched commits would definitely improve performance. Right now this is an empty database, so performance is still quite high. In my tests the performance decreases over time as the database grows, due to what you just said about the current implementation. I tried different tuning with InnoDB and TokuDB; ~300/s is my maximum, while the benchmark does 1000+/s.

binux commented 9 years ago

The scheduler needs a get from the message queue and a get from the database to check whether the task has already been crawled, and then an insert into the database. So it's slower than the plain inserts in the benchmark.

laapsaap commented 9 years ago

image

How is it possible that the 1d count is higher than the all count? I am getting good results replacing InnoDB with TokuDB: lower memory and disk space usage.

binux commented 9 years ago

The 1d counter is the number of events, while all is the number of tasks. Mouse over the progress bar and you'll see the different wording used in the details.

laapsaap commented 9 years ago

I am getting a steady ~75000/5m nowadays. I tested Redis today; for me the performance was the same as with RabbitMQ.

We all know the scheduler is the bottleneck, but I am not entirely sure why.

My database can deal with the qps easily; it's about 700 qps with low load. My MySQL database is heavily tweaked for pyspider.

I run the processors, fetchers and RabbitMQ on different servers connected over a 1 Gbit LAN. All of them have low load and low IO.

Then there's the one scheduler, consistently running at about 60% CPU. So what is the bottleneck here? CPU? I find it weird that the scheduler process is not maxing out its CPU thread.

I have tried everything. I'm thinking about running the scheduler on an even more powerful CPU to see how much difference it makes. I also wonder whether MongoDB would improve performance at all.

binux commented 9 years ago

For a new-request operation (the blue part), it needs 1 queue.get + 1 database.get + 1 database.insert/update. With your benchmark numbers that is approximately 0.36ms + 1.14ms + 0.76ms = 2.26ms, i.e. about 442 qps. (NOTE: requests still within their age (which would not be recrawled) are not counted.)

For a success-task operation (in green), it needs 1 database.get + 1 queue.put + 1 queue.get + 1 database.update. That is 1.14ms + 0.31ms + 0.36ms + 0.76ms = 2.57ms, i.e. about 389 qps. (Is it limited to 200?)
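Spelled out, using the benchmark latencies above:

    # Per-task latency budget for the single-threaded scheduler (milliseconds).
    new_request  = 0.36 + 1.14 + 0.76          # queue.get + db.get + db.insert
    success_task = 1.14 + 0.31 + 0.36 + 0.76   # db.get + queue.put + queue.get + db.update
    print(1000 / new_request)    # ~442 qps
    print(1000 / success_task)   # ~389 qps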

As I replied in https://github.com/binux/pyspider/issues/181#issuecomment-93629437, the scheduler is single-threaded and has to handle the tasks one by one. Most of the time it is waiting for the database or the message queue to respond. I have de-duplicated identical tasks within one loop to reduce the number of tasks, and will try multiple threads or processes to improve performance.

laapsaap commented 9 years ago

What is limited to 200? I can improve cps (crawls per second) considerably by using multiple schedulers; it's not that hard, you only need to figure out in the project code which links to save into a different database or table and feed those links to a different scheduler. It does make the project code a bit more complicated because you have to control the flow yourself.

Would it be an idea to be able to spawn multiple schedulers and assign them to a project instead?

binux commented 9 years ago

What's limited to 200 is the rate shown here: https://github.com/binux/pyspider/issues/138#issuecomment-78235424

It's not that easy to use multiple schedulers. Even if you have figured out which links to save into which database, a dispatcher is needed to distribute the links to the corresponding scheduler.

One more thing: the priority task queue and the traffic control (token bucket) are not easy to split across multiple schedulers, especially splitting a single project across multiple schedulers. So multi-threading is much easier to control, with only the database access running in parallel.

laapsaap commented 9 years ago

Oh yeah, I have increased the crawl/burst rate to 500/500 nowadays, but only at the start of a new project do I reach 100K+/5m; afterwards it stabilizes at ~75K.

binux commented 9 years ago

With a single thread, I think it's hard to improve... BTW, what's your expected QPS?

laapsaap commented 9 years ago

I just like pushing limits, but practically it would be nice to have a steady 90K/5m, so about 300 qps. But theoretically there shouldn't be a limit :)

Indeed, it's hard to increase performance in its current state. Right now the only way for me is to run multiple pyspider instances. But this is not for everyone; many sites can't handle 300+ hits/sec (that's a DoS) and proxy lists get saturated very quickly.

I will spend some time in the code soon (scheduler).