binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

why not save task instead of taskid in scheduler task_queue ? #274

Open eromoe opened 9 years ago

eromoe commented 9 years ago

In scheduler._check_select, the scheduler calls self.taskdb.get_task for every selected task. This slows pyspider down. Why not save the full task in the queue instead of just the taskid?

    def _check_select(self):
        #.....

        taskids = []
        cnt = 0
        cnt_dict = dict()
        limit = self.LOOP_LIMIT
        for project, task_queue in iteritems(self.task_queue):
            if cnt >= limit:
                break

            # task queue
            self.task_queue[project].check_update()
            project_cnt = 0

            # check send_buffer here. When it is not empty, out_queue may be
            # blocked; do not send tasks.
            while cnt < limit and project_cnt < limit / 10:
                taskid = task_queue.get()
                if not taskid:
                    break

                taskids.append((project, taskid))
                project_cnt += 1
                cnt += 1
            cnt_dict[project] = project_cnt

        for project, taskid in taskids:
            task = self.taskdb.get_task(project, taskid, fields=self.request_task_fields)
            if not task:
                continue
            task = self.on_select_task(task)

        return cnt_dict
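
eromoe's proposal could be sketched as follows: a queue that stores the full task dict alongside its priority, so the select step can skip the `taskdb.get_task` round trip. The class name `InMemoryTaskQueue` and its interface are illustrative, not pyspider's actual `task_queue` API.

```python
import heapq
import itertools


class InMemoryTaskQueue:
    """Sketch: queue full task dicts instead of taskids (hypothetical)."""

    def __init__(self):
        self._heap = []                   # entries: (-priority, seq, task)
        self._seq = itertools.count()     # tie-breaker for equal priorities

    def put(self, task, priority=0):
        # Store the whole task dict, so no taskdb.get_task is needed later.
        heapq.heappush(self._heap, (-priority, next(self._seq), task))

    def get(self):
        # Return the highest-priority task, or None when empty.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


q = InMemoryTaskQueue()
q.put({'taskid': 'a', 'url': 'http://example.com/1'}, priority=1)
q.put({'taskid': 'b', 'url': 'http://example.com/2'}, priority=5)
task = q.get()  # full dict comes back; no database round trip
```

The trade-off binux raises below still applies: every queued task dict now lives in memory for as long as it waits.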
binux commented 9 years ago

To queue the tasks up in the scheduler, we would need to hold every task in active status in memory, which could consume a huge amount of memory.

eromoe commented 9 years ago

ACTIVE_TASKS = 100 would not cost too much memory, I think. Right now, one task flow makes at least four database accesses: get task (check), insert task, get task (select), and update (done). Crawling speed drops sharply as my data grows from 10K to 30K. That's why I want to remove the two gets. For now, I just save the task in the queue and always insert (upsert in Mongo) the task into taskdb. It would be very nice to separate taskdb operations from the scheduler.
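
The four-access flow described above could be reduced to one upsert per state change if the scheduler keeps the task in memory. A minimal sketch, with a counting dict standing in for the taskdb (a real backend such as MongoDB's upsert would replace it; all names here are illustrative):

```python
class TaskDBCounter:
    """Stand-in taskdb that counts write operations (illustrative only)."""

    def __init__(self):
        self.store = {}
        self.ops = 0

    def upsert(self, taskid, fields):
        # Merge new fields into the stored task, inserting if absent.
        self.ops += 1
        self.store[taskid] = dict(self.store.get(taskid, {}), **fields)


db = TaskDBCounter()
task = {'taskid': 't1', 'url': 'http://example.com', 'status': 'active'}

# Old flow: get (check) + insert + get (select) + update (done) = 4 accesses.
# Proposed flow: the scheduler already holds the task dict, so only the
# writes remain -- one upsert to enqueue, one to mark the task done.
db.upsert(task['taskid'], task)                   # enqueue
db.upsert(task['taskid'], {'status': 'success'})  # mark done
```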

binux commented 9 years ago

ACTIVE_TASKS is not the number of tasks kept in memory; it is the size of a deque of recently active tasks, kept for debugging. I would like to split the work across multiple threads to improve performance, and I will consider a mechanism to switch between holding task info in memory and fetching it from the database.
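
The distinction can be illustrated with a bounded deque: once it reaches its maximum length, each append drops the oldest entry, so its memory footprint stays constant no matter how many tasks flow through. The variable names below are illustrative; only the value 100 comes from the thread.

```python
import time
from collections import deque

ACTIVE_TASKS = 100  # size of the debug ring buffer, not the working set

# A bounded deque: appending beyond maxlen silently discards the oldest
# entry from the opposite end, so memory use is capped.
active_tasks = deque(maxlen=ACTIVE_TASKS)

for i in range(250):
    active_tasks.appendleft((time.time(), {'taskid': 'task-%d' % i}))

# Only the 100 most recent tasks remain, regardless of total throughput.
```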