Open eromoe opened 9 years ago
To queue tasks in the scheduler, we would need to keep every task in active status in memory, which could consume a huge amount of memory.
`ACTIVE_TASKS = 100` would not cost too much memory, I think.
Right now, one task flow makes at least four database accesses: get task (check), insert task, get task (select), and update (done). Crawling speed dropped sharply when my data grew from 10K to 30K. That's why I want to remove the two get calls. For now, I just save the task in the queue, and always insert (upsert in mongo) a task into taskdb.
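To make the access counts concrete, here is a minimal sketch comparing the two flows. It is not pyspider's code: `CountingTaskDB`, `run_with_taskid_queue`, and `run_with_task_queue` are hypothetical names, and the "database" is just a dict that counts how often it is touched.

```python
from collections import deque


class CountingTaskDB:
    """Hypothetical in-memory stand-in for a task database that counts accesses."""

    def __init__(self):
        self.rows = {}
        self.accesses = 0

    def get_task(self, taskid):
        self.accesses += 1
        return self.rows.get(taskid)

    def upsert(self, task):
        # Insert the task, or merge the new fields into the existing row.
        self.accesses += 1
        self.rows.setdefault(task["taskid"], {}).update(task)


def run_with_taskid_queue(db, tasks):
    """Four accesses per task: get (check), insert, get (select), update (done)."""
    queue = deque()
    for task in tasks:
        db.get_task(task["taskid"])          # check whether the task exists
        db.upsert(task)                      # insert
        queue.append(task["taskid"])         # only the id is queued
    while queue:
        task = db.get_task(queue.popleft())  # select: reload the full task
        db.upsert({"taskid": task["taskid"], "status": "done"})


def run_with_task_queue(db, tasks):
    """Two accesses per task: upsert on arrival, update when done."""
    queue = deque()
    for task in tasks:
        db.upsert(task)                      # always upsert, no prior check
        queue.append(task)                   # the full task is queued
    while queue:
        task = queue.popleft()               # no reload needed
        db.upsert({"taskid": task["taskid"], "status": "done"})
```

The second flow halves the database round trips in this toy model, at the cost of holding whole tasks in memory and skipping the existence check, which is the trade-off being debated in this thread.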
It would be very nice to separate the operation of taskdb from scheduler.
ACTIVE_TASKS is not the number of tasks kept in memory; it is the size of the deque of the most recent active tasks, kept for debugging. I would like to split the operations across multiple threads to improve performance. I will consider adding a mechanism to switch between holding task info in memory and reading it from the database.
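As a rough illustration of why such a deque stays small, here is a sketch using Python's `collections.deque` with `maxlen`. Whether pyspider builds its deque exactly this way is an assumption; the variable name `active_tasks` and the task dicts are made up for the example.

```python
from collections import deque

ACTIVE_TASKS = 100  # size quoted in this thread

# A bounded deque silently drops the oldest entry once it is full,
# so memory use does not grow with the number of tasks processed.
active_tasks = deque(maxlen=ACTIVE_TASKS)

for i in range(1000):
    active_tasks.append({"taskid": str(i)})  # hypothetical task records
```

After the loop, only the 100 most recent records remain, regardless of how many tasks passed through.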
`scheduler._check_select` calls `self.taskdb.get_task`. This behavior slows down pyspider. Why not just save the task instead of the taskid?