Open volvofixthis opened 7 years ago
You can delete the tasks for which last-crawl-time + ago < now. I cannot include such code by default because, in theory, ago comes from the script, and you can set a larger ago after deleting the task.
But I think it's fine to have it in helper scripts.
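A helper script along those lines could start from something like the sketch below. It only selects the tasks matching the condition above and does not touch any database. The dictionary keys (`taskid`, `lastcrawltime`, `ago`) and the function name `expired_task_ids` are assumptions made for illustration, not pyspider's confirmed schema or API, so adapt them to however your taskdb backend actually stores tasks.

```python
import time


def expired_task_ids(tasks, now=None):
    """Return the ids of tasks for which last-crawl-time + ago < now.

    `tasks` is an iterable of plain dicts. The key names used here are
    hypothetical and must be mapped to the real taskdb fields.
    """
    if now is None:
        now = time.time()
    expired = []
    for task in tasks:
        last_crawl = task.get("lastcrawltime")
        ago = task.get("ago")
        # Skip tasks that were never crawled or carry no usable validity period.
        if last_crawl is None or ago is None or ago < 0:
            continue
        if last_crawl + ago < now:
            expired.append(task["taskid"])
    return expired
```

The returned ids can then be passed to whatever delete statement the chosen backend supports. Keeping the selection separate from the deletion makes it easy to do a dry run first.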
On Fri, 7 Jul 2017, 11:12 volvofixthis, notifications@github.com wrote:
Is this possible? What do you think? Currently I have a taskdb of around 5.3 GB. I have very simple projects, which crawl new information from one main page, but there are plenty of them. So I think I need to do some cleanup in taskdb.
Maybe you have some recommendations and know where I might have a bottleneck? And can you tell me something about threading support in the scheduler?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/binux/pyspider/issues/721, or mute the thread https://github.com/notifications/unsubscribe-auth/AAndM_xmVuf0ZlwOFDgBH-J3-d3koLEkks5sLgSQgaJpZM4OQwTr .
The scheduler has thread support.
I saw the threaded scheduler support. Is it fully functional? And what can you say about Redis as taskdb?
Yes, it's used by default. The Redis taskdb has the best performance, as long as you have enough memory.
Is this possible? What do you think? Currently I have a taskdb of around 5.3 GB. I have very simple projects, which crawl new information from one main page, but there are plenty of them, around 1000 now. So I think I need to do some cleanup in taskdb.
Maybe you have some recommendations and know where I might have a bottleneck? And can you tell me something about threading support in the scheduler?
And what can you tell me about Redis as taskdb? Is it OK to use it? I found that the scheduler behaves very strangely with it.