binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0
16.49k stars · 3.69k forks

Retention policy for taskdb #721

Open volvofixthis opened 7 years ago

volvofixthis commented 7 years ago

Is this possible? What do you think? My taskdb is currently around 5.3 GB. I have very simple projects, each of which crawls new information from one main page, but there are plenty of them, around 1000 now. So I think I need to do some cleanup in the taskdb.

Maybe you have some recommendations and know where the bottleneck might be? And can you tell me something about threading support in the scheduler?

And what can you say about redis as a taskdb? Is it OK to use? I found that the scheduler works very strangely with it.

binux commented 7 years ago

You can delete the tasks where last-crawl-time + age < now. I cannot ship such code by default because, in theory, `age` comes from the script: you could set a larger `age` after the task has been deleted.

But I think it's fine to have it in helper scripts.
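The rule above (delete tasks where last-crawl-time + age < now) can be sketched as a small helper. This is a minimal illustration, not pyspider's actual taskdb API: `find_stale_tasks` and the task dicts are hypothetical, assuming only that each task record carries a `lastcrawltime` timestamp (in seconds since the epoch) and a `taskid`.

```python
import time

def find_stale_tasks(tasks, age_seconds, now=None):
    """Return the ids of tasks whose last crawl finished more than
    `age_seconds` ago, i.e. lastcrawltime + age < now."""
    if now is None:
        now = time.time()
    return [t["taskid"] for t in tasks
            if t.get("lastcrawltime", 0) + age_seconds < now]

# Hypothetical sample task records.
tasks = [
    {"taskid": "a", "lastcrawltime": 0},            # crawled long ago
    {"taskid": "b", "lastcrawltime": time.time()},  # crawled just now
]
print(find_stale_tasks(tasks, age_seconds=86400))   # → ['a']
```

A real helper script would iterate over the taskdb rows instead of an in-memory list and delete the matching ids; as noted above, pick an `age_seconds` no smaller than the largest `age` any of your scripts uses, or tasks may be dropped and re-crawled.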


binux commented 7 years ago

The scheduler has thread support.


volvofixthis commented 7 years ago

I saw the scheduler's thread support; is it fully functional? And what can you say about redis as a taskdb?

binux commented 7 years ago

Yes, it's used by default. The redis taskdb has the best performance, as long as you have enough memory.
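For reference, pyspider selects the taskdb backend via a database connection URI on the command line. The exact host, port, and db number below are placeholders; check the pyspider deployment docs for the URI format your version supports.

```shell
# Run pyspider with redis as the taskdb backend.
# "localhost:6379/0" is a placeholder for your own redis instance.
pyspider --taskdb "redis+taskdb://localhost:6379/0" all
```

Since redis keeps the whole dataset in memory, a 5.3 GB taskdb like the one described above needs at least that much RAM available to redis, which is another reason to prune stale tasks first.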