DormyMo / SpiderKeeper

admin ui for scrapy/open source scrapinghub
http://sk.7mdm.com:5000/

After a project is deleted, SpiderKeeper still tries to sync its status with Scrapyd #86

Open QingGo opened 6 years ago

QingGo commented 6 years ago

I deleted the SpiderKeeper projects by calling the API directly:

import requests

session = requests.Session()  # 'session' was undefined in the original snippet
for i in range(2, 19):
    project_delete_url = 'http://localhost:5000/project/{}/delete'.format(i)
    r = session.get(project_delete_url, auth=('admin', 'admin'))

After the deletion I noticed that the Scrapyd container in Docker was sitting at close to 100% CPU, and the following messages showed up in the SpiderKeeper log:

Execution of job "sync_spiders (trigger: interval[0:00:10], next run at: 2018-10-15 16:47:44 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-15 16:47:49 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-15 16:47:54 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_spiders (trigger: interval[0:00:10], next run at: 2018-10-15 16:47:54 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-15 16:47:59 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-15 16:48:04 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_spiders (trigger: interval[0:00:10], next run at: 2018-10-15 16:48:04 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-15 16:48:19 CST)" skipped: maximum number of running instances reached (1)

I tried stopping Scrapyd, and SpiderKeeper then, as expected, logged a pile of warnings about failed requests to Scrapyd's listjobs and listspiders endpoints. The strange part is that the ?project= parameter in those requests always referred to projects that had already been deleted. My guess is that after a project is deleted (the project really is gone from Scrapyd), SpiderKeeper does not delete the corresponding periodic jobs from its own sqlite database.

Also, after deleting a project and then creating a new one in the SpiderKeeper UI, the new project still shows the run history of the old project. Again, my guess is that when a project is deleted (and it really is gone from Scrapyd), SpiderKeeper does not delete the corresponding job execution records from its own sqlite database.

Maybe I'm calling the delete API the wrong way? Any help would be appreciated.
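If that guess is right, a rough workaround would be to purge the orphaned records from the sqlite file by hand. This is only a sketch under assumptions: the file name SpiderKeeper.db and the table names sk_project, sk_job_instance and sk_job_execution are guesses and should be checked with .tables in the sqlite3 shell first.

import sqlite3

# Assumed file and table names (SpiderKeeper.db, sk_project, sk_job_instance,
# sk_job_execution); verify the real schema before running this.
conn = sqlite3.connect('SpiderKeeper.db')
cur = conn.cursor()
# drop periodic jobs and run records whose project no longer exists
cur.execute("DELETE FROM sk_job_instance "
            "WHERE project_id NOT IN (SELECT id FROM sk_project)")
cur.execute("DELETE FROM sk_job_execution "
            "WHERE project_id NOT IN (SELECT id FROM sk_project)")
conn.commit()
conn.close()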

QingGo commented 6 years ago

Also, after brute-force deleting SpiderKeeper.db, SpiderKeeper does not seem to automatically re-sync the projects that already exist on Scrapyd.

3inchtime commented 6 years ago

I'm running into the same problem. Did you find a solution?

QingGo commented 6 years ago

I'm running into the same problem. Did you find a solution?

In the end I never solved it, so I gave up on SpiderKeeper and switched to celery-beat to manage the scheduled jobs.

3inchtime commented 6 years ago

I'm running into the same problem. Did you find a solution?

In the end I never solved it, so I gave up on SpiderKeeper and switched to celery-beat to manage the scheduled jobs.

Ugh, that's frustrating.

3inchtime commented 6 years ago

I'm running into the same problem. Did you find a solution?

In the end I never solved it, so I gave up on SpiderKeeper and switched to celery-beat to manage the scheduled jobs.

By now I'm fairly sure something in the code is blocking Scrapyd and it has nothing to do with SpiderKeeper. But the same spider runs fine on other machines, which leaves me baffled.

Ericliu68 commented 5 years ago

You could take a closer look at what is in the db. The source code should probably be extended to also delete the project and job records from the db; a rough sketch follows.
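A sketch of what that source-level cleanup could look like, assuming SpiderKeeper's Flask-SQLAlchemy models are named Project, SpiderInstance, JobInstance and JobExecution and live under SpiderKeeper.app (both the import paths and the model names are my guesses; verify them against the actual model module before patching):

# Assumed import paths and model names; not SpiderKeeper's actual code.
from SpiderKeeper.app import db
from SpiderKeeper.app.spider.model import (Project, SpiderInstance,
                                           JobInstance, JobExecution)

def delete_project_completely(project_id):
    # remove run history, periodic jobs and spiders that reference the project
    JobExecution.query.filter_by(project_id=project_id).delete()
    JobInstance.query.filter_by(project_id=project_id).delete()
    SpiderInstance.query.filter_by(project_id=project_id).delete()
    Project.query.filter_by(id=project_id).delete()
    db.session.commit()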

QingGo commented 5 years ago

Any call SpiderKeeper makes to a Scrapyd API can fail for all sorts of reasons (a network error, or Scrapyd itself getting bogged down because it is hit too frequently), which leaves the two out of sync. I think those errors need proper handling, for example showing a failure message in the UI, or retrying automatically.
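To illustrate the retry idea, here is a minimal sketch against Scrapyd's documented listjobs.json endpoint; the Scrapyd address, project name and retry count are placeholders, not SpiderKeeper's actual implementation:

import time
import requests

def list_jobs_with_retry(project, scrapyd_url='http://localhost:6800', retries=3):
    for attempt in range(retries):
        try:
            r = requests.get(scrapyd_url + '/listjobs.json',
                             params={'project': project}, timeout=5)
            r.raise_for_status()
            return r.json()
        except requests.RequestException as exc:
            # network error or an overloaded Scrapyd: back off and try again
            print('listjobs failed (attempt %d): %s' % (attempt + 1, exc))
            time.sleep(2 ** attempt)
    return None  # the caller should surface the failure (e.g. in the UI)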

QingGo commented 5 years ago

That said, I no longer use SpiderKeeper; I now manage scheduled jobs with celery + celery-beat.

Ericliu68 commented 5 years ago

Scrapyd lets you configure how many spider processes run concurrently. What I'd really like to know is how celery + celery-beat invokes Scrapy spiders. Any tutorials you'd recommend?

QingGo commented 5 years ago

Scrapyd lets you configure how many spider processes run concurrently. What I'd really like to know is how celery + celery-beat invokes Scrapy spiders. Any tutorials you'd recommend?

For celery itself the official docs are the place to start. celery + celery-beat is only used to fire the asynchronous requests on a schedule; for calling Scrapyd from Python you can try the python-scrapyd-api library.

Ericliu68 commented 5 years ago

OK, thanks.