[Open] QingGo opened this issue 6 years ago
What a coincidence: I just saw a similar new issue. https://github.com/DormyMo/SpiderKeeper/issues/84#issue-369012488
I've opened a PR; not sure whether it can be merged into this branch? https://github.com/DormyMo/SpiderKeeper/pull/85
Ran into the same problem.
A while ago I noticed that scheduled jobs were sometimes not running as planned. A careful look at the program's log output turned up the following error:
2018-10-11 16:18:50,165 - SpiderKeeper.app - DEBUG - [sync_job_execution_status] Job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-11 08:18:59 UTC)" raised an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/local/lib/python3.5/site-packages/SpiderKeeper/app/schedulers/common.py", line 13, in sync_job_execution_status_job
    for project in Project.query.all():
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/orm/query.py", line 2703, in all
    return list(self)
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/orm/query.py", line 2855, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/orm/query.py", line 2876, in _execute_and_instances
    close_with_result=True)
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/orm/query.py", line 2885, in _get_bind_args
    **kw
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/orm/query.py", line 2867, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/orm/session.py", line 966, in connection
    execution_options=execution_options)
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/orm/session.py", line 971, in _connection_for_bind
    engine, execution_options)
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/orm/session.py", line 382, in _connection_for_bind
    self._assert_active()
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/orm/session.py", line 276, in _assert_active
    % self._rollback_exception
sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (sqlite3.IntegrityError) NOT NULL constraint failed: sk_job_execution.service_job_execution_id [SQL: 'INSERT INTO sk_job_execution (date_created, date_modified, project_id, service_job_execution_id, job_instance_id, create_time, start_time, end_time, running_status, running_on) VALUES (CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: (11, None, 12, '2018-10-11 16:12:00.028118', None, None, 0, 'http://192.168.1.245:6800')]
At first I thought Project.query.all() itself was failing, but on closer inspection the real failure was an earlier INSERT statement in the same transaction. Since the error output doesn't point at a code location, my guess is that this SQL statement corresponds to a session.add() call in sqlalchemy. I eventually traced the likely cause to /SpiderKeeper/app/proxy/contrib/scrapy.py: the start_spider method that launches a crawl can return None, while the service_job_execution_id column of the sk_job_execution table in sqlite has a NOT NULL constraint. The method in question:
def start_spider(self, project_name, spider_name, arguments):
    post_data = dict(project=project_name, spider=spider_name)
    post_data.update(arguments)
    data = request("post", self._scrapyd_url() + "/schedule.json",
                   data=post_data, return_type="json")
    return data['jobid'] if data and data['status'] == 'ok' else None
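To confirm the chain from a None jobid to the InvalidRequestError in the log, here is a minimal self-contained reproduction I put together (my own sketch, not SpiderKeeper's code): a NOT NULL violation during flush leaves the session in a rolled-back state, so the very next query fails exactly like the log above.

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class JobExecution(Base):
    # Stripped-down stand-in for sk_job_execution; only the column
    # that matters here is kept.
    __tablename__ = 'sk_job_execution'
    id = Column(Integer, primary_key=True)
    service_job_execution_id = Column(String, nullable=False)

engine = create_engine('sqlite://')  # in-memory sqlite
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Simulate start_spider() returning None for the jobid.
session.add(JobExecution(service_job_execution_id=None))
try:
    session.commit()  # flush raises IntegrityError: NOT NULL constraint failed
except Exception as e:
    print(type(e).__name__, e)

try:
    # Without an explicit session.rollback(), any further use of the
    # session raises InvalidRequestError, just like Project.query.all()
    # in sync_job_execution_status_job.
    session.query(JobExecution).all()
except Exception as e:
    print(type(e).__name__, e)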
My SpiderKeeper and Scrapyd are deployed in separate docker containers on the same machine, with a dozen or so spiders deployed on Scrapyd. Scrapyd has no record of the scheduled run either, so it looks like the step where SpiderKeeper asks Scrapyd to start the job is what failed. Maybe a network issue, maybe a problem in Scrapyd itself; has anyone else run into the same situation? Also, to make scheduled jobs more fault-tolerant, could start_spider be changed to something like the following:
def start_spider(self, project_name, spider_name, arguments):
    post_data = dict(project=project_name, spider=spider_name)
    post_data.update(arguments)
    data = request("post", self._scrapyd_url() + "/schedule.json",
                   data=post_data, return_type="json")
    if data and data['status'] == 'ok':
        return data['jobid']
    else:
        import time
        time.sleep(3)
        return self.start_spider(project_name, spider_name, arguments)
Maybe also add a configurable maximum number of retries?
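Something like this, say (the max_retries and retry_interval parameters are names I'm making up here, not SpiderKeeper's API; the loop also avoids the unbounded recursion of the version above):

import time

def start_spider(self, project_name, spider_name, arguments,
                 max_retries=3, retry_interval=3):
    post_data = dict(project=project_name, spider=spider_name)
    post_data.update(arguments)
    # One initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        data = request("post", self._scrapyd_url() + "/schedule.json",
                       data=post_data, return_type="json")
        if data and data['status'] == 'ok':
            return data['jobid']
        if attempt < max_retries:
            time.sleep(retry_interval)
    return None  # callers still need to handle a scheduling failure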
Does retrying actually help??
It does help, but afterwards I ran into other status-sync errors. My understanding is this: any SpiderKeeper call to any of Scrapyd's APIs can fail for all sorts of reasons (a network glitch, or Scrapyd itself getting congested from being hit too frequently), leaving the two out of sync. I think errors deserve proper handling, e.g. showing the failure in the UI, or retrying automatically. That said, I no longer use SpiderKeeper; I've switched to celery + celery-beat to manage scheduled jobs.
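For reference, a minimal celery-beat sketch of the kind of setup I mean; the task body, broker URL, and schedule here are all illustrative only:

from celery import Celery
from celery.schedules import crontab

app = Celery('crawler', broker='redis://localhost:6379/0')

@app.task
def run_spider(project_name, spider_name):
    # Kick off the crawl however you like, e.g. by POSTing to
    # Scrapyd's /schedule.json and checking the response yourself.
    ...

# Run the (hypothetical) task at the top of every hour.
app.conf.beat_schedule = {
    'run-my-spider-hourly': {
        'task': 'tasks.run_spider',  # assumes this module is tasks.py
        'schedule': crontab(minute=0),
        'args': ('my_project', 'my_spider'),
    },
}

Assuming the module is tasks.py, a worker with an embedded beat scheduler can be started with: celery -A tasks worker -B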
Hit the same problem: scheduled jobs sometimes fail to start.
Why not try ScrapydWeb?
I'll give it a try; I see 1.2 now supports scheduled jobs (Timer Tasks).
Did you ever solve it? How did you solve it?
I dropped spiderkeeper and went with scrapydweb. The spiderkeeper author no longer maintains the project, and once there are many jobs its sqlite database locks up.