XiaoMi / minos

Minos is beyond a hadoop deployment system.
Apache License 2.0

meet a problem when use owl to monitor yarn #6

Closed zenglinxi0615 closed 10 years ago

zenglinxi0615 commented 10 years ago

When I click a YARN task id on the owl web page, the page made up of OpenTSDB monitoring views fails to load and instead shows the error: "A server error occurred. Please contact the administrator."

The log serve.log shows the following:

    [02/Jan/2014 15:49:15] "GET /monitor/task/225 HTTP/1.1" 301 0
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/wsgiref/handlers.py", line 85, in run
        self.result = application(self.environ, self.start_response)
      File "/usr/local/lib/python2.7/site-packages/django/contrib/staticfiles/handlers.py", line 67, in __call__
        return self.application(environ, start_response)
      File "/usr/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 209, in __call__
        response = self.get_response(request)
      File "/usr/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 200, in get_response
        response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
      File "/usr/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 230, in handle_uncaught_exception
        'request': request
      File "/usr/local/lib/python2.7/logging/__init__.py", line 1154, in error
        self._log(ERROR, msg, args, **kwargs)
      File "/usr/local/lib/python2.7/logging/__init__.py", line 1246, in _log
        self.handle(record)
      File "/usr/local/lib/python2.7/logging/__init__.py", line 1256, in handle
        self.callHandlers(record)
      File "/usr/local/lib/python2.7/logging/__init__.py", line 1293, in callHandlers
        hdlr.handle(record)
      File "/usr/local/lib/python2.7/logging/__init__.py", line 740, in handle
        self.emit(record)
      File "/usr/local/lib/python2.7/site-packages/django/utils/log.py", line 106, in emit
        connection=self.connection())
      File "/usr/local/lib/python2.7/site-packages/django/core/mail/__init__.py", line 98, in mail_admins
        mail.send(fail_silently=fail_silently)
      File "/usr/local/lib/python2.7/site-packages/django/core/mail/message.py", line 284, in send
        return self.get_connection(fail_silently).send_messages([self])
      File "/usr/local/lib/python2.7/site-packages/django/core/mail/backends/smtp.py", line 92, in send_messages
        new_conn_created = self.open()
      File "/usr/local/lib/python2.7/site-packages/django/core/mail/backends/smtp.py", line 51, in open
        self.connection = connection_class(self.host, self.port, **connection_params)
      File "/usr/local/lib/python2.7/smtplib.py", line 239, in __init__
        (code, msg) = self.connect(host, port)
      File "/usr/local/lib/python2.7/smtplib.py", line 295, in connect
        self.sock = self._get_socket(host, port, self.timeout)
      File "/usr/local/lib/python2.7/smtplib.py", line 273, in _get_socket
        return socket.create_connection((port, host), timeout)
      File "/usr/local/lib/python2.7/socket.py", line 567, in create_connection
        raise error, msg
    error: [Errno 111] Connection refused

What could be causing this problem?

wuzesheng commented 10 years ago

This stack is all from Django and the Python standard library. Is there any stack information from minos itself?

zenglinxi0615 commented 10 years ago

I could not find anything related to minos itself in server.log, and the other log files do not appear to be related to this problem.

wuzesheng commented 10 years ago

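It looks like it is connecting to an SMTP server that is not running, but I suspect this is not the root cause. From the symptoms above, the path is probably: owl hits an error -> Django emails the administrator -> sending the email fails.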
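
If the SMTP failure is only masking the original exception, one way to surface it is to log request errors locally instead of mailing them. The fragment below is a minimal sketch, assuming owl uses Django's standard `LOGGING` setting; the handler name `console` and the fragment itself are illustrative and not taken from minos.

```python
import logging
import logging.config

# Hypothetical settings.py fragment: log request errors locally rather than
# through Django's AdminEmailHandler, which needs a reachable SMTP server.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        # "console" is an illustrative handler name.
        "console": {"class": "logging.StreamHandler", "level": "ERROR"},
    },
    "loggers": {
        # Django logs unhandled request exceptions to "django.request".
        "django.request": {
            "handlers": ["console"],
            "level": "ERROR",
            "propagate": False,
        },
    },
}

# Django applies LOGGING with dictConfig; calling it here checks that the
# fragment is well formed even outside a Django project.
logging.config.dictConfig(LOGGING)
```

With something like this in place, the traceback of the underlying owl error should be written to the server output rather than being lost when the email cannot be sent.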

wuzesheng commented 10 years ago

Please paste the link you are clicking, and check the HTTP status code for that request in the backend Django log.

zenglinxi0615 commented 10 years ago

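The link is an intranet address, in a format like http://10.10.65.13:8080/monitor/cluster/6/task/. I think it is what you described: "owl hits an error -> Django emails the administrator -> sending the email fails". I will check the logs again.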

wuzesheng commented 10 years ago

OK. Check the HTTP status code that Django returned for the failing request; it may help.

zenglinxi0615 commented 10 years ago

WARNING 2014-01-03 15:07:01,349 collect 16994 140143162611456 <Task: yarn/hadoop-crete/nodemanager/26> failed to update metric: OperationalError(2006, 'MySQL server has gone away')

I suspect the database connection was dropped.

zenglinxi0615 commented 10 years ago

When owl updates the monitoring data in MySQL, does it first open a MySQL connection, then fetch the JSON data over JMX, and then update the MySQL table?

wuzesheng commented 10 years ago

The MySQL connection is maintained by Django underneath; it should be a persistent connection.
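
A persistent connection can be dropped by the server while it sits idle, which would match the `OperationalError(2006, ...)` above. The sketch below shows one generic way to cope: reset the connection and retry once. It assumes nothing about how owl's collector is actually written; `call_with_reconnect`, `reset_connection`, and `is_mysql_gone_away` are hypothetical names.

```python
import time


def call_with_reconnect(operation, reset_connection, is_stale, retries=1, delay=0.0):
    """Run operation(); on a stale-connection error, reset and retry."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception as exc:
            # Re-raise anything that is not a dropped connection, and give up
            # once the retry budget is spent.
            if attempt >= retries or not is_stale(exc):
                raise
            attempt += 1
            reset_connection()
            time.sleep(delay)


def is_mysql_gone_away(exc):
    # MySQL client error 2006 is "MySQL server has gone away"; the error in
    # this thread carries the code as the first exception argument.
    return bool(exc.args) and exc.args[0] == 2006
```

In a Django project, `reset_connection` could plausibly be `django.db.connection.close`, since Django opens a new connection on the next query; whether that fits owl's collector is an assumption.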

wuzesheng commented 10 years ago

Could you send the CPU usage of the owl collect process and of MySQL in your setup?

zenglinxi0615 commented 10 years ago

20182 minos 20 0 101m 23m 3388 S 8 0.0 0:01.99 python2.7
24174 mysql 20 0 265m 41m 7444 S 7 0.1 9:44.78 mysqld

Neither is high. After setting the data collection period to 30, the MySQL problem no longer occurs. There is a new problem:

WARNING 2014-01-03 16:22:19,891 collect 16772 139817798592256 <Task: hbase/hadoop-crete/master/0> failed to get metric: KeyError('hadoop:service=Master,name=Master',)

I am debugging it. I think the error I posted yesterday was triggered by a failure while updating state in minos/owl/collector/management/commands/collect.py.

wuzesheng commented 10 years ago

Check whether the JMX page of your HBase has this entry: "name" : "hadoop:service=Master,name=Master"

wuzesheng commented 10 years ago

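There are two possible causes for this problem:

  1. The HBase version is old, and its JMX output does not contain the entry above.
  2. The HBase active master has started but has not finished loading its data, in which case the entry above is also missing.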
zenglinxi0615 commented 10 years ago

It is probably the HBase version. We are using 0.96, and the entries on the JMX page have changed: "name" : "Hadoop:service=HBase,name=MetricsSystem,sub=Stats"

wuzesheng commented 10 years ago

Oh, I see. It looks like there is still quite a lot to do for compatibility across multiple versions.

zenglinxi0615 commented 10 years ago

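Yes, this is hard-coded. It would be good to turn it into a configuration option, ideally with ready-made configurations for a few commonly used HBase versions (if JMX differs between those versions).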

zenglinxi0615 commented 10 years ago

I made a small mistake just now. The JMX entry you mentioned corresponds to this in 0.96: Hadoop:service=HBase,name=Master
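
To illustrate the configuration idea, here is a rough sketch of a per-version lookup. The two bean names are the ones reported in this thread; the version keys, `MASTER_BEAN_BY_VERSION`, `find_bean`, and `master_bean` are hypothetical and do not come from the minos code. It assumes the JMX JSON output has a top-level `beans` list whose items carry a `name` field.

```python
# Bean names reported in this thread; the version keys are illustrative.
MASTER_BEAN_BY_VERSION = {
    "pre-0.96": "hadoop:service=Master,name=Master",
    "0.96": "Hadoop:service=HBase,name=Master",
}


def find_bean(jmx, bean_name):
    """Return the bean dict with the given name, or None if it is absent."""
    for bean in jmx.get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    return None


def master_bean(jmx, version, names=MASTER_BEAN_BY_VERSION):
    """Look up the master bean using the name configured for this version."""
    return find_bean(jmx, names[version])
```

Returning `None` for a missing bean would let the collector log a clear message (for example when the master has not finished loading) instead of failing with a `KeyError`.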

wuzesheng commented 10 years ago

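OK, understood, thanks for the feedback. Your suggestion is a good one and we will consider it. We are short on manpower at the moment, though, so we cannot get to it quickly; please patch it locally for now. BTW: is everything on your side running normally now?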

wuzesheng commented 10 years ago

I created a new issue to track this: https://github.com/XiaoMi/minos/issues/18