XiaoMi / minos

Minos is beyond a hadoop deployment system.
Apache License 2.0
522 stars · 200 forks

Question about owl monitoring #38

Closed zengzhaozheng closed 9 years ago

zengzhaozheng commented 9 years ago

My owl installation and startup are fine, but when I open the monitoring page, no HDFS metrics (or any other metrics) are detected. What else do I need to configure? The contents of my /data/hadoop/z.zeng/minos-master/config/owl/collector.cfg are as follows:

```ini
# collector config
[collector]
services=hdfs hbase yarn impala
# Period to fetch/report metrics, in seconds.
period=10

[hdfs]
clusters=dptst-example
jobs=journalnode namenode datanode
# The jmx output of each bean is as follows:
# {
#   "name" : "hadoop:service=RegionServer,name=RegionServerDynamicStatistics",
#   "modelerType" : "org.apache.hadoop.hbase.regionserver.metrics.RegionServerDynamicStatistics",
#   "tbl.YCSBTest.cf.test.blockCacheNumCached" : 0,
#   "tbl.YCSBTest.cf.test.compactionBlockReadCacheHitCnt" : 0,
#   ...
# Some metrics/values are from hadoop/hbase and some are from the java runtime
# environment, so we specify a filter on the jmx url to get hadoop/hbase metrics.
metric_url=/jmx?qry=Hadoop:*
metric_url=http://sx-master:50070/jmx?qry=Hadoop:*

[hbase]
clusters=dptst-example
jobs=master regionserver
metric_url=/jmx?qry=hadoop:*

[yarn]
clusters=dptst-example
jobs=resourcemanager nodemanager historyserver proxyserver
metric_url=/jmx?qry=Hadoop:*

[impala]
clusters=dptst-example
jobs=statestored impalad
metric_url=/
need_analyze=false
```
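As a sanity check on a config like the one above, one can hit a daemon's `/jmx` servlet directly with the same `qry` filter the collector uses and see whether any beans come back. A minimal sketch in modern Python (owl itself runs on Python 2.7; `sx-master:50070` is the NameNode HTTP address from the `metric_url` example above, substitute your own host and port):

```python
import json
from urllib.request import urlopen

def build_jmx_url(base_url, qry="Hadoop:*"):
    # collector.cfg's metric_url is exactly this path+query suffix,
    # appended to each daemon's HTTP address.
    return "%s/jmx?qry=%s" % (base_url.rstrip("/"), qry)

def fetch_jmx_beans(base_url, qry="Hadoop:*"):
    # Each Hadoop/HBase daemon serves a JSON document {"beans": [...]} at /jmx;
    # the qry parameter filters out JVM-internal beans.
    with urlopen(build_jmx_url(base_url, qry), timeout=10) as resp:
        return json.load(resp).get("beans", [])

# Hypothetical usage against a live NameNode:
# for bean in fetch_jmx_beans("http://sx-master:50070"):
#     print(bean["name"])
```

If this returns an empty bean list, or connects to the wrong host entirely, the collector will show nothing regardless of how collector.cfg is written.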

YxAc commented 9 years ago

Could you paste a section of owl/collector.log? The metrics-collection log goes to that file; the logs for rendering the metrics pages are in owl/server.log and owl/debug.log.

zengzhaozheng commented 9 years ago

```
options: {'collector_cfg': 'collector.cfg', 'settings': None, 'use_threadpool': False, 'pythonpath': None, 'verbosity': u'1', 'traceback': None, 'no_color': False, 'clear_oldtasks': False}
INFO 2014-11-25 14:06:51,244 collect 63152 139699130095360 <Task: hdfs/dptst-example/journalnode/0> waiting 6.827905 seconds for http://localhost:12101/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,245 collect 63152 139699130095360 <Task: hdfs/dptst-example/journalnode/1> waiting 6.396616 seconds for http://localhost:12101/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,245 collect 63152 139699130095360 <Task: hdfs/dptst-example/journalnode/2> waiting 0.373451 seconds for http://localhost:12101/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,245 collect 63152 139699130095360 <Task: hdfs/dptst-example/namenode/0> waiting 6.239103 seconds for http://localhost:12201/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,245 collect 63152 139699130095360 <Task: hdfs/dptst-example/namenode/1> waiting 7.122569 seconds for http://localhost:12201/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,246 collect 63152 139699130095360 <Task: hdfs/dptst-example/datanode/0> waiting 1.824659 seconds for http://localhost:12401/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,246 collect 63152 139699130095360 <Task: hdfs/dptst-example/datanode/1> waiting 6.874303 seconds for http://localhost:12401/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,246 collect 63152 139699130095360 <Task: hdfs/dptst-example/datanode/2> waiting 0.431661 seconds for http://localhost:12411/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,246 collect 63152 139699130095360 <Task: hdfs/dptst-example/datanode/3> waiting 1.287644 seconds for http://localhost:12401/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,246 collect 63152 139699130095360 <Task: hdfs/dptst-example/datanode/4> waiting 7.967131 seconds for http://localhost:12411/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,246 collect 63152 139699130095360 <Task: hdfs/dptst-example/datanode/5> waiting 7.407624 seconds for http://localhost:12421/jmx?qry=Hadoop:*...
INFO 2014-11-25 14:06:51,246 collect 63152 139699130095360 <Task: hbase/dptst-example/master/0> waiting 4.281206 seconds for http://localhost:12501/jmx?qry=hadoop:*...
INFO 2014-11-25 14:06:51,246 collect 63152 139699130095360 <Task: hbase/dptst-example/master/1> waiting 3.529428 seconds for http://localhost:12501/jmx?qry=hadoop:*...
INFO 2014-11-25 14:06:51,247 collect 63152 139699130095360 <Task: hbase/dptst-example/regionserver/0> waiting 5.092863 seconds for http://localhost:12601/jmx?qry=hadoop:*...
INFO 2014-11-25 14:06:51,247 collect 63152 139699130095360 <Task: hbase/dptst-example/regionserver/1> waiting 6.544828 seconds for http://localhost:12601/jmx?qry=hadoop:*...
INFO 2014-11-25 14:06:51,247 collect 63152 139699130095360 <Task: hbase/dptst-example/regionserver/2> waiting 2.135464 seconds for http://localhost:12611/jmx?qry=hadoop:*...
INFO 2014-11-25 14:06:51,247 collect 63152 139699130095360 <Task: hbase/dptst-example/regionserver/3> waiting 3.166893 seconds for http://localhost:12601/jmx?qry=hadoop:*...
INFO 2014-11-25 14:06:51,247 collect 63152 139699130095360 <Task: hbase/dptst-example/regionserver/4> waiting 5.212955 seconds for http://localhost:12611/jmx?qry=hadoop:*...
INFO 2014-11-25 14:06:51,247 collect 63152 139699130095360 <Task: hbase/dptst-example/regionserver/5> waiting 0.850794 seconds for http://localhost:12621/jmx?qry=hadoop:*...
INFO 2014-11-25 14:06:51,247 collect 63152 139699130095360 <Task: impala/dptst-example/statestored/0> waiting 5.578948 seconds for http://localhost:21301/...
```

What puzzles me a bit: why is the jmx url still http://localhost:12621/jmx?qry=hadoop:*... ? I have already changed the configuration.

zengzhaozheng commented 9 years ago

After modifying /data/hadoop/z.zeng/minos-master/config/owl/collector.cfg, which processes need to be restarted?

zengzhaozheng commented 9 years ago

I only installed owl, without tank and supervisor.

zengzhaozheng commented 9 years ago

An error showed up in owl/debug.log:

```
ERROR 2014-11-25 14:18:45,279 base 3404 139814237136640 Internal Server Error: /failover/
Traceback (most recent call last):
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/core/handlers/base.py", line 111, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/data/hadoop/z.zeng/minos-master/owl/failover_framework/views.py", line 29, in index
    hour_task_number = Task.objects.filter(start_timestamp__gt=previous_hour).count()
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/db/models/query.py", line 338, in count
    return self.query.get_count(using=self.db)
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/db/models/sql/query.py", line 424, in get_count
    number = obj.get_aggregation(using=using)[None]
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/db/models/sql/query.py", line 390, in get_aggregation
    result = query.get_compiler(using).execute_sql(SINGLE)
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 786, in execute_sql
    cursor.execute(sql, params)
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/db/backends/utils.py", line 65, in execute
    return self.cursor.execute(sql, params)
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/db/backends/utils.py", line 65, in execute
    return self.cursor.execute(sql, params)
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 128, in execute
    return self.cursor.execute(query, args)
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
ProgrammingError: (1146, "Table 'hadoop_owl.failover_framework_task' doesn't exist")

ERROR 2014-11-25 14:52:54,078 base 11619 139746212833024 Internal Server Error: /monitor/table/count_rows/
Traceback (most recent call last):
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/core/handlers/base.py", line 111, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/data/hadoop/z.zeng/minos-master/owl/monitor/views.py", line 583, in show_table_count_rows
    'count_period': settings.COUNT_PERIOD,
  File "/data/hadoop/z.zeng/minos-master/build/env/lib/python2.7/site-packages/django/conf/__init__.py", line 47, in __getattr__
    return getattr(self._wrapped, name)
AttributeError: 'Settings' object has no attribute 'COUNT_PERIOD'
```

It says the table hadoop_owl.failover_framework_task doesn't exist. (This was reported when I clicked the failover tab on the page.)

zengzhaozheng commented 9 years ago

Is it really possible to use only owl on an existing cluster, without tank and supervisor? In that case, do I need to manually configure hdfs-dptst-example.cfg?

YxAc commented 9 years ago
  1. "Why is the jmx url http://localhost:12621/jmx?... when I have already re-configured it?" — this is mainly determined by the configuration file being read, e.g. the host configured in hdfs-dptst-example.cfg.
  2. "After modifying /data/hadoop/z.zeng/minos-master/config/owl/collector.cfg, which processes need to be restarted?" — if you modify that file, or any cluster configuration file, you need to restart collector.sh under the owl directory.
  3. "It says the table hadoop_owl.failover_framework_task doesn't exist." — we have many internal customizations here, which can cause some confusion; you can simply remove the failover module in owl/owl/settings.py.
  4. The core of owl is collecting metrics and displaying them. Some of the other modules do need tank and friends, but the collector does not: it collects metrics from the urls configured in collector.cfg and the hosts in the cluster configuration file hdfs-dptst-example.cfg, and neither tank nor supervisor is involved in that, so running owl alone should be feasible.
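For point 3, "removing the failover module" in a Django project like owl most likely amounts to dropping the app from `INSTALLED_APPS` in owl/owl/settings.py, so its views (and the missing `hadoop_owl.failover_framework_task` table) are never touched. A hedged sketch — the app label `failover_framework` is inferred from the traceback's import path, and the other entries are placeholders, not owl's real settings:

```python
# Illustrative stand-in for the app list in owl/owl/settings.py.
INSTALLED_APPS = [
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'monitor',
    'failover_framework',
]

# Drop the failover app so Django never loads its views or queries
# its (nonexistent) backing table.
INSTALLED_APPS = [app for app in INSTALLED_APPS if app != 'failover_framework']
```

The corresponding URL route for /failover/ would need to be removed from the project's urls.py as well, or the page link will 500 for a different reason.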
zengzhaozheng commented 9 years ago

Which script can be used to stop the collector?

YxAc commented 9 years ago

There isn't one in the open-source release; this part isn't polished yet. @zengzhaozheng just use `ps -ef | grep collect` to find and kill it, then start it again :-)

Daemon-process management in the open-source version is still fairly messy, and the open-source owl really hasn't been maintained for a long time. Internally we use supervisord to manage all the processes behind owl. When we have the time or manpower, we'll clean up and fix the open-source parts of owl :-)

zengzhaozheng commented 9 years ago

datanode 3 sx-slave4:12201, datanode 4 sx-slave5:12201, datanode 5 sx-slave6:12201 — this is the information I see in the web UI. Does port 12201 refer to the datanode's HTTP server port?

zengzhaozheng commented 9 years ago

The ports set in config/conf/hdfs/hdfs-dptst-example.cfg must be multiples of 100, so when I use owl on its own I have to change the HTTP ports of the corresponding datanodes and namenodes to match, which is rather inconvenient.

YxAc commented 9 years ago
  1. "Does port 12201 refer to the datanode's HTTP server port?" — yes. By default the HTTP port is base_port + 1 (this is because we start the cluster processes with supervisord here, and set the HTTP port to base_port + 1 at startup).
  2. "When I use owl on its own, I have to change the corresponding datanode and namenode HTTP ports, which is inconvenient." — if your HTTP ports follow a regular pattern relative to base_port, you can hack it uniformly :-) 1) change the multiple-of-100 logic, and 2) change how the task entries are displayed on the owl pages.
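The port convention discussed above can be sketched as a tiny helper: base ports are expected to be multiples of 100, and each daemon's HTTP server lands on base_port + 1. This is only an illustration of the convention as described in this thread, not the actual minos code:

```python
def http_port(base_port):
    # Per this thread: minos expects base_port to be a multiple of 100, and
    # supervisord starts each daemon with its HTTP server on base_port + 1.
    if base_port % 100 != 0:
        raise ValueError("base_port must be a multiple of 100: %d" % base_port)
    return base_port + 1

# e.g. the datanode entries shown in the UI: base_port 12200 -> HTTP port 12201
```

Running owl against a cluster whose daemons do not follow this mapping is exactly the case where the two hacks above (relaxing the multiple-of-100 check and fixing the task-entry display) become necessary.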