ceph / calamari

Web-based monitoring and management for Ceph

Server Error 500 : ERROR - django.request Internal Server Error #309

Open ksingh7 opened 9 years ago

ksingh7 commented 9 years ago

Hello Developers

Could you help in fixing this issue?

[screenshot]

Calamari.log

2015-06-23 13:53:41,317 - metric_access - django.request No graphite data for ceph.cluster.9609b429-eee2-4e23-af31-28a24fcf5cbc.df.total_used_bytes
2015-06-23 13:53:41,329 - metric_access - django.request No graphite data for ceph.cluster.9609b429-eee2-4e23-af31-28a24fcf5cbc.df.total_used
2015-06-23 13:53:41,330 - metric_access - django.request No graphite data for ceph.cluster.9609b429-eee2-4e23-af31-28a24fcf5cbc.df.total_space
2015-06-23 13:53:41,330 - metric_access - django.request No graphite data for ceph.cluster.9609b429-eee2-4e23-af31-28a24fcf5cbc.df.total_avail
2015-06-23 13:53:41,394 - ERROR - django.request Internal Server Error: /api/v1/cluster/9609b429-eee2-4e23-af31-28a24fcf5cbc/health_counters
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/viewsets.py", line 78, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 94, in dispatch
    self.client.close()
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 292, in close
    ClientBase.close(self)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 194, in close
    self._multiplexer.close()
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/channel.py", line 61, in close
    self._channel_dispatcher_task.kill()
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/greenlet.py", line 235, in kill
    waiter.get()
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/hub.py", line 575, in get
    return self.hub.switch()
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/hub.py", line 338, in switch
    return greenlet.switch(self)
LostRemote: Lost remote after 10s heartbeat

My environment details

[root@ceph-node1 ~]# rpm -qa | grep -i supervisor
supervisor-3.0-1.el7.noarch
[root@ceph-node1 ~]#
[root@ceph-node1 ~]# rpm -qa | grep -i calamari
calamari-server-1.3.0.1-49_g828960a.el7.centos.x86_64
calamari-clients-1.2.2-32_g931ee58.el7.centos.x86_64
[root@ceph-node1 ~]#
[root@ceph-node1 ~]# ceph -v
ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
[root@ceph-node1 ~]#
[root@ceph-node1 ~]# cat /etc/redhat-release
CentOS Linux release 7.0.1406 (Core)
[root@ceph-node1 ~]#
[root@ceph-node1 ~]# rpm -qa | grep -i salt
salt-2015.5.0-1.el7.noarch
salt-master-2015.5.0-1.el7.noarch
salt-minion-2015.5.0-1.el7.noarch
[root@ceph-node1 ~]#
ksingh7 commented 9 years ago

IIRC this is one of the most annoying error messages with the calamari dashboard; I have been watching this error on the mailing lists for a very long time. Most of the people involved with calamari have seen it during their stint with calamari.

So I hope we can fix this once and for all ( #dream )

joehandzik commented 9 years ago

This looks...possibly relevant. From: https://github.com/ceph/calamari/blob/c64121ab01aef0be6dfc3bef1940e21fe09af45f/rest-api/calamari_rest/views/v1.py#L58

# In case the cluster has been offline for some time, try looking progressively
# further back in time for data.  This would not be necessary if graphite simply
# let us ask for the latest value (Calamari issue #6876)
for trange in ['-1min', '-10min', '-60min', '-1d', '-7d']:
    val = _get(parseATTime(trange, tzinfo))
    if val is not None:
        return val
joehandzik commented 9 years ago

I could at least see that causing the timeout. Now, what the correct workaround is... it looks like the calamari guys would like graphite's functionality expanded a bit. A quick hack to try would be to shorten the trange list drastically, or remove that for loop entirely. If you stop timing out, it seems like the only repercussion would be that you'd lose the data it's logging about. Not sure if that would be a tragic loss to you or not.
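As a rough sketch of that hack (purely illustrative; `get_value` and `parse_at_time` below are stand-ins for the `_get` and `parseATTime` helpers in v1.py, not the real implementations), the lookback loop could be collapsed to a single recent window:

```python
# Illustrative sketch only: query just the most recent window instead of
# walking progressively further back in time. `get_value` stands in for
# Calamari's graphite lookup helper (`_get` in v1.py).
def latest_value(get_value, parse_at_time, tzinfo=None):
    # Look back one minute only; if graphite has nothing that recent,
    # return None immediately instead of retrying with wider windows.
    return get_value(parse_at_time('-1min', tzinfo))
```

The trade-off is the one described above: a cluster that has been offline longer than the single window would simply report no data instead of stalling the request.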

dmick commented 9 years ago

All "500" means is "something went wrong". It's basically what happens when anything in a hugely complicated multistep process fails.

dmick commented 9 years ago

Chances are good that all this means is that something's broken in cthulhu, either in talking to the cluster or in running on its own. Basic troubleshooting:

1) does salt work to the minions?
2) is cthulhu running without errors? check all logs in /var/log/calamari
3) increase cthulhu's debug level in calamari.conf
4) try talking to the Calamari API directly from the browser while watching the logs

joehandzik commented 9 years ago

Tracing through the code, it looks like this is the guy instigating the graphite operations that are being logged about:

https://github.com/ceph/calamari/blob/c64121ab01aef0be6dfc3bef1940e21fe09af45f/rest-api/calamari_rest/views/v1.py#L100

Looks like it should just handle the case where those requests fail. It seems suspicious that a timeout log coincides with actual failures to retrieve those values. @ksingh7, definitely give @dmick's suggestions a shot.
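A hedged sketch of "handle the case where those requests fail" (the names here are hypothetical, not Calamari's actual API): wrap each metric lookup so a missing or failing graphite series degrades to a default instead of propagating up into a 500 response:

```python
# Hypothetical helper: `fetch` stands in for whatever queries graphite.
# If the series is missing or the query raises, return `default` rather
# than letting the exception bubble up into an Internal Server Error.
def safe_metric(fetch, metric, default=None):
    try:
        value = fetch(metric)
    except Exception:
        return default
    return default if value is None else value
```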

ksingh7 commented 9 years ago

@dmick Thanks for you answer

History: the Calamari server, client, and diamond package builds were successful. After that, the initial Calamari configuration, including the salt-keys steps, went fine. However, calamari-ctl initialize gave some errors but finally worked.

The dashboard was working nicely until I added the first node. When I added the remaining nodes to Calamari ( salt-minion --> diamond --> salt-key -A ), the dashboard broke and threw this error.

1) Yes, salt-master and salt-minion work:

[root@ceph-node1 views]# salt-key -L
Accepted Keys:
ceph-node1
ceph-node2
ceph-node3
Denied Keys:
Unaccepted Keys:
Rejected Keys:
[root@ceph-node1 views]#

2) cthulhu is running, BUT with errors; @dmick, you guessed right:

[root@ceph-node1 views]# supervisorctl status
carbon-cache                     RUNNING    pid 29279, uptime 0:37:26
cthulhu                          RUNNING    pid 29284, uptime 0:37:18
[root@ceph-node1 views]#

I am repeatedly getting these messages in cthulhu.log:

2015-06-23 22:49:46,088 - WARNING - cthulhu Abandoning fetch for mon_map started at 2015-06-23 19:48:54.057496+00:00
2015-06-23 22:49:46,088 - ERROR - cthulhu Exception handling message with tag ceph/cluster/9609b429-eee2-4e23-af31-28a24fcf5cbc
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 245, in _run
    self.on_heartbeat(data['id'], data['data'])
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/gevent_util.py", line 35, in wrapped
    return func(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 347, in on_heartbeat
    cluster_data['versions'][sync_type.str])
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 99, in on_version
    self.fetch(reported_by, sync_type)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 109, in fetch
    client = LocalClient(config.get('cthulhu', 'salt_config_path'))
  File "/usr/lib/python2.7/site-packages/salt/client/__init__.py", line 126, in __init__
    self.opts = salt.config.client_config(c_path)
  File "/usr/lib/python2.7/site-packages/salt/config.py", line 2176, in client_config
  File "/usr/lib/python2.7/site-packages/salt/utils/xdg.py", line 13, in xdg_config_dir
  File "/opt/calamari/venv/lib64/python2.7/posixpath.py", line 269, in expanduser
KeyError: 'getpwuid(): uid not found: 0'
2015-06-23 22:49:56,392 - WARNING - cthulhu Abandoning fetch for osd_map started at 2015-06-23 19:49:36.709092+00:00
2015-06-23 22:49:56,393 - ERROR - cthulhu Exception handling message with tag ceph/cluster/9609b429-eee2-4e23-af31-28a24fcf5cbc
2015-06-23 22:53:49,138 - WARNING - cthulhu Abandoning fetch for mon_map started at 2015-06-23 19:53:19.073702+00:00
2015-06-23 22:53:49,288 - ERROR - cthulhu Exception handling message with tag ceph/cluster/9609b429-eee2-4e23-af31-28a24fcf5cbc
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 245, in _run
    self.on_heartbeat(data['id'], data['data'])
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/gevent_util.py", line 35, in wrapped
    return func(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 347, in on_heartbeat
    cluster_data['versions'][sync_type.str])
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 99, in on_version
    self.fetch(reported_by, sync_type)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 109, in fetch
    client = LocalClient(config.get('cthulhu', 'salt_config_path'))
  File "/usr/lib/python2.7/site-packages/salt/client/__init__.py", line 136, in __init__
    listen=not self.opts.get('__worker', False))
  File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 112, in get_event
    return MasterEvent(sock_dir or opts.get('sock_dir', None))
  File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 510, in __init__
    super(MasterEvent, self).__init__('master', sock_dir)
  File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 176, in __init__
    self.get_event(wait=1)
  File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 361, in get_event
    ret = self._get_event(wait, tag, pending_tags)
  File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 305, in _get_event
    socks = dict(self.poller.poll(wait * 1000))
  File "/opt/calamari/venv/lib/python2.7/site-packages/zmq/green/poll.py", line 81, in poll
    select.select(rlist, wlist, xlist)
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/select.py", line 68, in select
    result.event.wait(timeout=timeout)
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/event.py", line 77, in wait
    result = self.hub.switch()
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/hub.py", line 337, in switch
    switch_out()
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/gevent_util.py", line 15, in asserter
    raise ForbiddenYield("Context switch during `nosleep` region!")
ForbiddenYield: Context switch during `nosleep` region!

4) Some of the API commands are working, but some don't:

[screenshots]

ksingh7 commented 9 years ago

@dmick

3) I increased the logging level on cthulhu.

What I found is that when it goes to ceph-node2 / ceph-node3, it cannot get cluster data and logs the message "cthulhu Ignoring cluster data from ceph-node2, it is not my favourite (ceph-node1)".

Hope these logs can point us to something:

2015-06-24 00:33:08,271 - DEBUG - cthulhu _run.ev: ceph-node2/tag=ceph/server
2015-06-24 00:33:08,272 - DEBUG - cthulhu.server_monitor ServerMonitor got ceph/server message from ceph-node2
2015-06-24 00:33:08,272 - DEBUG - cthulhu.server_monitor ServerMonitor.on_server_heartbeat: ceph-node2
2015-06-24 00:33:08,273 - DEBUG - cthulhu.server_monitor ServerMonitor._register_service: ServiceId(fsid='9609b429-eee2-4e23-af31-28a24fcf5cbc', service_type='osd', service_id='5')
2015-06-24 00:33:08,273 - DEBUG - cthulhu.server_monitor ServerMonitor._register_service: ServiceId(fsid='9609b429-eee2-4e23-af31-28a24fcf5cbc', service_type='osd', service_id='4')
2015-06-24 00:33:08,273 - DEBUG - cthulhu.server_monitor ServerMonitor._register_service: ServiceId(fsid='9609b429-eee2-4e23-af31-28a24fcf5cbc', service_type='mon', service_id='ceph-node2')
2015-06-24 00:33:08,273 - DEBUG - cthulhu.server_monitor ServerMonitor._register_service: ServiceId(fsid='9609b429-eee2-4e23-af31-28a24fcf5cbc', service_type='osd', service_id='3')
2015-06-24 00:33:08,274 - DEBUG - cthulhu.server_monitor ServerMonitor._register_service: ServiceId(fsid='9609b429-eee2-4e23-af31-28a24fcf5cbc', service_type='mds', service_id='ceph-node2')
2015-06-24 00:33:08,275 - DEBUG - cthulhu TopLevelEvents: ignoring ceph/server
2015-06-24 00:33:08,326 - DEBUG - cthulhu _run.ev: ceph-node2/tag=ceph/cluster/9609b429-eee2-4e23-af31-28a24fcf5cbc
2015-06-24 00:33:08,327 - DEBUG - cthulhu Ignoring cluster data from ceph-node2, it is not my favourite (ceph-node1)
2015-06-24 00:33:08,329 - DEBUG - cthulhu TopLevelEvents: heartbeat from existing cluster 9609b429-eee2-4e23-af31-28a24fcf5cbc
ChristinaMeno commented 9 years ago

@ksingh7 The latest cthulhu logs are not the source of the problem; the previous ones are: ForbiddenYield: Context switch during nosleep region!

The KeyError: 'getpwuid(): uid not found: 0' error might be worth tracking down.
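For tracking that down: the KeyError is raised by Python's `pwd.getpwuid()` inside `posixpath.expanduser()` when the running uid has no passwd entry visible to the process. A small diagnostic (a sketch, to be run as the same user and in the same environment cthulhu runs in) shows whether the lookup succeeds:

```python
import pwd

def probe_uid(uid=0):
    """Return the login name for `uid`, or the KeyError message that
    cthulhu logs when the passwd lookup fails."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError as exc:
        return str(exc)

print(probe_uid(0))  # 'root' on a typical system
```

If this prints the 'uid not found' message inside cthulhu's environment but not outside it, something (e.g. a confined or chrooted context hiding /etc/passwd) is breaking the lookup.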

JackZielke commented 9 years ago

I made a change trying to fix the getpwuid problem. I have not seen it again, so this might have worked.

In /etc/apparmor.d/abstractions/python I added one line at the end: /etc/passwd r,

EDIT: That did not help. I am still getting the getpwuid error.

ivanoch79 commented 9 years ago

When I see the error 500 on the screen, the logs show getpwuid errors at the same time. It works for a while, then cthulhu starts throwing KeyError: 'getpwuid(): uid not found: 0'. If I kill -HUP the process, it works again for some time until the same error starts showing up in the logs.

Also, when cthulhu-manager hangs and the 500 error starts showing up on the screen, I ran lsof on the pid: there is a pretty big number of anon_inode entries, around 800. This number gradually increases until CPU utilization goes to 100% and the 500 error appears on the screen.
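One cheap way to watch that growth (a sketch; Linux-only, counting /proc fd entries rather than parsing lsof output):

```python
import os

def open_fd_count(pid):
    """Count open file descriptors for `pid` via /proc (Linux only).

    A rough stand-in for `lsof -p PID | wc -l`; a steadily climbing
    count for cthulhu-manager would confirm the leak described above.
    """
    return len(os.listdir('/proc/%d/fd' % pid))
```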

Is there any fix for this? I've applied the patch for salt_wrapper.py from git to work with 2015.5.2.

Thanks

2015-07-15 04:12:45,765 - ERROR - cthulhu Exception handling message with tag ceph/cluster/db9c01f8-14a6-11e5-8515-2e924e5027c2
Traceback (most recent call last):
  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 245, in _run
    self.on_heartbeat(data['id'], data['data'])
  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/gevent_util.py", line 35, in wrapped
    return func(*args, **kwargs)
  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 347, in on_heartbeat
    cluster_data['versions'][sync_type.str])
  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 99, in on_version
    self.fetch(reported_by, sync_type)
  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 109, in fetch
    client = LocalClient(config.get('cthulhu', 'salt_config_path'))
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 126, in __init__
    self.opts = salt.config.client_config(c_path)
  File "/usr/lib/python2.7/dist-packages/salt/config.py", line 2180, in client_config
  File "/usr/lib/python2.7/dist-packages/salt/utils/xdg.py", line 13, in xdg_config_dir
  File "/opt/calamari/venv/lib/python2.7/posixpath.py", line 269, in expanduser
KeyError: 'getpwuid(): uid not found: 0'

dmick commented 9 years ago

Does your system really not have a uid 0 account installed?


ivanoch79 commented 9 years ago

Ubuntu 14.04.2 LTS

We are using Ubuntu 14; that KeyError: 'getpwuid(): uid not found: 0' error doesn't show up until the anon_inode count gets really high:

root@radosgw-openstack-01:/opt/calamari# lsof -p 22427 | grep anon_inod | wc -l
288
root@radosgw-openstack-01:/opt/calamari# lsof -p 22427 | grep anon_inod | wc -l
295
root@radosgw-openstack-01:/opt/calamari# lsof -p 22427 | grep anon_inod | wc -l
309
root@radosgw-openstack-01:/opt/calamari# lsof -p 22427 | grep anon_inod | wc -l
337

cthulhu-m 22427 root 351u 0000 0,9 0 6289 anon_inode
cthulhu-m 22427 root 352u 0000 0,9 0 6289 anon_inode
cthulhu-m 22427 root 353u 0000 0,9 0 6289 anon_inode
cthulhu-m 22427 root 354u 0000 0,9 0 6289 anon_inode
cthulhu-m 22427 root 355u 0000 0,9 0 6289 anon_inode

root@radosgw-openstack-01:/opt/calamari# id
uid=0(root) gid=0(root) groups=0(root)

root@cephnode02:~# id
uid=0(root) gid=0(root) groups=0(root)

ksingh7 commented 9 years ago

@ivanoch79, you said you applied a patch from GitHub to work with salt 2015.5.2. Which patch is this? Are you sure it worked for you?

I mean, if you are still facing this error, try a salt 2014 version instead.

joehandzik commented 9 years ago

I had been puttering around with fixing all the parts that are incompatible with salt 2015.*, but I hit a wall recently and started encountering more confusing problems. The best advice I can give is to drop back to 2014 until we can spend more time triaging everything that changed in 2015.

Joe


wyllys66 commented 9 years ago

This is indeed nasty; I wasted half a day chasing down these exact issues before I found this thread. I finally reverted my saltstack back to 2014.7.4 and now it all works as expected.

andral commented 8 years ago

Where did you guys find salt 2014 packages? The official repo only has 2015 releases available:

https://repo.saltstack.com/yum/redhat/7/x86_64/

tserong commented 8 years ago

Has anyone tried salt 2015.8?

I tried 2015.5 today and immediately got a pile of 500s and "ForbiddenYield: Context switch during nosleep region!" messages, but after upgrading to salt 2015.8 this problem seems to have evaporated.

jerkyrs commented 8 years ago

I have the same error on 2015.8. The problem is, as the post above says, that 2014 is not available in EPEL. I found the packages, however, on this Korean mirror, which is probably not using rsync --delete (good luck downloading). I installed 2014.7.5 on all nodes and on the calamari server.

http://mirror.oasis.onnetcorp.com/epel/testing/7/x86_64/s/

After fixing this error, it now leads to the next one...

jerkyrs commented 8 years ago

Just to clarify: you actually need 2014.1.11 for it to work; otherwise the cluster is not found in Calamari. I did see a reference above to 2014.7.4; I initially tried 2014.7.5 and it did not work (no cluster found). On installing 2014.1.11, I noticed it installed two dependencies (python-libcloud and sshpass). Not sure whether something in these made it work or not; I have not tested upgrading to the latest salt or a newer variant to validate.

There is a reference to the same at http://lists.ceph.com/pipermail/ceph-calamari-ceph.com/2015-July/000236.html

You may also have to give it a kick with the following

salt '*' ceph.heartbeat