dalibo / temboard

PostgreSQL Remote Control
https://labs.dalibo.com/temboard

ERROR: Exception: Can't find host_id for ... in monitoring.hosts table. #289

Closed bsislow closed 6 years ago

bsislow commented 6 years ago

version info:
temboard.noarch 1.2-1.el7.centos @temboard-rhel7
temboard-agent.noarch 1.2-1.el7.centos @temboard-rhel7

why are we seeing this? see below in bold.

------ agent

2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR: HTTP Error 500: Internal Server Error
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR: Traceback (most recent call last):
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:   File "/usr/lib/python2.7/site-packages/temboardagent/plugins/monitoring/__init__.py", line 1042, in monitoring_sender_worker
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:     msg.content)
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:   File "/usr/lib/python2.7/site-packages/temboardagent/plugins/monitoring/output.py", line 34, in send_output
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:     data=json.loads(j_output)
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:   File "/usr/lib/python2.7/site-packages/temboardagent/httpsclient.py", line 93, in https_request
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:     handle = url_opener.open(request)
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:   File "/usr/lib64/python2.7/urllib2.py", line 437, in open
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:     response = meth(req, response)
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:   File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:     'http', request, response, code, msg, hdrs)
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:   File "/usr/lib64/python2.7/urllib2.py", line 475, in error
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:     return self._call_chain(*args)
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:   File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:     result = func(*args)
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:   File "/usr/lib64/python2.7/urllib2.py", line 558, in http_error_default
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR:     raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
2018-03-22 13:19:46,815 temboard-agent[26756]: [monitoring] ERROR: HTTPError: HTTP Error 500: Internal Server Error

----- server

2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR: Can't find host_id for "xxxxxxxxxxxx" in monitoring.hosts table.
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR: Traceback (most recent call last):
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR:   File "/usr/lib/python2.7/site-packages/temboardui/plugins/monitoring/__init__.py", line 712, in get_data_probe
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR:     host_id = get_host_id(self.db_session, instance.hostname)
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR:   File "/usr/lib/python2.7/site-packages/temboardui/plugins/monitoring/__init__.py", line 155, in get_host_id
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR:     " in monitoring.hosts table." % hostname)
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR: Exception: Can't find host_id for "xxxxxxxxxx" in monitoring.hosts table.
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR: Can't find host_id for "xxxxxxxxxxxx" in monitoring.hosts table.
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR: Traceback (most recent call last):
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR:   File "/usr/lib/python2.7/site-packages/temboardui/plugins/monitoring/__init__.py", line 712, in get_data_probe
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR:     host_id = get_host_id(self.db_session, instance.hostname)
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR:   File "/usr/lib/python2.7/site-packages/temboardui/plugins/monitoring/__init__.py", line 155, in get_host_id
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR:     " in monitoring.hosts table." % hostname)
2018-03-22 13:20:27,457 temboard[26356]: [temboardui] ERROR: Exception: Can't find host_id for "xxxxxxxxxx" in monitoring.hosts table.

... and the table contains:

You are now connected to database "temboard" as user "temboard".
temboard=# select * from monitoring.hosts;
 host_id | hostname | os | os_version | os_flavour | cpu_count | cpu_arch | memory_size | swap_size | virtual
---------+----------+----+------------+------------+-----------+----------+-------------+-----------+---------
(0 rows)

... yet we've registered an instance (removed IP and host name):

temboard=# select * from application.instances;
 agent_address | agent_port |                            agent_key                             |        hostname         | cpu | memory_size  | pg_port |                                               pg_version                                                |  pg_data
---------------+------------+------------------------------------------------------------------+-------------------------+-----+--------------+---------+---------------------------------------------------------------------------------------------------------+------------
 xxxxxxxxx     |       2345 | YCO3HVpgDY5avNrMg4246gPqvfBnPPQppv5mVrLKpwlPYUW43TTMVXu41gIYotGK | xxxxxxxxxx              |  20 | 101331210240 |   45221 | PostgreSQL 10.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), 64-bit | /pgdata/10
(1 row)

julmon commented 6 years ago

Hello,

Could you please send the full server logs to julmon at gmail dot com?

Thanks,

dgiffin commented 6 years ago

Got the same problem on a new (and different) temboard install on my test server. I'll send you the logs, julmon.

dgiffin commented 6 years ago

Not sure if it's relevant, but when I add a second test host to temboard monitoring, it seems to work fine. So MY problem with the monitoring.hosts table not being updated seems to occur only when I add the host that the temboard server itself is running on, which was the first one I added and the one I had the problem with.

bsislow commented 6 years ago

this makes no sense whatsoever, but upon logging on this morning, the monitoring.hosts table NOW has data and we did not change anything overnight. is there some process that runs on a regular basis to scrape this data or something? can we tell when the row below was added?

we will send the full logs to you as a reference as well. thanks.

temboard=# select * from monitoring.hosts;
 host_id |        hostname         |  os   |      os_version       | os_flavour | cpu_count | cpu_arch | memory_size  | swap_size | virtual
---------+-------------------------+-------+-----------------------+------------+-----------+----------+--------------+-----------+---------
       1 | xxxxxxxxxxx             | Linux | 3.10.0-514.el7.x86_64 |            |        20 | x86_64   | 101331210240 |           |

julmon commented 6 years ago

Thanks for the logs @bsislow

ERROR: Exception: Can't find host_id for "<host.fqdn>" in monitoring.hosts table.

This error is raised when you try to browse monitoring pages when there is no monitoring data, meaning the agent is not able to send monitoring data to the server. In agent logs we can see this error:

2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)>
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR: Traceback (most recent call last):
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:   File "/usr/lib/python2.7/site-packages/temboardagent/plugins/monitoring/__init__.py", line 1042, in monitoring_sender_worker
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:     msg.content)
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:   File "/usr/lib/python2.7/site-packages/temboardagent/plugins/monitoring/output.py", line 34, in send_output
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:     data=json.loads(j_output)
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:   File "/usr/lib/python2.7/site-packages/temboardagent/httpsclient.py", line 93, in https_request
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:     handle = url_opener.open(request)
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:   File "/usr/lib64/python2.7/urllib2.py", line 431, in open
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:     response = self._open(req, data)
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:   File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:     '_open', req)
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:   File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:     result = func(*args)
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:   File "/usr/lib/python2.7/site-packages/temboardagent/httpsclient.py", line 34, in https_open
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:     return self.do_open(self.specialized_conn_class, req)
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:   File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR:     raise URLError(err)
2018-03-22 10:41:53,269 temboard-agent[19683]: [monitoring] ERROR: URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)>

In this case, the issue is due to the SSL certificate check. To get rid of the SSL certificate check, you'll need to comment out ssl_ca_cert_file in both temboard.conf and temboard-agent.conf.

In temboard.conf:

[temboard]
...
# ssl_ca_cert_file = temboard_ca_certs_CHANGEME.pem
...

In temboard-agent.conf:

...
[monitoring]
...
# ssl_ca_cert_file = temboard-agent_ca_certs_CHANGEME.pem
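
To confirm which SSL parameters are actually active after editing a file, a small script can list the uncommented ssl_* options in a section. This is a minimal sketch, assuming the files are plain INI that Python 3's configparser can read; the helper name active_ssl_options, the sample text, and the ssl_cert_file path are illustrative placeholders, not part of temboard.

```python
import configparser


def active_ssl_options(conf_text, section):
    """Return the ssl_* options that are set (not commented out) in a section."""
    # interpolation=None so values containing '%' do not raise.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section(section):
        return {}
    return {k: v for k, v in parser.items(section) if k.startswith("ssl_")}


# Illustrative config text; the ssl_cert_file path is a placeholder.
SAMPLE = """
[temboard]
ssl_cert_file = /path/to/cert.pem
# ssl_ca_cert_file = temboard_ca_certs_CHANGEME.pem
"""

options = active_ssl_options(SAMPLE, "temboard")
print(sorted(options))
```

With the sample above, only ssl_cert_file is reported as active, which is the state this workaround aims for: the CA check is disabled while the server certificate itself stays configured.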

ERROR: Exception: Can't find the instance "<host.fqdn>" in application.instances table.

This error means the temboard server can't find any instance with this hostname in the application.instances table. To fix this, edit the instance information through the temboard UI (Settings -> Instances -> Edit) and change the instance hostname to match the hostname found by the agent. To check which hostname the agent reports, you can try this:

$ curl -s -k https://localhost:2345/discover | python -m json.tool
{
    "cpu": 1,
    "hostname": "temboard.agent.dev",
    "memory_size": 1040527360,
    "pg_data": "/var/lib/pgsql/10/data",
    "pg_port": "5432",
    "pg_version": "PostgreSQL 10.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), 64-bit",
    "plugins": [
        "monitoring",
        "pgconf",
        "administration",
        "dashboard",
        "activity"
    ]
}
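
The same check can be scripted. The following is a minimal sketch in Python 3 (the thread's installs run Python 2.7), assuming the agent answers on https://localhost:2345/discover as in the curl call above; fetch_discover and hostname_matches are hypothetical helper names, and certificate verification is disabled to mirror curl -k.

```python
import json
import ssl
import urllib.request


def fetch_discover(url="https://localhost:2345/discover"):
    """Fetch the agent's /discover payload without verifying the certificate,
    which mirrors `curl -k` (acceptable for a local check only)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, context=ctx, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def hostname_matches(payload, registered_hostname):
    """True when the agent-reported hostname equals the registered one."""
    return payload.get("hostname") == registered_hostname


# Offline demonstration using a payload shaped like the output above.
sample = json.loads('{"hostname": "temboard.agent.dev", "pg_port": "5432"}')
print(hostname_matches(sample, "temboard.agent.dev"))
print(hostname_matches(sample, "temboard"))
```

Against a live agent, hostname_matches(fetch_discover(), "<registered hostname>") would compare the agent's view with the value stored in application.instances.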

About hostname

Few words about hostname:

Regards,

julmon commented 6 years ago

cc @dgiffin

bsislow commented 6 years ago

this is excellent information. the issue:

ERROR: Exception: Can't find the instance "" in application.instances table.

... was the primary concern. matching this to the hostname found by the agent should be the appropriate solution. please leave this open so dgiffin can take a look at his configuration. thank you.

bsislow commented 6 years ago

actually, we're not quite done here...

if i disable the cert for temboard server, i see the following in temboard server log and i cannot reach the web interface after a restart:

FATAL: coercing to Unicode: need string or buffer, NoneType found
  File "/usr/bin/temboard", line 249, in <module>
    main()
  File "/usr/bin/temboard", line 232, in main
    server = AutoHTTPSServer(application, ssl_options=ssl_ctx)
  File "/usr/lib64/python2.7/site-packages/tornado/util.py", line 221, in __new__
    instance.initialize(*args, **init_kwargs)
  File "/usr/lib64/python2.7/site-packages/tornado/httpserver.py", line 154, in initialize
    read_chunk_size=chunk_size)
  File "/usr/lib64/python2.7/site-packages/tornado/tcpserver.py", line 109, in __init__
    if not os.path.exists(self.ssl_options['certfile']):
  File "/usr/lib64/python2.7/genericpath.py", line 18, in exists
    os.stat(path)

bsislow commented 6 years ago

i restarted temboard server with the ssl_ca_cert_file uncommented and it works. i left the agent commented.

i noticed now that we see the "front page" charts empty, but the charts are populated for each host when we log into the agents.

for example, front page - and of course i removed the server names and detail:

image

if you look at the monitoring page for the first server, however, it looks like this:

image

... so there's obviously load, etc.

what do you think is going on here?

the temboard server log does not have any errors when i refresh the "home" screen for the overall health view of all servers... (i removed IPs)

2018-03-26 15:33:00,167 temboard[48145]: [temboardui] INFO: Loading home.
2018-03-26 15:33:00,195 temboard[48145]: [temboardui] INFO: Done.
2018-03-26 15:33:00,197 temboard[48145]: [temboardui] INFO: 200 GET /home (10.244.244.124) 31.19ms
2018-03-26 15:33:00,272 temboard[48145]: [temboardui] INFO: 200 GET /css/timeline.css (10.244.244.124) 1.12ms
2018-03-26 15:33:00,291 temboard[48145]: [temboardui] INFO: 200 GET /css/sb-admin-2.css (10.244.244.124) 0.63ms
2018-03-26 15:33:00,318 temboard[48145]: [temboardui] INFO: 200 GET /css/font-awesome.min.css (10.244.244.124) 0.65ms
2018-03-26 15:33:00,375 temboard[48145]: [temboardui] INFO: 200 GET /css/bootstrap.xl.min.css (10.244.244.124) 0.84ms
2018-03-26 15:33:00,402 temboard[48145]: [temboardui] INFO: 200 GET /css/bootstrap-toggle.min.css (10.244.244.124) 0.57ms
2018-03-26 15:33:00,425 temboard[48145]: [temboardui] INFO: 200 GET /css/bootstrap-tagsinput.css (10.244.244.124) 1.14ms
2018-03-26 15:33:00,448 temboard[48145]: [temboardui] INFO: 200 GET /css/bootstrap-multiselect.css (10.244.244.124) 0.63ms
2018-03-26 15:33:00,480 temboard[48145]: [temboardui] INFO: 200 GET /css/bootstrap-tagsinput-typeahead.css (10.244.244.124) 0.58ms
2018-03-26 15:33:00,505 temboard[48145]: [temboardui] INFO: 200 GET /css/dataTables.bootstrap.min.css (10.244.244.124) 0.70ms
2018-03-26 15:33:00,508 temboard[48145]: [temboardui] INFO: 200 GET /css/dataTables.fontAwesome.css (10.244.244.124) 0.63ms
2018-03-26 15:33:00,517 temboard[48145]: [temboardui] INFO: 200 GET /codemirror/lib/codemirror.css (10.244.244.124) 0.70ms
2018-03-26 15:33:00,642 temboard[48145]: [temboardui] INFO: 200 GET /css/temboard.css (10.244.244.124) 126.33ms
2018-03-26 15:33:00,909 temboard[48145]: [temboardui] INFO: 200 GET /server/xxxxxxxx.xxx/2345/monitoring/data/tps?start=2018-03-26T19:33:00.758Z&end=2018-03-26T20:33:00.759Z (10.244.244.124) 129.14ms
2018-03-26 15:33:00,910 temboard[48145]: [temboardui] INFO: 200 GET /server/xxxxxxxx.xxx/2345/monitoring/data/tps?start=2018-03-26T19:33:00.770Z&end=2018-03-26T20:33:00.770Z (10.244.244.124) 100.22ms
2018-03-26 15:33:00,911 temboard[48145]: [temboardui] INFO: 200 GET /server/xxxxxxxx.xxx/2345/monitoring/data/loadavg?interval=load5&start=2018-03-26T19:33:00.758Z&end=2018-03-26T20:33:00.759Z (10.244.244.124) 108.58ms
2018-03-26 15:33:00,912 temboard[48145]: [temboardui] INFO: 200 GET /server/xxxxxxxx.xxx/2345/monitoring/data/tps?start=2018-03-26T19:33:00.767Z&end=2018-03-26T20:33:00.767Z (10.244.244.124) 79.42ms
2018-03-26 15:33:00,913 temboard[48145]: [temboardui] INFO: 200 GET /server/xxxxxxxx.xx/2345/monitoring/data/loadavg?interval=load5&start=2018-03-26T19:33:00.770Z&end=2018-03-26T20:33:00.770Z (10.244.244.124) 84.99ms
2018-03-26 15:33:00,918 temboard[48145]: [temboardui] INFO: 200 GET /server/xxxxxxxx.xxx/2345/monitoring/data/loadavg?interval=load5&start=2018-03-26T19:33:00.767Z&end=2018-03-26T20:33:00.767Z (10.244.244.124) 85.72ms

julmon commented 6 years ago

@bsislow according to the backtrace you attached, it seems you've commented out the ssl_cert_file parameter, but I was talking about commenting out the ssl_ca_cert_file parameter only. You can refer to https://github.com/dalibo/temboard-agent/blob/master/rpm/temboard-agent.rpm.conf and https://github.com/dalibo/temboard/blob/master/rpm/temboard.rpm.conf for the right SSL configuration working out of the box for RPM packages.

bsislow commented 6 years ago

ok i commented out ssl_ca_cert_file now on temboard server and all agents that have been deployed.

the front page charts are still empty as of this time for all 3 hosts.

julmon commented 6 years ago

@bsislow about empty charts on home page, I'm getting the same issue with an old firefox (44.0). A new ticket has been opened: #292

Let's see if @pgiraud can fix this :)

bsislow commented 6 years ago

i'm getting it on firefox 59.0.1 and chrome 65.0.3325.162

julmon commented 6 years ago

Me too with firefox 59.0, but works fine with chrome/chromium 64.0

bsislow commented 6 years ago

FYI we have a temboard server instance in EU and in the US. the EU instance was upgraded to 1.2 and the US one is a fresh install at 1.2. the front page charts work on the upgraded EU instance but not on the new US instance. anything we can compare?

julmon commented 6 years ago

I think this bug is due to the date format. It seems that the timezone offset (if any) could not be parsed: https://github.com/dalibo/temboard/issues/292#issuecomment-376324279. We're going to fix this ASAP.

Thank you for the report!

julmon commented 6 years ago

@bsislow temboard server 1.2.1 has been released.

Upgrade procedure with RPM:

sudo yum install https://packages.temboard.io/yum/rhel7/temboard-1.2.1-1.el7.centos.noarch.rpm
sudo systemctl restart temboard

Regards,

bsislow commented 6 years ago

@julmon - 1.2.1 fixed this!

bsislow commented 6 years ago

we still have an issue with the temBoard agent on one of our servers. the front page dash still shows nothing at all for it. the configuration is exactly the same as the others. what else can we check? the logs show nothing relevant and no ERRORs when navigating to the front page.

julmon commented 6 years ago

@bsislow could you please open a new issue for this?

Thanks,

bsislow commented 6 years ago

this was our mistake; we had two identical entries and this is why the chart did not show on the front page. however, you may want to build a check in to ensure users are not doing this based on host name AND port so the combination is a unique key.

you may close this case. thanks for your help!
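
The uniqueness check suggested above can be sketched as follows. This is only an illustration, not temboard code: the helper name duplicate_instances is hypothetical, and the choice of hostname plus pg_port as the key is an assumption, since the comment says "host name AND port" without saying which port column is meant.

```python
from collections import Counter


def duplicate_instances(instances, key_columns=("hostname", "pg_port")):
    """Return the key tuples that appear more than once among the rows.

    key_columns defaults to (hostname, pg_port); this is an assumption,
    pass ("hostname", "agent_port") if that is the intended pair.
    """
    counts = Counter(tuple(row[c] for c in key_columns) for row in instances)
    return sorted(key for key, n in counts.items() if n > 1)


# Illustrative rows shaped like application.instances (values are made up).
rows = [
    {"hostname": "db1.example", "pg_port": 5432, "agent_port": 2345},
    {"hostname": "db1.example", "pg_port": 5432, "agent_port": 2345},
    {"hostname": "db1.example", "pg_port": 5433, "agent_port": 2345},
]

dupes = duplicate_instances(rows)
print(dupes)
```

In a database, the equivalent safeguard would be a unique constraint on the chosen pair of columns, so that the second registration is rejected instead of silently hiding the charts.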