Looks like the issue was that some of the containers I had created were busted and were constantly restarting. This must have caused an error when metricbeat tried to read something.
When I updated the containers to remove the busted ones I got the following in my logs:
2017/02/13 19:46:48.065754 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.065750 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.065795 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.065910 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.067729 log.go:144: ERR recovered from panic while fetching 'docker/diskio' for host '/var/run/docker.sock'. Recovering, but please report this: runtime error: index out of range.
2017/02/13 19:46:48.067922 log.go:145: ERR Stacktrace: goroutine 76 [running]:
runtime/debug.Stack(0xbd423c, 0x2b, 0xc420584f80)
/usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/elastic/beats/libbeat/logp.Recover(0xc4203ef380, 0x53)
/go/src/github.com/elastic/beats/libbeat/logp/log.go:145 +0x138
panic(0xad8920, 0xc420010130)
/usr/local/go/src/runtime/panic.go:458 +0x243
github.com/elastic/beats/metricbeat/module/docker.ExtractContainerName(0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:38 +0x16b
github.com/elastic/beats/metricbeat/module/docker.NewContainer(0xc420585368, 0xc420a78090)
/go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:32 +0x4e
github.com/elastic/beats/metricbeat/module/docker/diskio.(*BLkioService).getBlkioStats(0xc4204f00c8, 0xc420585368, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/elastic/beats/metricbeat/module/docker/diskio/helper.go:52 +0x14f
github.com/elastic/beats/metricbeat/module/docker/diskio.(*BLkioService).getBlkioStatsList(0xc4204f00c8, 0xc42058d000, 0xd, 0xd, 0x0, 0x0, 0xc4284ecd42)
/go/src/github.com/elastic/beats/metricbeat/module/docker/diskio/helper.go:40 +0xe2
github.com/elastic/beats/metricbeat/module/docker/diskio.(*MetricSet).Fetch(0xc4204801b0, 0x284ecd42, 0x11594c0, 0x40d84f, 0x115d5c8, 0x7f3c9df5ca08)
/go/src/github.com/elastic/beats/metricbeat/module/docker/diskio/diskio.go:54 +0x7c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).multiEventFetch(0xc420232e40, 0x7f3c9df5ca08, 0xc4204801b0, 0xc420105c70, 0x1, 0xc4203ef380, 0x53, 0xc420105ea4)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:267 +0x6c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).fetch(0xc420232e40, 0xc420496720, 0xc420496840, 0x0, 0x0)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:229 +0x331
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).startFetching(0xc420232e40, 0xc420496720, 0xc420496840)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203 +0x2fe
github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start.func1(0xc420229220, 0xc420496720, 0xc420496840, 0xc420232e40)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:139 +0x67
created by github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:140 +0x1f9
2017/02/13 19:46:48.067742 log.go:144: ERR recovered from panic while fetching 'docker/memory' for host '/var/run/docker.sock'. Recovering, but please report this: runtime error: index out of range.
2017/02/13 19:46:48.068297 log.go:145: ERR Stacktrace: goroutine 79 [running]:
runtime/debug.Stack(0xbd423c, 0x2b, 0xc420250c78)
/usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/elastic/beats/libbeat/logp.Recover(0xc42034a2a0, 0x53)
/go/src/github.com/elastic/beats/libbeat/logp/log.go:145 +0x138
panic(0xad8920, 0xc420010130)
/usr/local/go/src/runtime/panic.go:458 +0x243
github.com/elastic/beats/metricbeat/module/docker.ExtractContainerName(0x0, 0x0, 0x0, 0xc4207300d1, 0x5)
/go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:38 +0x16b
github.com/elastic/beats/metricbeat/module/docker.NewContainer(0xc420250ee8, 0xc42001a5d0)
/go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:32 +0x4e
github.com/elastic/beats/metricbeat/module/docker/memory.(*MemoryService).GetMemoryStats(0x1176288, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/elastic/beats/metricbeat/module/docker/memory/helper.go:34 +0x60
github.com/elastic/beats/metricbeat/module/docker/memory.(*MemoryService).getMemoryStatsList(0x1176288, 0xc420766000, 0xd, 0xd, 0x0, 0x0, 0xc4284cb1cf)
/go/src/github.com/elastic/beats/metricbeat/module/docker/memory/helper.go:25 +0x104
github.com/elastic/beats/metricbeat/module/docker/memory.(*MetricSet).Fetch(0xc420480360, 0x284cb1cf, 0x11594c0, 0x40d84f, 0x115db50, 0x7f3c9df5cbb8)
/go/src/github.com/elastic/beats/metricbeat/module/docker/memory/memory.go:52 +0x7c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).multiEventFetch(0xc420233b40, 0x7f3c9df5cbb8, 0xc420480360, 0xc420073c70, 0x1, 0xc42034a2a0, 0x53, 0xc420073ea4)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:267 +0x6c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).fetch(0xc420233b40, 0xc420496720, 0xc420496840, 0x0, 0x0)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:229 +0x331
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).startFetching(0xc420233b40, 0xc420496720, 0xc420496840)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203 +0x2fe
github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start.func1(0xc420229220, 0xc420496720, 0xc420496840, 0xc420233b40)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:139 +0x67
created by github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:140 +0x1f9
2017/02/13 19:46:48.068707 log.go:144: ERR recovered from panic while fetching 'docker/cpu' for host '/var/run/docker.sock'. Recovering, but please report this: runtime error: index out of range.
2017/02/13 19:46:48.068851 log.go:145: ERR Stacktrace: goroutine 75 [running]:
runtime/debug.Stack(0xbd423c, 0x2b, 0xc420588fe0)
/usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/elastic/beats/libbeat/logp.Recover(0xc42048e6e0, 0x50)
/go/src/github.com/elastic/beats/libbeat/logp/log.go:145 +0x138
panic(0xad8920, 0xc420010130)
/usr/local/go/src/runtime/panic.go:458 +0x243
github.com/elastic/beats/metricbeat/module/docker.ExtractContainerName(0x0, 0x0, 0x0, 0xbb241d, 0x5)
/go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:38 +0x16b
github.com/elastic/beats/metricbeat/module/docker.NewContainer(0xc420589368, 0xc4207d0930)
/go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:32 +0x4e
github.com/elastic/beats/metricbeat/module/docker/cpu.(*CPUService).getCpuStats(0x1176288, 0xc420589368, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/elastic/beats/metricbeat/module/docker/cpu/helper.go:52 +0x64
github.com/elastic/beats/metricbeat/module/docker/cpu.(*CPUService).getCPUStatsList(0x1176288, 0xc420769000, 0xd, 0xd, 0x0, 0x0, 0xc4284d1261)
/go/src/github.com/elastic/beats/metricbeat/module/docker/cpu/helper.go:42 +0xeb
github.com/elastic/beats/metricbeat/module/docker/cpu.(*MetricSet).Fetch(0xc420480120, 0x284d1261, 0x11594c0, 0x40d84f, 0x115e8a0, 0x7f3c9df5c978)
/go/src/github.com/elastic/beats/metricbeat/module/docker/cpu/cpu.go:52 +0x7c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).multiEventFetch(0xc420232d00, 0x7f3c9df5c978, 0xc420480120, 0xc42041dc70, 0x1, 0xc42048e6e0, 0x50, 0xc42041dea4)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:267 +0x6c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).fetch(0xc420232d00, 0xc420496720, 0xc420496840, 0x0, 0x0)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:229 +0x331
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).startFetching(0xc420232d00, 0xc420496720, 0xc420496840)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203 +0x2fe
github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start.func1(0xc420229220, 0xc420496720, 0xc420496840, 0xc420232d00)
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:139 +0x67
created by github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start
/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:140 +0x1f9
2017/02/13 19:46:49.674812 client.go:291: ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value
2017/02/13 19:46:57.655048 metrics.go:39: INFO Non-zero metrics in the last 30s: fetches.docker-container.events=39 fetches.docker-container.success=3 fetches.docker-cpu.events=12 fetches.docker-cpu.success=1 fetches.docker-diskio.events=12 fetches.docker-diskio.success=1 fetches.docker-healthcheck.success=3 fetches.docker-info.events=3 fetches.docker-info.success=3 fetches.docker-memory.events=12 fetches.docker-memory.success=1 fetches.docker-network.success=2 libbeat.es.call_count.PublishEvents=4 libbeat.es.publish.read_bytes=17358 libbeat.es.publish.write_bytes=85118 libbeat.es.published_and_acked_events=77 libbeat.publisher.messages_in_worker_queues=78 libbeat.publisher.published_events=78
But now things seem to be reporting normally.
I'm going to leave this open, since the error reporting and the inability to exit metricbeat probably qualify this as a bug, but I'll wait for feedback.
The stack trace is coming from here, where we don't check that `names` has at least one entry before calling `names[0]`. That one should be easy to fix. Cc @ruflin
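For illustration, something along these lines would avoid the panic. This is only a sketch; the function name and the leading-slash handling are assumptions, not the actual helper from metricbeat/module/docker/helper.go:

```go
package docker

import "strings"

// extractContainerName is a hypothetical, nil-safe variant of the helper that
// panicked: it verifies the names slice has at least one entry before touching
// names[0].
func extractContainerName(names []string) string {
	if len(names) == 0 {
		return "" // the Docker API reported no name; don't index an empty slice
	}
	// Docker reports names with a leading "/", e.g. "/festive_lalande".
	return strings.TrimPrefix(names[0], "/")
}
```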
As for the first JSON encoding error, I'm not sure how it can happen. Perhaps @urso has an idea?
Interesting that a container can end up without a name. @spalger Which Docker version are you using?
For the long shutdown: if you have lots of containers, it can take quite a bit of time to shut down :-( The problem is that we currently have to make an HTTP request for each container, so these can queue up :-( How many containers did you have?
Docker version 1.13.1, build 092cba3, and I had 11 containers at that time, I think. I couldn't help but think that the metrics module was frozen or something. It wasn't using any resources, and the log kept reporting `No non-zero metrics in the last 30s` or something like that after the kill was sent.
Looks like metricbeat has frozen up again, but this time just because a container crashed and was not instructed to restart on crash. Here is the series of events as I observed them:

- I sent `ctrl-c` to stop metricbeat, but it does not close
- I ran `docker-compose up` to rebuild my containers
- I restarted metricbeat with the `-e` flag so I can produce the log output attached below

console output: https://gist.github.com/spalger/a179d18dfc13b2a006a529634a6404e7
If anyone would like to take a look at this machine, just ping me offline with your public key
@spalger Thanks for all the details. It seems metricbeat is hanging in a request to the docker stats API (my assumption). I need to find a good way to crash a container to reproduce this.
Btw, you can get a stacktrace from Metricbeat if you send it SIGQUIT. That might be helpful if it happens again.
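For reference, a Go binary prints the stacks of all goroutines to stderr and then exits when it receives SIGQUIT, so something like this captures where it is stuck (assuming you started metricbeat with `-e` so stderr is visible on the console):

```sh
# Dump all goroutine stacks; note the process exits afterwards.
kill -QUIT "$(pidof metricbeat)"   # or: pkill -QUIT metricbeat
```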
@ruflin it looks like the container doesn't need to crash; a short-lived container that exits with status 0 seems to have the same effect.
@spalger I managed to reproduce this. I opened https://github.com/elastic/beats/issues/3610, since this issue here also contains a second problem. This makes it easier to track and find.
@spalger I think I managed to find the bug and provide a fix here: https://github.com/elastic/beats/pull/3612 This should also fix the stack trace report you saw above.
Not sure if FreeBSD jails count as containers to metricbeat, but I see this error a lot on my FreeNAS box:
ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value
In fact, it's the only error in any of the logs.
@untergeek Does it still report any data? It seems the container returns an interesting "JSON" structure. It would be interesting to get hold of the JSON data that is received to have a closer look. I'm wondering whether this could be an encoding issue, or whether the server does not return JSON even though it should.
I'm also having this issue:
ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value
Versions:
- docker version: 1.26
- metricbeat version: 5.2.2 (amd64), libbeat 5.2.2
Metricbeat is set to pull the metricsets ['cpu', 'info', 'memory', 'network', 'diskio', 'container']. It's pulling `info`, `diskio`, `network`, `container` and `cpu`, but failing to pull `memory` (which I believe is part of this error).
Is there any way to get at the struct Docker returns for inspection?
@jess-lawrence-axomic You should be able to access the endpoint through `GET /containers/(id or name)/stats`. See https://docs.docker.com/engine/api/v1.24/. It would be great if you could paste the output here so we can use it for some testing.
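If it helps, the endpoint can also be queried directly over the local Docker socket with curl (7.40 or newer for `--unix-socket`); `stream=false` returns a single JSON document instead of a stream. The container ID below is a placeholder:

```sh
curl --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<container-id>/stats?stream=false"
```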
Hey @ruflin , here it is:
{
"blkio_stats": {
"io_merged_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 0
}
],
"io_queue_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 0
}
],
"io_service_bytes_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 389120
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 389120
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 389120
}
],
"io_service_time_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 24988598
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 24988598
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 24988598
}
],
"io_serviced_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 44
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 44
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 44
}
],
"io_time_recursive": [
{
"major": 202,
"minor": 0,
"op": "",
"value": 431
}
],
"io_wait_time_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 5653454862
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 5653454862
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 5653454862
}
],
"sectors_recursive": [
{
"major": 202,
"minor": 0,
"op": "",
"value": 760
}
]
},
"cpu_stats": {
"cpu_usage": {
"percpu_usage": [
375581793188,
203238317936
],
"total_usage": 578820111124,
"usage_in_kernelmode": 3860000000,
"usage_in_usermode": 10070000000
},
"system_cpu_usage": 642398270000000,
"throttling_data": {
"periods": 0,
"throttled_periods": 0,
"throttled_time": 0
}
},
"id": "0e921a961e2b1e671dd925440b2f8f072e772bc0e969734b00f00918984fe968",
"memory_stats": {},
"name": "/gallant_fermi",
"networks": {
"eth0": {
"rx_bytes": 202406,
"rx_dropped": 0,
"rx_errors": 0,
"rx_packets": 1253,
"tx_bytes": 119135,
"tx_dropped": 0,
"tx_errors": 0,
"tx_packets": 1313
}
},
"num_procs": 0,
"pids_stats": {},
"precpu_stats": {
"cpu_usage": {
"total_usage": 0,
"usage_in_kernelmode": 0,
"usage_in_usermode": 0
},
"throttling_data": {
"periods": 0,
"throttled_periods": 0,
"throttled_time": 0
}
},
"preread": "0001-01-01T00:00:00Z",
"read": "2017-03-27T10:09:05.342431137Z",
"storage_stats": {}
}
It seems to have omitted memory stats altogether. This node runs on an EC2 instance (HVM) so I suspect that could be causing issues?
Thanks for sharing the output. Very interesting. I'm curious what it means when the memory stats are empty. Perhaps @exekias or @douaejeouit know more?
That probably means the container is not running at the moment, so no mem info is reported. I'm afraid we need some more tracing to detect what happened there, as I don't see any value starting with N (from `invalid character 'N' looking for beginning of value`).
Thanks for sharing. According to the given logs, the problem occurs when the container stops while its stats are being fetched. When this happens, there are no stats to report. I think we should handle this case instead of just logging a warning (docker.go:130):
else if err == nil && stats == nil {
    logp.Warn("Container stopped when recovering stats: %v", container.ID)
}
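Roughly, the idea would be something like the sketch below; the helper name and signature are made up for illustration and don't match the real docker.go structure:

```go
package docker

import "fmt"

// checkStats turns the "container stopped while recovering stats" case into an
// error the metricset can report, instead of a WARN log followed by nil stats
// that later code indexes into. Hypothetical helper, not a tested patch.
func checkStats(containerID string, stats interface{}, err error) error {
	if err != nil {
		return fmt.Errorf("fetching stats for container %s: %v", containerID, err)
	}
	if stats == nil {
		return fmt.Errorf("container %s stopped while recovering stats", containerID)
	}
	return nil
}
```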
Concerning the omitted memory stats, it's probably not a bug, because logically, if that were the case, you wouldn't have had any output at all; the command would probably crash or report an error. But in this case you're getting your container statistics except for the memory ones. That's weird! It looks more like the memory stats and the storage stats were deliberately omitted, or at least that's the impression the output gives. I see that your Docker host is running on an EC2 instance; we should probably observe the stats command's behaviour there if possible. @jess-lawrence-axomic Could you please try it again with another container and share the output? Thank you in advance.
Good morning @douaejeouit @exekias
The container was running when I retrieved the stats from the docker API, here is another container on a new node/host:
admin@ip-10-0-16-176:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bf8be0a4026c 866167950344.dkr.ecr.us-east-1.amazonaws.com/openasset-prd:latest "./start.sh" 23 hours ago Up 23 hours 0.0.0.0:8392->80/tcp festive_lalande
And the stats:
{
"blkio_stats": {
"io_merged_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 0
}
],
"io_queue_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 0
}
],
"io_service_bytes_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 8192
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 8192
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 8192
}
],
"io_service_time_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 537011
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 537011
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 537011
}
],
"io_serviced_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 1
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 1
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 1
}
],
"io_time_recursive": [
{
"major": 202,
"minor": 0,
"op": "",
"value": 1
}
],
"io_wait_time_recursive": [
{
"major": 202,
"minor": 0,
"op": "Read",
"value": 181040
},
{
"major": 202,
"minor": 0,
"op": "Write",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Sync",
"value": 0
},
{
"major": 202,
"minor": 0,
"op": "Async",
"value": 181040
},
{
"major": 202,
"minor": 0,
"op": "Total",
"value": 181040
}
],
"sectors_recursive": [
{
"major": 202,
"minor": 0,
"op": "",
"value": 16
}
]
},
"cpu_stats": {
"cpu_usage": {
"percpu_usage": [
70936811668,
48850403781
],
"total_usage": 119787215449,
"usage_in_kernelmode": 1140000000,
"usage_in_usermode": 4780000000
},
"system_cpu_usage": 171407860000000,
"throttling_data": {
"periods": 0,
"throttled_periods": 0,
"throttled_time": 0
}
},
"id": "bf8be0a4026cbd5cf86748da7d27cc3e21f54976dbbd6f46d3bd03e6bf536d25",
"memory_stats": {},
"name": "/festive_lalande",
"networks": {
"eth0": {
"rx_bytes": 145851,
"rx_dropped": 0,
"rx_errors": 0,
"rx_packets": 876,
"tx_bytes": 94039,
"tx_dropped": 0,
"tx_errors": 0,
"tx_packets": 910
}
},
"num_procs": 0,
"pids_stats": {},
"precpu_stats": {
"cpu_usage": {
"total_usage": 0,
"usage_in_kernelmode": 0,
"usage_in_usermode": 0
},
"throttling_data": {
"periods": 0,
"throttled_periods": 0,
"throttled_time": 0
}
},
"preread": "0001-01-01T00:00:00Z",
"read": "2017-03-29T08:43:44.292542038Z",
"storage_stats": {}
}
However, I'm thinking this is more of a node config issue, as the memory cgroup isn't present:
admin@ip-10-0-16-176:~$ ls -l /sys/fs/cgroup/
total 0
dr-xr-xr-x 3 root root 0 Mar 28 08:58 blkio
lrwxrwxrwx 1 root root 11 Mar 28 08:58 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Mar 28 08:58 cpuacct -> cpu,cpuacct
dr-xr-xr-x 3 root root 0 Mar 28 08:58 cpu,cpuacct
dr-xr-xr-x 3 root root 0 Mar 28 08:58 cpuset
dr-xr-xr-x 3 root root 0 Mar 28 08:58 devices
dr-xr-xr-x 3 root root 0 Mar 28 08:58 freezer
lrwxrwxrwx 1 root root 16 Mar 28 08:58 net_cls -> net_cls,net_prio
dr-xr-xr-x 3 root root 0 Mar 28 08:58 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Mar 28 08:58 net_prio -> net_cls,net_prio
dr-xr-xr-x 3 root root 0 Mar 28 08:58 perf_event
dr-xr-xr-x 5 root root 0 Mar 28 08:58 systemd
Thank you for the quick reply. Yes, I would suspect that! Unfortunately, I don't have a clear idea of how the Docker instance is managed on EC2 :( We'll keep you updated as soon as we have some news.
I think a first improvement we could make to the docker metricsets is to check whether the JSON is empty before we start processing it, and return a more meaningful error instead of running into the JSON marshalling error. Not sure how tricky that will be with the Docker library we use.
Just an FYI: it turns out Debian disables the memory cgroup by default, and this solves it:
sed -i.bkup 's/\(GRUB_CMDLINE_LINUX_DEFAULT=.*\)"$/\1 cgroup_enable=memory"/g' /etc/default/grub
update-grub
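After rebooting with the new kernel command line, the memory controller should show up as enabled; one way to verify (assuming the usual /proc/cgroups and cgroup v1 layout):

```sh
grep memory /proc/cgroups      # last column (enabled) should be 1
ls -d /sys/fs/cgroup/memory    # the memory hierarchy should now be mounted
```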
@jess-lawrence-axomic that sounds good! Thank you for sharing the information. @ruflin, +1, yes, we should do it. I'm not sure how tricky it will be either, but I'm thinking about it.
The trick is to make the fields pointers here, to take advantage of the "omitempty" option declared for basically all the fields. If a field's tag is "omitempty" and it has an empty value, the field is omitted from the encoding. Empty values are false, 0, a nil pointer, a nil interface value, and any empty array, slice, map, or string.
I ran some tests where I declare an empty stats structure, and I'm getting this output, which I think represents the default values of the declared fields.

Output without pointers:
Not valid json: json: unsupported value: NaN

Output with pointer fields:
{"_module":{"container":{"id":"containerID","labels":{"label1":"val1","label2":"val2"},"name":"name1"}},"fail":{"count":0},"limit":0,"rss":{"pct":null,"total":0},"usage":{"max":0,"pct":null,"total":0}}
@ruflin what do you think about it?
That looks very promising. Instead of reporting the default values, I would suggest reporting an error, something like memory values are not available. Then, in the case of @jess-lawrence-axomic, he will see in the logs that the memory values are not available and can turn off the metricset. Just providing default values would lead to the perception that it is working when it isn't.
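A sketch of that suggestion, with a made-up struct and function name just to show the shape of the check, not the actual metricbeat types:

```go
package memory

import "errors"

// dockerMemoryStats is a trimmed-down stand-in for the structure decoded from
// the Docker stats API; only the fields used for the emptiness check are shown.
type dockerMemoryStats struct {
	Limit uint64
	Usage uint64
}

// validateMemoryStats reports an explicit error when the daemon returned an
// empty memory_stats object (e.g. because the memory cgroup is disabled),
// instead of emitting zeroed values that look like real measurements.
func validateMemoryStats(s dockerMemoryStats) error {
	if s.Limit == 0 && s.Usage == 0 {
		return errors.New("memory values are not available (is the memory cgroup enabled?)")
	}
	return nil
}
```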
Update: Actually, the Not valid json: json: unsupported value: NaN happens simply because of the value stored here. Since an "empty" stats struct is basically zeroed allocated memory (it's a pointer to a newly allocated zero value of the structure), its fields are equal to 0 at creation; this is what the new function does. So if there are any statistics to read, the values of the associated fields get overwritten; otherwise they stay 0. In this case we end up applying a division whose denominator is 0, which naturally produces the NaN value. I think the issue comes from there and not from the Marshal function, because 0 values of numbers are considered empty values => they are omitted. Please correct me if I'm wrong, guys!! Here are my thoughts:
If a value is set (even if it is 0), it will be reported. If we don't want to report it, we need to remove it from the event, as otherwise it will be hard to tell whether it is genuinely 0 or accidentally 0 (as you pointed out). So if we don't get any JSON back, we should return an error and not any values, I would think?
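For the division itself, a simple guard avoids ever handing encoding/json a NaN, which it refuses to marshal. This is only a sketch of the guard; whether the right behaviour is 0, dropping the field, or returning an error is exactly the open question above:

```go
package docker

// safePercentage returns usage/limit, falling back to 0 when the limit is 0
// (e.g. an empty memory_stats object), so the result can never be NaN.
func safePercentage(usage, limit uint64) float64 {
	if limit == 0 {
		return 0
	}
	return float64(usage) / float64(limit)
}
```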
Based on @douaejeouit's last comment I tried to create a quick fix to just send 0 when there would be a NaN (see the referenced commit above). The issue is still present though:
2017-06-15T12:35:21+02:00 ERR Failed to publish events caused by: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value
2017-06-15T12:35:21+02:00 INFO Error publishing events (retrying): json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value
Either I failed to fix the code (my first lines of Go) or the NaN is coming from elsewhere. Is there a way I could find out what is logging that error? I also wondered whether MarshalJSON could serialize NaN into something sensible, but based on the discussions here it seems that's not possible. So I guess any NaN would have to be handled on your side?
As the error here does not have a stack trace, it's tricky to figure out where it comes from. What comes before and after it could be interesting, especially if you have debug enabled.
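One way to get that context is to run in the foreground with all debug selectors enabled; the debug lines around the MarshalJSON error usually show which metricset produced the offending event:

```sh
./metricbeat -e -d "*"
```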
This looks like the issue that was fixed by #11676 and backported to 6.7 and 7.0
I ran into the invalid character 'N' log line while reproducing this; it is indeed a NaN from division by zero. I'm going to close this.
What I think happened:

- I switched from the `system` module to the `docker` module
- I started `./metricbeat` but it wasn't responding to `ctrl+c`
- I sent `kill -9`
- I updated the `metricsets` config
- now I see `ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value`
I've tried to fix this by wiping out the metricbeat data directory and restarting the box, but am unable to get it to work anymore.