elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

metricbeat: error calling MarshalJSON for type common.Float #3580

Closed: spalger closed this issue 5 years ago

spalger commented 7 years ago

What I think happened:

I've tried to fix this by wiping out the metricbeat data directory and restarting the box, but I am unable to get it working anymore.

spalger commented 7 years ago

Looks like the issue was that some of the containers I had created were busted and were constantly restarting. This must have caused an error when metricbeat tried to read something.

When I updated the containers to remove the busted ones, I got the following in my logs:

2017/02/13 19:46:48.065754 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.065750 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.065795 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.065910 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.067729 log.go:144: ERR recovered from panic while fetching 'docker/diskio' for host '/var/run/docker.sock'. Recovering, but please report this: runtime error: index out of range.
2017/02/13 19:46:48.067922 log.go:145: ERR Stacktrace: goroutine 76 [running]:
runtime/debug.Stack(0xbd423c, 0x2b, 0xc420584f80)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/elastic/beats/libbeat/logp.Recover(0xc4203ef380, 0x53)
        /go/src/github.com/elastic/beats/libbeat/logp/log.go:145 +0x138
panic(0xad8920, 0xc420010130)
        /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/elastic/beats/metricbeat/module/docker.ExtractContainerName(0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:38 +0x16b
github.com/elastic/beats/metricbeat/module/docker.NewContainer(0xc420585368, 0xc420a78090)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:32 +0x4e
github.com/elastic/beats/metricbeat/module/docker/diskio.(*BLkioService).getBlkioStats(0xc4204f00c8, 0xc420585368, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/diskio/helper.go:52 +0x14f
github.com/elastic/beats/metricbeat/module/docker/diskio.(*BLkioService).getBlkioStatsList(0xc4204f00c8, 0xc42058d000, 0xd, 0xd, 0x0, 0x0, 0xc4284ecd42)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/diskio/helper.go:40 +0xe2
github.com/elastic/beats/metricbeat/module/docker/diskio.(*MetricSet).Fetch(0xc4204801b0, 0x284ecd42, 0x11594c0, 0x40d84f, 0x115d5c8, 0x7f3c9df5ca08)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/diskio/diskio.go:54 +0x7c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).multiEventFetch(0xc420232e40, 0x7f3c9df5ca08, 0xc4204801b0, 0xc420105c70, 0x1, 0xc4203ef380, 0x53, 0xc420105ea4)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:267 +0x6c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).fetch(0xc420232e40, 0xc420496720, 0xc420496840, 0x0, 0x0)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:229 +0x331
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).startFetching(0xc420232e40, 0xc420496720, 0xc420496840)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203 +0x2fe
github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start.func1(0xc420229220, 0xc420496720, 0xc420496840, 0xc420232e40)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:139 +0x67
created by github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:140 +0x1f9
2017/02/13 19:46:48.067742 log.go:144: ERR recovered from panic while fetching 'docker/memory' for host '/var/run/docker.sock'. Recovering, but please report this: runtime error: index out of range.
2017/02/13 19:46:48.068297 log.go:145: ERR Stacktrace: goroutine 79 [running]:
runtime/debug.Stack(0xbd423c, 0x2b, 0xc420250c78)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/elastic/beats/libbeat/logp.Recover(0xc42034a2a0, 0x53)
        /go/src/github.com/elastic/beats/libbeat/logp/log.go:145 +0x138
panic(0xad8920, 0xc420010130)
        /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/elastic/beats/metricbeat/module/docker.ExtractContainerName(0x0, 0x0, 0x0, 0xc4207300d1, 0x5)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:38 +0x16b
github.com/elastic/beats/metricbeat/module/docker.NewContainer(0xc420250ee8, 0xc42001a5d0)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:32 +0x4e
github.com/elastic/beats/metricbeat/module/docker/memory.(*MemoryService).GetMemoryStats(0x1176288, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/memory/helper.go:34 +0x60
github.com/elastic/beats/metricbeat/module/docker/memory.(*MemoryService).getMemoryStatsList(0x1176288, 0xc420766000, 0xd, 0xd, 0x0, 0x0, 0xc4284cb1cf)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/memory/helper.go:25 +0x104
github.com/elastic/beats/metricbeat/module/docker/memory.(*MetricSet).Fetch(0xc420480360, 0x284cb1cf, 0x11594c0, 0x40d84f, 0x115db50, 0x7f3c9df5cbb8)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/memory/memory.go:52 +0x7c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).multiEventFetch(0xc420233b40, 0x7f3c9df5cbb8, 0xc420480360, 0xc420073c70, 0x1, 0xc42034a2a0, 0x53, 0xc420073ea4)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:267 +0x6c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).fetch(0xc420233b40, 0xc420496720, 0xc420496840, 0x0, 0x0)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:229 +0x331
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).startFetching(0xc420233b40, 0xc420496720, 0xc420496840)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203 +0x2fe
github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start.func1(0xc420229220, 0xc420496720, 0xc420496840, 0xc420233b40)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:139 +0x67
created by github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:140 +0x1f9
2017/02/13 19:46:48.068707 log.go:144: ERR recovered from panic while fetching 'docker/cpu' for host '/var/run/docker.sock'. Recovering, but please report this: runtime error: index out of range.
2017/02/13 19:46:48.068851 log.go:145: ERR Stacktrace: goroutine 75 [running]:
runtime/debug.Stack(0xbd423c, 0x2b, 0xc420588fe0)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/elastic/beats/libbeat/logp.Recover(0xc42048e6e0, 0x50)
        /go/src/github.com/elastic/beats/libbeat/logp/log.go:145 +0x138
panic(0xad8920, 0xc420010130)
        /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/elastic/beats/metricbeat/module/docker.ExtractContainerName(0x0, 0x0, 0x0, 0xbb241d, 0x5)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:38 +0x16b
github.com/elastic/beats/metricbeat/module/docker.NewContainer(0xc420589368, 0xc4207d0930)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:32 +0x4e
github.com/elastic/beats/metricbeat/module/docker/cpu.(*CPUService).getCpuStats(0x1176288, 0xc420589368, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/cpu/helper.go:52 +0x64
github.com/elastic/beats/metricbeat/module/docker/cpu.(*CPUService).getCPUStatsList(0x1176288, 0xc420769000, 0xd, 0xd, 0x0, 0x0, 0xc4284d1261)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/cpu/helper.go:42 +0xeb
github.com/elastic/beats/metricbeat/module/docker/cpu.(*MetricSet).Fetch(0xc420480120, 0x284d1261, 0x11594c0, 0x40d84f, 0x115e8a0, 0x7f3c9df5c978)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/cpu/cpu.go:52 +0x7c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).multiEventFetch(0xc420232d00, 0x7f3c9df5c978, 0xc420480120, 0xc42041dc70, 0x1, 0xc42048e6e0, 0x50, 0xc42041dea4)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:267 +0x6c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).fetch(0xc420232d00, 0xc420496720, 0xc420496840, 0x0, 0x0)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:229 +0x331
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).startFetching(0xc420232d00, 0xc420496720, 0xc420496840)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203 +0x2fe
github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start.func1(0xc420229220, 0xc420496720, 0xc420496840, 0xc420232d00)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:139 +0x67
created by github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:140 +0x1f9
2017/02/13 19:46:49.674812 client.go:291: ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value
2017/02/13 19:46:57.655048 metrics.go:39: INFO Non-zero metrics in the last 30s: fetches.docker-container.events=39 fetches.docker-container.success=3 fetches.docker-cpu.events=12 fetches.docker-cpu.success=1 fetches.docker-diskio.events=12 fetches.docker-diskio.success=1 fetches.docker-healthcheck.success=3 fetches.docker-info.events=3 fetches.docker-info.success=3 fetches.docker-memory.events=12 fetches.docker-memory.success=1 fetches.docker-network.success=2 libbeat.es.call_count.PublishEvents=4 libbeat.es.publish.read_bytes=17358 libbeat.es.publish.write_bytes=85118 libbeat.es.published_and_acked_events=77 libbeat.publisher.messages_in_worker_queues=78 libbeat.publisher.published_events=78

But now things seem to be reporting normally.

I'm going to leave this open, since the error reporting and the inability to cleanly exit metricbeat probably make this a bug, but I'll wait for feedback.

tsg commented 7 years ago

The stack trace is coming from here, where we don't check that the names slice has at least one entry before accessing names[0]. That one should be easy to fix. Cc @ruflin
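
For context, a minimal sketch of the guard described above; the lower-case helper name and body are illustrative only, not the actual code or fix in the repo:

package docker

import "strings"

// extractContainerName is an illustrative stand-in for the helper in the
// stack trace: it must not index names[0] when the slice is empty, which is
// exactly the "index out of range" panic reported above.
func extractContainerName(names []string) string {
    if len(names) == 0 {
        return "" // unnamed or short-lived container: avoid the panic
    }
    // Docker reports names with a leading slash, e.g. "/festive_lalande".
    return strings.TrimPrefix(names[0], "/")
}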

I'm not sure how the first JSON encoding error can happen. Perhaps @urso has an idea?

ruflin commented 7 years ago

Interesting that a container can end up without a name. @spalger Which Docker version are you using?

For the long shutdown: if you have lots of containers, it can take quite a bit of time to shut down :-( The problem is that we currently have to make an HTTP request per container, so these can queue up :-( How many containers did you have?

spalger commented 7 years ago

Docker version 1.13.1, build 092cba3, and I had 11 containers at that time, I think. I couldn't help but think that the metrics module was frozen or something: it wasn't using any resources, and after the kill was sent the log kept reporting No non-zero metrics in the last 30s, or something like that.

spalger commented 7 years ago

Looks like metricbeat has frozen up again, but this time just because a container crashed and was not instructed to restart on crash. Here is the series of events as I observed them:

console output: https://gist.github.com/spalger/a179d18dfc13b2a006a529634a6404e7

spalger commented 7 years ago

If anyone would like to take a look at this machine, just ping me offline with your public key.

ruflin commented 7 years ago

@spalger Thanks for all the details. It seems metricbeat is hanging in a request to the Docker stats API (my assumption). I need to find a good way to crash a container so I can reproduce this.

tsg commented 7 years ago

Btw, you can get a stacktrace from Metricbeat if you send it SIGQUIT. That might be helpful if it happens again.

spalger commented 7 years ago

@ruflin looks like the container doesn't need to crash; a short-lived container that exits with status 0 seems to have the same effect.

ruflin commented 7 years ago

@spalger I managed to reproduce this. I opened https://github.com/elastic/beats/issues/3610, since this issue also contains a second problem; that makes it easier to track and find.

ruflin commented 7 years ago

@spalger I think I managed to find the bug and provide a fix here: https://github.com/elastic/beats/pull/3612. This should also fix the stack trace report you saw above.

untergeek commented 7 years ago

Not sure if FreeBSD jails count as containers to metricbeat, but I see this error a lot on my FreeNAS box:

ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value

In fact, it's the only error in any of the logs.

ruflin commented 7 years ago

@untergeek Does it still report any data? It seems the container returns an interesting "JSON" structure. It would be interesting to get hold of the JSON data that is received to have a closer look. I'm wondering whether this could be an encoding issue, or whether the server does not return JSON even though it should.

jess-lawrence-axomic commented 7 years ago

I'm also having this issue:

ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value

Versions:

docker version: 1.26
metricbeat version: 5.2.2 (amd64), libbeat 5.2.2

Metricbeat is configured to pull the metricsets: ['cpu', 'info', 'memory', 'network', 'diskio', 'container']

It's pulling info, diskio, network, container, and cpu, but failing to pull memory (which I believe is part of this error).

Is there any way to get at the struct Docker returns for inspection?

ruflin commented 7 years ago

@jess-lawrence-axomic You should be able to access the endpoint through GET /containers/(id or name)/stats. See https://docs.docker.com/engine/api/v1.24/. It would be great if you could paste the output here so we can use it for some testing.
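
If it helps, here is a small, hedged Go sketch for grabbing a single stats document from the local daemon; it assumes access to /var/run/docker.sock and uses the stream=false query parameter so the endpoint returns one JSON document instead of a stream:

package main

import (
    "context"
    "fmt"
    "io"
    "net"
    "net/http"
    "os"
)

func main() {
    // Talk to the Docker daemon over its unix socket instead of TCP.
    client := &http.Client{
        Transport: &http.Transport{
            DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                var d net.Dialer
                return d.DialContext(ctx, "unix", "/var/run/docker.sock")
            },
        },
    }

    // Usage: go run main.go <container-id-or-name>
    id := os.Args[1]
    resp, err := client.Get("http://unix/containers/" + id + "/stats?stream=false")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer resp.Body.Close()
    io.Copy(os.Stdout, resp.Body) // raw JSON, ready to paste for inspection
}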

jess-lawrence-axomic commented 7 years ago

Hey @ruflin , here it is:

{
    "blkio_stats": {
        "io_merged_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 0
            }
        ],
        "io_queue_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 0
            }
        ],
        "io_service_bytes_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 389120
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 389120
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 389120
            }
        ],
        "io_service_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 24988598
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 24988598
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 24988598
            }
        ],
        "io_serviced_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 44
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 44
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 44
            }
        ],
        "io_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "",
                "value": 431
            }
        ],
        "io_wait_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 5653454862
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 5653454862
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 5653454862
            }
        ],
        "sectors_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "",
                "value": 760
            }
        ]
    },
    "cpu_stats": {
        "cpu_usage": {
            "percpu_usage": [
                375581793188,
                203238317936
            ],
            "total_usage": 578820111124,
            "usage_in_kernelmode": 3860000000,
            "usage_in_usermode": 10070000000
        },
        "system_cpu_usage": 642398270000000,
        "throttling_data": {
            "periods": 0,
            "throttled_periods": 0,
            "throttled_time": 0
        }
    },
    "id": "0e921a961e2b1e671dd925440b2f8f072e772bc0e969734b00f00918984fe968",
    "memory_stats": {},
    "name": "/gallant_fermi",
    "networks": {
        "eth0": {
            "rx_bytes": 202406,
            "rx_dropped": 0,
            "rx_errors": 0,
            "rx_packets": 1253,
            "tx_bytes": 119135,
            "tx_dropped": 0,
            "tx_errors": 0,
            "tx_packets": 1313
        }
    },
    "num_procs": 0,
    "pids_stats": {},
    "precpu_stats": {
        "cpu_usage": {
            "total_usage": 0,
            "usage_in_kernelmode": 0,
            "usage_in_usermode": 0
        },
        "throttling_data": {
            "periods": 0,
            "throttled_periods": 0,
            "throttled_time": 0
        }
    },
    "preread": "0001-01-01T00:00:00Z",
    "read": "2017-03-27T10:09:05.342431137Z",
    "storage_stats": {}
}

It seems to have omitted memory stats altogether. This node runs on an EC2 instance (HVM), so I suspect that could be causing the issue.

ruflin commented 7 years ago

Thanks for sharing the output. Very interesting. I'm curious what it means when the memory stats are empty. Perhaps @exekias or @douaejeouit know more?

exekias commented 7 years ago

That probably means the container is not running at the moment, so no memory info is reported. I'm afraid we need some more tracing to detect what happened there, as I don't see any value starting with 'N' (from invalid character 'N' looking for beginning of value).

douaejeouit commented 7 years ago

Thanks for sharing. According to the given logs, the problem occurs when the container stops while we are fetching stats. When this happens, there are no stats to report. I think we should handle this case instead of just logging a warning (docker.go:130):

else if err == nil && stats == nil {
            logp.Warn("Container stopped when recovering stats: %v", container.ID)
        }

Concerning the omitted memory stats, it's probably not a bug because, logically, if that were the case, you wouldn't have had any output; the command would probably crash or report an error. But in this case you're getting your container statistics except for the memory ones. That's weird! It looks more like the memory stats and the storage stats were deliberately omitted, or at least that's the impression the output gives. I see that your Docker is running on an EC2 instance; we should probably observe the stats command's behaviour if possible. @jess-lawrence-axomic Could you please try it again with another container and share the output? Thank you in advance.

jess-lawrence-axomic commented 7 years ago

Good morning @douaejeouit @exekias

The container was running when I retrieved the stats from the Docker API. Here is another container on a new node/host:

admin@ip-10-0-16-176:~$ docker ps
CONTAINER ID        IMAGE                                                               COMMAND             CREATED             STATUS              PORTS                  NAMES
bf8be0a4026c        866167950344.dkr.ecr.us-east-1.amazonaws.com/openasset-prd:latest   "./start.sh"        23 hours ago        Up 23 hours         0.0.0.0:8392->80/tcp   festive_lalande

And the stats:

{
    "blkio_stats": {
        "io_merged_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 0
            }
        ],
        "io_queue_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 0
            }
        ],
        "io_service_bytes_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 8192
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 8192
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 8192
            }
        ],
        "io_service_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 537011
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 537011
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 537011
            }
        ],
        "io_serviced_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 1
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 1
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 1
            }
        ],
        "io_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "",
                "value": 1
            }
        ],
        "io_wait_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 181040
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 181040
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 181040
            }
        ],
        "sectors_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "",
                "value": 16
            }
        ]
    },
    "cpu_stats": {
        "cpu_usage": {
            "percpu_usage": [
                70936811668,
                48850403781
            ],
            "total_usage": 119787215449,
            "usage_in_kernelmode": 1140000000,
            "usage_in_usermode": 4780000000
        },
        "system_cpu_usage": 171407860000000,
        "throttling_data": {
            "periods": 0,
            "throttled_periods": 0,
            "throttled_time": 0
        }
    },
    "id": "bf8be0a4026cbd5cf86748da7d27cc3e21f54976dbbd6f46d3bd03e6bf536d25",
    "memory_stats": {},
    "name": "/festive_lalande",
    "networks": {
        "eth0": {
            "rx_bytes": 145851,
            "rx_dropped": 0,
            "rx_errors": 0,
            "rx_packets": 876,
            "tx_bytes": 94039,
            "tx_dropped": 0,
            "tx_errors": 0,
            "tx_packets": 910
        }
    },
    "num_procs": 0,
    "pids_stats": {},
    "precpu_stats": {
        "cpu_usage": {
            "total_usage": 0,
            "usage_in_kernelmode": 0,
            "usage_in_usermode": 0
        },
        "throttling_data": {
            "periods": 0,
            "throttled_periods": 0,
            "throttled_time": 0
        }
    },
    "preread": "0001-01-01T00:00:00Z",
    "read": "2017-03-29T08:43:44.292542038Z",
    "storage_stats": {}
}

However, I'm thinking this is more of a node config issue, as the memory cgroup isn't present:

admin@ip-10-0-16-176:~$ ls -l /sys/fs/cgroup/
total 0
dr-xr-xr-x 3 root root  0 Mar 28 08:58 blkio
lrwxrwxrwx 1 root root 11 Mar 28 08:58 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Mar 28 08:58 cpuacct -> cpu,cpuacct
dr-xr-xr-x 3 root root  0 Mar 28 08:58 cpu,cpuacct
dr-xr-xr-x 3 root root  0 Mar 28 08:58 cpuset
dr-xr-xr-x 3 root root  0 Mar 28 08:58 devices
dr-xr-xr-x 3 root root  0 Mar 28 08:58 freezer
lrwxrwxrwx 1 root root 16 Mar 28 08:58 net_cls -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Mar 28 08:58 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Mar 28 08:58 net_prio -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Mar 28 08:58 perf_event
dr-xr-xr-x 5 root root  0 Mar 28 08:58 systemd

douaejeouit commented 7 years ago

Thank you for the quick reply. Yes, I would suspect that! Unfortunately, I don't have a clear idea about how the Docker instance is managed on EC2 :( We'll keep you updated if we get any news ASAP.

ruflin commented 7 years ago

I think a first improvement we could make on the docker metricsets is to check whether the JSON is empty before we start processing it, and return a more meaningful error instead of running into the JSON marshalling error. Not sure how tricky that will be with the docker library we use.
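
A hedged sketch of that direction (the type and helper below are made up for illustration, not the docker library's or metricbeat's actual types): detect an effectively empty stats section and return a descriptive error before any math happens.

package docker

import "errors"

// MemoryStatsDoc is a minimal stand-in for the memory_stats section of the
// Docker stats JSON; the real type lives in the Docker client library.
type MemoryStatsDoc struct {
    Limit    uint64
    Usage    uint64
    MaxUsage uint64
}

// errNoMemoryStats is what the metricset could surface instead of silently
// producing NaN values further down the pipeline.
var errNoMemoryStats = errors.New("docker reported no memory stats; is the memory cgroup enabled on the host?")

// validateMemoryStats is a hypothetical guard: refuse to compute percentages
// when the limit is zero, which would otherwise end in a division by zero.
func validateMemoryStats(s MemoryStatsDoc) error {
    if s.Limit == 0 {
        return errNoMemoryStats
    }
    return nil
}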

jess-lawrence-axomic commented 7 years ago

Just an FYI: it turns out Debian disables the memory cgroup by default; this solves it:

sed -i.bkup 's/\(GRUB_CMDLINE_LINUX_DEFAULT=.*\)"$/\1 cgroup_enable=memory"/g' /etc/default/grub
update-grub

douaejeouit commented 7 years ago

@jess-lawrence-axomic that sounds good! Thank you for sharing the information. @ruflin, +1. Yes, we should do it. I'm not sure either, but I'm thinking about it.

douaejeouit commented 7 years ago

The trick is to make the fields pointers here, to take advantage of the "omitempty" option declared for basically all the fields. If a field's tag is "omitempty" and it has an empty value, the field is omitted from the encoding. Empty values are false, 0, a nil pointer, a nil interface value, and any empty array, slice, map, or string.

I ran some tests where I declare an empty stats structure, and I'm getting this output, which I think represents the default values of the declared fields. Output without pointers:

returned : Not valid json: json: unsupported value: NaN

Output with pointer fields:

{"_module":{"container":{"id":"containerID","labels":{"label1":"val1","label2":"val2"},"name":"name1"}},"fail":{"count":0},"limit":0,"rss":{"pct":null,"total":0},"usage":{"max":0,"pct":null,"total":0}}

@ruflin what do you think about it?
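
To make the idea concrete, here is a small, self-contained illustration (not the metricbeat types) of pointer fields plus omitempty: a nil pointer is dropped from the JSON entirely, while a plain numeric field is always emitted, even when it was never computed.

package main

import (
    "encoding/json"
    "fmt"
)

type memoryEventValue struct {
    Total uint64   `json:"total"`
    Pct   *float64 `json:"pct,omitempty"` // nil pointer => field omitted entirely
}

func main() {
    // No percentage available (e.g. memory_stats was empty): leave Pct nil.
    unavailable, _ := json.Marshal(memoryEventValue{Total: 0})
    fmt.Println(string(unavailable)) // {"total":0}

    // Percentage available: point at the computed value.
    pct := 0.42
    available, _ := json.Marshal(memoryEventValue{Total: 1024, Pct: &pct})
    fmt.Println(string(available)) // {"total":1024,"pct":0.42}
}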

ruflin commented 7 years ago

That looks very promising. Instead of reporting the default values, I would suggest reporting an error, something like "memory values are not available". Then, in the case of @jess-lawrence-axomic, he will see in the logs that the memory values are not available and can turn off the metricset. Just providing default values would create the perception that it is working when it isn't.

douaejeouit commented 7 years ago

Update: Actually, the

Not valid json: json: unsupported value: NaN

happens simply because of the value stored here. Since an "empty" stats structure is basically zeroed allocated memory (it's a pointer to a newly allocated zero value of the structure), its fields are equal to 0 at creation; that is what the new function does. So if there are statistics to read, the values of the associated fields get overwritten; otherwise they stay 0. In this case we're applying a division where the denominator is 0, which naturally produces the NaN value. I think the issue comes from here and not from the Marshal function, because the 0 values of numbers are considered empty values => they are omitted. Please correct me if I'm wrong, guys!! Here are my thoughts:
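
The mechanism is easy to reproduce in isolation; this tiny, purely illustrative program shows a zero-valued denominator producing NaN and encoding/json refusing to marshal it:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    var usage, limit float64 // zero values, as in an "empty" stats struct
    pct := usage / limit     // 0 / 0 => NaN

    _, err := json.Marshal(map[string]float64{"pct": pct})
    fmt.Println(pct, err) // NaN json: unsupported value: NaN
}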

ruflin commented 7 years ago

If a value is set (even to 0) it will be reported. If we don't want to report it, we need to remove it from the event, as otherwise it will be hard to tell whether it is genuinely 0 or accidentally 0 (as you pointed out). So if we don't get any JSON back, we should return an error and no values, I would think?

xificurC commented 7 years ago

Based on @douaejeouit's last comment, I tried to create a quick fix that just sends 0 where there would be a NaN (see the referenced commit above). The issue is still present though:

2017-06-15T12:35:21+02:00 ERR Failed to publish events caused by: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value
2017-06-15T12:35:21+02:00 INFO Error publishing events (retrying): json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value

Either I failed to fix the code (these are my first lines of Go) or the NaN is coming from somewhere else. Is there a way I could find out what is logging that error?

I also wondered whether MarshalJSON could serialize NaN into something sensible, but based on the discussions here it seems that's not possible. So I guess any NaN would have to be handled on your side?
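
For what it's worth, a custom marshaler can absorb NaN before it reaches the encoder. The sketch below is illustrative only; it is not how libbeat's common.Float actually behaves, and emitting null (or 0) has the "is it really 0?" ambiguity discussed earlier.

package main

import (
    "encoding/json"
    "fmt"
    "math"
    "strconv"
)

// safeFloat is a hypothetical float wrapper whose MarshalJSON emits null
// instead of failing when the value is NaN or infinite.
type safeFloat float64

func (f safeFloat) MarshalJSON() ([]byte, error) {
    v := float64(f)
    if math.IsNaN(v) || math.IsInf(v, 0) {
        return []byte("null"), nil
    }
    return []byte(strconv.FormatFloat(v, 'f', -1, 64)), nil
}

func main() {
    out, err := json.Marshal(map[string]safeFloat{"pct": safeFloat(math.NaN())})
    fmt.Println(string(out), err) // {"pct":null} <nil>
}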

ruflin commented 7 years ago

As the error here does not have a stack trace, it's tricky to figure out where it comes from. What appears before and after it could be interesting, especially if you have debug enabled.

fearful-symmetry commented 5 years ago

This looks like the issue that was fixed by #11676 and backported to 6.7 and 7.0. I ran into the invalid character 'N' log line while reproducing this; it is indeed a NaN from a division by zero. I'm going to close this.