eBayClassifiedsGroup / PanteraS

PanteraS - PaaS - Platform as a Service in a box

Slave statistics endpoint empty #229

Closed: cookandy closed this issue 8 years ago

cookandy commented 8 years ago

Any ideas why the /monitor/statistics endpoint returns an empty []? I briefly looked through the Mesos issue tracker but didn't find anything. Wondering if you're experiencing the same...

root@03:~# curl -v http://10.10.23.59:5051/monitor/statistics
*   Trying 10.134.15.87...
* Connected to 10.134.15.87 (10.134.15.87) port 5051 (#0)
> GET /monitor/statistics HTTP/1.1
> Host: 10.134.15.87:5051
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 28 Oct 2016 19:11:21 GMT
< Content-Length: 2
< Content-Type: application/json
<
* Connection #0 to host 10.134.15.87 left intact
[]
sielaq commented 8 years ago

To be honest, we mostly use the /metrics/snapshot endpoint for metrics, with collectd to collect them (e.g. https://github.com/rayrod2030/collectd-mesos). Per-container app metrics (CPU, memory, etc.) we take from Docker instead, e.g. https://github.com/bogus-py/docker-collectd-plugin.
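For reference, a minimal sketch of pulling metrics from that endpoint, assuming the same agent address and port as in the curl output above (python -m json.tool is only for pretty-printing; the keys named in the comment are standard Mesos agent counters):

curl -s http://10.134.15.87:5051/metrics/snapshot | python -m json.tool
# returns one flat JSON object of counters and gauges,
# e.g. "slave/cpus_total", "slave/mem_total", "slave/tasks_running"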

/monitor/statistics seems to return either nothing ([]) or very truncated information, e.g. one app instead of the ten that are running:

[
    {
        "executor_id": "app.8e2df3af-9c58-11e6-b4ed-0242289818c8",
        "executor_name": "Command Executor (Task: app.8e2df3af-9c58-11e6-b4ed-0242289818c8) (Command: NO EXECUTABLE)",
        "framework_id": "b2587f4a-53f7-40e0-a565-a89ec175a650-0000",
        "source": "app.8e2df3af-9c58-11e6-b4ed-0242289818c8",
        "statistics": {
            "cpus_limit": 0.2,
            "cpus_system_time_secs": 132569.84,
            "cpus_user_time_secs": 76519.61,
            "mem_limit_bytes": 570425344,
            "mem_rss_bytes": 323592192,
            "timestamp": 1477812340.71858
        }
    }
]
cookandy commented 8 years ago

I am seeing things like this in the Mesos agent logs:

Failed to get resource statistics for executor 'sysdig-agent.1951556a-9ad9-11e6-a24a-0242aec327c9' of framework efaaca88-e937-4288-b929-2a0bd940e70a-0000: Failed to collect cgroup stats: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/2653/cgroup: Failed to open file: No such file or directory

Is it because /proc is mapped to /host/proc?

/proc:/host/proc:ro

Why is it done this way? Can we tell the Mesos agent to use /host/proc and /host/sys instead?

See this: https://issues.apache.org/jira/browse/MESOS-3533?focusedCommentId=14933968&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14933968

sielaq commented 8 years ago

Is it because /proc is mapped to /host/proc?

I don't think so. It seems like a Mesos issue. Inside a container you are not allowed to bind /proc:/proc one to one, since PIDs inside the container are not the same as on the host. That means Mesos should look for /host/proc first and fall back to /proc if it doesn't exist (or it should auto-detect whether it's running inside a container).
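A quick way to see the mismatch (a throwaway sketch using a stock Alpine image, not PanteraS-specific): host PIDs such as the 2653 from the log above simply don't exist under a container's private /proc.

# inside a normal container only its own PID namespace is visible
docker run --rm alpine ps
# with the host PID namespace shared, host PIDs appear under /proc
docker run --rm --pid=host alpine ps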

cookandy commented 8 years ago

Are you seeing similar errors in your agent logs? I am not sure if this appears with older versions. I also can't seem to find an agent option to configure this...

sielaq commented 8 years ago

Yes, I see them too when I request /monitor/statistics (it is visible on both old and new versions).

cookandy commented 8 years ago

Looks like this can be resolved by running the container with the --pid=host option. Hope this helps someone.
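For anyone hitting the same thing, a hedged sketch of the fix (the image name and remaining flags are placeholders, not the actual PanteraS invocation):

# --pid=host shares the host PID namespace with the container,
# so the /proc/<pid>/cgroup paths the agent reads actually exist
docker run -d --pid=host --net=host some/mesos-slave-image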

sielaq commented 8 years ago

I will add it as a default option then.

sielaq commented 8 years ago

It should not have an impact on the rest of the system.
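In docker-compose terms (a hypothetical snippet: the service and image names are placeholders, while pid: host is the real compose key and the /proc mapping is the one quoted above), the default could look like:

mesosslave:
  image: some/mesos-slave-image
  pid: host
  volumes:
    - /proc:/host/proc:ro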

cookandy commented 8 years ago

this should do it: https://github.com/eBayClassifiedsGroup/PanteraS/pull/230

sielaq commented 8 years ago

Heh, you were faster :) and merged.