Netflix / vector

Vector is an on-host performance monitoring framework which exposes hand picked high resolution metrics to every engineer’s browser.
http://getvector.io/
Apache License 2.0

Per-Container widgets show "No Data Available" #163

Closed sjtuhjh closed 6 years ago

sjtuhjh commented 7 years ago

The basic widgets (such as CPU, network, and memory) work fine, but all container-related widgets show "No Data Available".

For example, the following case tries to show both "Memory" and "Per-Container CPU Utilization"; the request and response are listed below. However, only "Memory" displays properly.

HTTP request: http://192.168.1.86:44323/pmapi/1107686313/_fetch?names=containers.cgroup,containers.name,cgroup.cpuacct.usage,mem.util.free

HTTP response:

{"timestamp":{"s":1501643762,"us":351117 }, "values":[{"pmid":4194310,"name":"containers.cgroup","instances":[ {"instance":0, "value":"/system.slice/docker-8b064b8d4cbc286754fcd1031d9a9008dfa5be8adc710e3fb2896cc1cd06656f.scope" },{"instance":1, "value":"/system.slice/docker-72b67e4d6de18a493e9beb3b42dafd0680df2391e3a8855e37ab83919b516f19.scope" },{"instance":2, "value":"/system.slice/docker-6b5047d5173da546e77691ea96195e7c98876bf721ec162c7fb8be9375987a5b.scope" },{"instance":3, "value":"/system.slice/docker-6f1577377ed85472fda4518b83bdd2487276cf3eae482d5b8c1f9605f6933859.scope" },{"instance":4, "value":"/system.slice/docker-f0d96149d6d4d71dec401f055340307e1b2fb4bd6a4e8453fc19e11200930355.scope" },{"instance":5, "value":"/system.slice/docker-add03b69d812c97dd227994213a9b84627b301be56ed5e468d2925360806ac0e.scope" },{"instance":6, "value":"/system.slice/docker-c4ffe01954d4c1048ec2084d97805329d6a167a900ea4743e37e78327dca725d.scope" },{"instance":7, "value":"/system.slice/docker-762d2a3cd44d40319e8cf0821fb4dc672b82864c2569de4f03e630509a34201b.scope" },{"instance":8, "value":"/system.slice/docker-e4e7d332abc5526096d7b6553f09b429481bae8def1bc5053109f1932a6aa06f.scope" },{"instance":9, "value":"/system.slice/docker-4ae462c82e03062ac8c1d06a37652033951828734a96244309450b2369dc482b.scope" },{"instance":10, "value":"/system.slice/docker-640839fd833fee1bbaef3fa85b65cc7460f38ad32c5e8216b54b034d5c09615f.scope" },{"instance":11, "value":"/system.slice/docker-2751350dfd3109734b1b76b9a516d1461459e788a3fb6c6151868311c4029b21.scope" },{"instance":12, "value":"/system.slice/docker-d8a84a531c7cca0386b7b44378c0f1556346aea369d00d26ef8c37120367b05a.scope" },{"instance":13, "value":"/system.slice/docker-01c41aecc1b3724eacfede02167ae046ed00162c4d0aba78894b0375637e73f8.scope" },{"instance":14, "value":"/system.slice/docker-bdd6dfbec46824052f85322898b6a776add3f12ac2d49e340d0ed68ef7f0a83a.scope" },{"instance":15, 
"value":"/system.slice/docker-d3bc311ea55e7f648613f75a5b0a6a4b15750cea8b0c4c1979481f67ce565706.scope" },{"instance":16, "value":"/system.slice/docker-5e6ed946fa87953709d607bc20af2b74988ed68a53f8e46a17c190682fa1fc10.scope" },{"instance":17, "value":"/system.slice/docker-16c1256c88b9ae634093376c93ede7ea1f7c0dcb0c4173f9b7044e2ac141220a.scope" },{"instance":18, "value":"/system.slice/docker-bd33ba79a1500a2e82a13401006cf158711d7ba6ec73bf3aaa0e20fef959776b.scope" },{"instance":19, "value":"/system.slice/docker-b7af500211692e5a72bfa639ec4b1486036f61b38bba2af11d545339e0071a09.scope" },{"instance":20, "value":"/system.slice/docker-bf0b7c905684ae4ea70264dc9478da3aa4957e4ddce3081cd4abb7e01889f112.scope" },{"instance":21, "value":"/system.slice/docker-0b532d67d98c8b052c482915877e6f528a89c6065f16e71095b60ab1d24f857e.scope" },{"instance":22, "value":"/system.slice/docker-e0d43f796936c746d85337e2b0e2de59d8e1252f0955bd2894d8826ad58e4372.scope" },{"instance":23, "value":"/system.slice/docker-9774384973a65e53ef91fddbcecf5b79099936b98b15aa79555584a6bfc28925.scope" },{"instance":24, "value":"/system.slice/docker-c541796508d90c431445ef52b5f7803f56d5f6520bc3bf5ab757fc26cbceaa52.scope" }]}, {"pmid":4194305,"name":"containers.name","instances":[ {"instance":0, "value":"k8s_POD_monitoring-grafana-4072518999-5fdd8_kube-system_fe7a044e-6833-11e7-a978-c0a802ee0004_0" },{"instance":1, "value":"k8s_POD_kubernetes-dashboard-3981181111-v5t64_kube-system_af9c83c0-6873-11e7-a978-c0a802ee0004_0" },{"instance":2, "value":"k8s_POD_kube-scheduler-centos_kube-system_f24680dde0a73fe24c34094896acb898_0" },{"instance":3, "value":"k8s_kube-scheduler_kube-scheduler-centos_kube-system_f24680dde0a73fe24c34094896acb898_0" },{"instance":4, "value":"k8s_kube-flannel_kube-flannel-ds-8v27n_kube-system_2f4b65ce-6832-11e7-a978-c0a802ee0004_0" },{"instance":5, "value":"k8s_grafana_monitoring-grafana-4072518999-5fdd8_kube-system_fe7a044e-6833-11e7-a978-c0a802ee0004_0" },{"instance":6, 
"value":"k8s_kubernetes-dashboard_kubernetes-dashboard-3981181111-v5t64_kube-system_af9c83c0-6873-11e7-a978-c0a802ee0004_0" },{"instance":7, "value":"k8s_install-cni_kube-flannel-ds-8v27n_kube-system_2f4b65ce-6832-11e7-a978-c0a802ee0004_0" },{"instance":8, "value":"k8s_kubedns_kube-dns-2286869516-0hz5f_kube-system_17607aee-6832-11e7-a978-c0a802ee0004_0" },{"instance":9, "value":"k8s_heapster_heapster-2843423903-b3dgm_kube-system_251b9848-6833-11e7-a978-c0a802ee0004_0" },{"instance":10, "value":"k8s_POD_kube-apiserver-centos_kube-system_c5d63b165885460721c08e062c533ea5_0" },{"instance":11, "value":"k8s_POD_etcd-centos_kube-system_b4c88f8c66b21d09a76829f5730c0b2b_0" },{"instance":12, "value":"k8s_POD_kube-proxy-4fkls_kube-system_1751dc50-6832-11e7-a978-c0a802ee0004_0" },{"instance":13, "value":"k8s_sidecar_kube-dns-2286869516-0hz5f_kube-system_17607aee-6832-11e7-a978-c0a802ee0004_0" },{"instance":14, "value":"k8s_kube-proxy_kube-proxy-4fkls_kube-system_1751dc50-6832-11e7-a978-c0a802ee0004_0" },{"instance":15, "value":"k8s_etcd_etcd-centos_kube-system_b4c88f8c66b21d09a76829f5730c0b2b_0" },{"instance":16, "value":"k8s_POD_heapster-2843423903-b3dgm_kube-system_251b9848-6833-11e7-a978-c0a802ee0004_0" },{"instance":17, "value":"k8s_kube-controller-manager_kube-controller-manager-centos_kube-system_16bd019c32d4cbbbdf6516aba0701adf_0" },{"instance":18, "value":"k8s_POD_kube-controller-manager-centos_kube-system_16bd019c32d4cbbbdf6516aba0701adf_0" },{"instance":19, "value":"k8s_POD_kube-flannel-ds-8v27n_kube-system_2f4b65ce-6832-11e7-a978-c0a802ee0004_0" },{"instance":20, "value":"k8s_POD_monitoring-influxdb-562486248-3h3dx_kube-system_2550ef2a-6833-11e7-a978-c0a802ee0004_0" },{"instance":21, "value":"k8s_dnsmasq_kube-dns-2286869516-0hz5f_kube-system_17607aee-6832-11e7-a978-c0a802ee0004_0" },{"instance":22, "value":"k8s_influxdb_monitoring-influxdb-562486248-3h3dx_kube-system_2550ef2a-6833-11e7-a978-c0a802ee0004_0" },{"instance":23, 
"value":"k8s_kube-apiserver_kube-apiserver-centos_kube-system_c5d63b165885460721c08e062c533ea5_0" },{"instance":24, "value":"k8s_POD_kube-dns-2286869516-0hz5f_kube-system_17607aee-6832-11e7-a978-c0a802ee0004_0" }]}, {"pmid":12624898,"name":"cgroup.cpuacct.usage","instances":[ {"instance":0, "value":4933172881291360 },{"instance":1, "value":129896343034780 },{"instance":2, "value":34344311615640 },{"instance":3, "value":714945122980 },{"instance":4, "value":714918888420 },{"instance":5, "value":26234560 },{"instance":6, "value":796197274900 },{"instance":7, "value":796167501180 },{"instance":8, "value":29773720 },{"instance":9, "value":17739949351600 },{"instance":10, "value":22744280 },{"instance":11, "value":16780276211800 },{"instance":12, "value":7928199924620 },{"instance":13, "value":7928177683920 },{"instance":14, "value":22240700 },{"instance":15, "value":15779240 },{"instance":16, "value":983657548940 },{"instance":17, "value":25466600 },{"instance":18, "value":983632082340 },{"instance":19, "value":371512625340 },{"instance":20, "value":756403400 },{"instance":21, "value":370734284620 },{"instance":22, "value":21937320 },{"instance":23, "value":5497990940440 },{"instance":24, "value":24815180 },{"instance":25, "value":5497966125260 },{"instance":26, "value":95552031482520 },{"instance":27, "value":3357321921020 },{"instance":28, "value":19876880 },{"instance":29, "value":3355518204280 },{"instance":30, "value":31112849747960 },{"instance":31, "value":31112829336200 },{"instance":32, "value":20411760 },{"instance":33, "value":6314037052680 },{"instance":34, "value":4344877118660 },{"instance":35, "value":341030693080 },{"instance":36, "value":26561020 },{"instance":37, "value":1628102679920 },{"instance":38, "value":51031145037860 },{"instance":39, "value":20154260 },{"instance":40, "value":51031124883600 },{"instance":41, "value":1484065684220 },{"instance":42, "value":841638212120 },{"instance":44, "value":59236030660 },{"instance":45, 
"value":4789348258625900 },{"instance":46, "value":706748098020 },{"instance":47, "value":1563100 },{"instance":48, "value":0 },{"instance":49, "value":254087712480 },{"instance":50, "value":0 },{"instance":51, "value":0 },{"instance":52, "value":0 },{"instance":53, "value":0 },{"instance":54, "value":0 },{"instance":55, "value":0 },{"instance":56, "value":0 },{"instance":57, "value":0 },{"instance":58, "value":0 },{"instance":59, "value":47681864860 },{"instance":60, "value":374043482500 },{"instance":61, "value":5654771940 },{"instance":62, "value":0 },{"instance":63, "value":819622640 },{"instance":64, "value":0 },{"instance":65, "value":0 },{"instance":66, "value":85684428284820 },{"instance":67, "value":0 },{"instance":68, "value":0 },{"instance":69, "value":4656702398478900 },{"instance":70, "value":0 },{"instance":71, "value":0 },{"instance":72, "value":0 },{"instance":73, "value":0 },{"instance":74, "value":0 },{"instance":75, "value":0 },{"instance":76, "value":0 },{"instance":77, "value":0 },{"instance":78, "value":329761660 },{"instance":79, "value":3396200 },{"instance":80, "value":0 },{"instance":81, "value":0 },{"instance":82, "value":61116673340 },{"instance":83, "value":0 },{"instance":84, "value":0 },{"instance":85, "value":0 },{"instance":86, "value":0 },{"instance":87, "value":23126317060 },{"instance":88, "value":63612599380 },{"instance":89, "value":0 },{"instance":90, "value":0 },{"instance":91, "value":406311588180 },{"instance":92, "value":1870110241300 },{"instance":93, "value":2578555477820 },{"instance":94, "value":0 },{"instance":95, "value":0 },{"instance":96, "value":0 },{"instance":97, "value":16466060 },{"instance":98, "value":9045939800140 },{"instance":99, "value":57367804500 },{"instance":100, "value":0 },{"instance":101, "value":0 },{"instance":102, "value":0 },{"instance":103, "value":0 },{"instance":104, "value":0 },{"instance":105, "value":0 },{"instance":106, "value":0 },{"instance":107, "value":0 },{"instance":108, 
"value":1573310840 },{"instance":110, "value":81827591800 }]}, {"pmid":251659266,"name":"mem.util.free","instances":[ {"instance":-1, "value":1368844 }]}]}
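For readers who want to inspect a response like the one above, it can help to index it by metric name and instance number. The following is a minimal sketch in Python, assuming only the JSON shape visible in the response; the sample is trimmed, the container id in it is shortened for illustration, and the function name `index_fetch_response` is hypothetical.

```python
import json

# Trimmed sample in the same shape as the response above.
# The container id in the cgroup path is shortened for illustration.
SAMPLE = """
{"timestamp": {"s": 1501643762, "us": 351117},
 "values": [
  {"pmid": 4194310, "name": "containers.cgroup", "instances": [
    {"instance": 0, "value": "/system.slice/docker-8b064b8d.scope"}]},
  {"pmid": 12624898, "name": "cgroup.cpuacct.usage", "instances": [
    {"instance": 0, "value": 4933172881291360},
    {"instance": 1, "value": 129896343034780}]},
  {"pmid": 251659266, "name": "mem.util.free", "instances": [
    {"instance": -1, "value": 1368844}]}
 ]}
"""


def index_fetch_response(text):
    """Return {metric name: {instance number: value}} for a _fetch body."""
    body = json.loads(text)
    return {
        metric["name"]: {
            inst["instance"]: inst["value"] for inst in metric["instances"]
        }
        for metric in body["values"]
    }


metrics = index_fetch_response(SAMPLE)
```

In the full response, `containers.cgroup` and `containers.name` each list 25 instances while `cgroup.cpuacct.usage` lists many more, so the instance numbers of the two groups should not be assumed to refer to the same objects.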

spiermar commented 7 years ago

That's interesting. It seems the values are being returned from PCP, so the problem is likely in one of the container id/name checks. Do you have a specific container selected in the drop-down?

spiermar commented 7 years ago

Vector checks a few things before adding container data to the widget. First, it checks whether it has metadata for that container: https://github.com/Netflix/vector/blob/master/src/app/components/containermetadata/containermetadata.service.js#L78

Then it checks whether a container was selected in the drop-down: https://github.com/Netflix/vector/blob/master/src/app/components/containermetadata/containermetadata.service.js#L187

And if a container filter was applied: https://github.com/Netflix/vector/blob/master/src/app/components/containermetadata/containermetadata.service.js#L179

Judging by the response, if no filter is applied and no container name is selected, I believe it's a bug in how Vector handles the Kubernetes container ids, specifically an issue parsing the /system.slice/docker-*.scope format.
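The three checks described above can be sketched roughly as follows. This is a Python illustration only, not Vector's actual JavaScript: the function name, the parameters, and the choice to match the filter as a substring of the container name are all assumptions.

```python
def keep_instance(container_id, metadata, selected=None, name_filter=""):
    """Decide whether a metric instance belongs in a container widget.

    Hypothetical helper mirroring the three checks described in the thread:
    metadata lookup, drop-down selection, and container filter.
    """
    name = metadata.get(container_id)
    if name is None:
        # 1. No metadata for this container id: nothing to display.
        return False
    if selected is not None and container_id != selected:
        # 2. A specific container is selected and this is not it.
        return False
    if name_filter and name_filter not in name:
        # 3. A filter is applied and the container name does not match.
        return False
    return True
```

Under this sketch, an id that never resolves to a metadata entry is rejected at the first check regardless of the drop-down or filter settings, which would be consistent with every container widget showing no data.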

sjtuhjh commented 7 years ago

@spiermar, I have tried both "all" and a specific container in the drop-down, but neither works properly.

spiermar commented 7 years ago

@sjtuhjh I think I know where the problem is, so I'll try to get a fix by the end of the week. Once I do, would you be able to build Vector from master and check it?

sjtuhjh commented 7 years ago

@spiermar Sure! Please just let me know when you have done it.

sjtuhjh commented 7 years ago

@spiermar, have you fixed this issue?

spiermar commented 7 years ago

@sjtuhjh Not yet. Couldn't get the time yet.

spiermar commented 6 years ago

Fixed in 1520b6230ed7cc770d1a63b212a6ca3310c3b02c.

Right now the metric filtering mechanism is not ideal: it relies on matching the metric instance iname to a cgroup id. Since the iname format varies from one container solution to another, Vector has to try a few known formats and hope for a match. If it's not a known format (your case), there will be no matches and no data for the widgets.
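The "try a few known formats" approach can be sketched as below. This is a Python illustration under stated assumptions, not the code from the fix: the pattern list is hypothetical, the first two layouts are taken from outputs in this thread, and the third (`/docker/<id>`) is assumed as another common Docker layout.

```python
import re

# Hypothetical list of known iname layouts; real deployments may use others.
KNOWN_FORMATS = [
    # systemd-managed Docker scope, as in the Kubernetes response above
    re.compile(r"^/system\.slice/docker-([0-9a-f]{64})\.scope$"),
    # layout shown in the pminfo output later in this thread
    re.compile(r"^/containers\.slice/([0-9a-f]{64})$"),
    # assumed: plain Docker cgroup layout
    re.compile(r"^/docker/([0-9a-f]{64})$"),
]


def container_id_from_iname(iname):
    """Return the container id embedded in a cgroup iname, or None."""
    for pattern in KNOWN_FORMATS:
        match = pattern.match(iname)
        if match:
            return match.group(1)
    # Unknown layout: the caller gets no match and the widget gets no data.
    return None
```

The weakness is visible in the last line: any layout missing from the list silently yields no match, which is the failure mode reported in this issue.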

Ideally we should have a way of resolving the cgroup id from the metric iname, or have PCP return the cgroup id as the iname for all cgroup.* metric instances.

If you're still facing the same issue, please open another bug and paste the output of the following command:

pminfo --fetch cgroup.cpuacct.usage

natoscott commented 6 years ago

@spiermar in pcp-4.0.0 we're adding a proc.id.container metric, which provides a mapping of each process to a container identifier (or empty string if not in a container). It would be a straightforward extension to add a cgroup.id.container metric which maps cgroups to container identifiers ... would that suffice for Vector's needs here?

The process map looks like this for example...

$ pminfo --fetch proc.id.container

proc.id.container
    inst [1 or "000001 /usr/lib/systemd/systemd"] value ""
    inst [2 or "000002 (kthreadd)"] value ""
    [...]
    inst [23819 or "023819 /bin/bash"] value "89c022298234d1672b6ce980bd2d243f683129d420e57c2b200522982a89d901"

$ pminfo --fetch containers.name

containers.name
    inst [0 or "cecbc18de74210533c6420e7db83b97c0fc83797fd413f24d1c0902765ad0e54"] value "gracious_lamport"
    [...]
    inst [6 or "89c022298234d1672b6ce980bd2d243f683129d420e57c2b200522982a89d901"] value "kickass_darwin"

spiermar commented 6 years ago

I'm not sure. The inames I'm getting back are not process names I'm afraid:

(root) ~ # pminfo --fetch cgroup.cpuacct.usage

cgroup.cpuacct.usage
    inst [0 or "/"] value 397791829816031
    inst [1 or "/containers.slice"] value 315276039880779
    inst [2 or "/containers.slice/f0c5f7e08857622beee10d890451975fab6fc7fcc5201b50be69aaaead58175e"] value 116292154363507
    inst [3 or "/containers.slice/e8f630e6a9688444ac037a4740e8194ae098e4e211c9eb4153e7addf411c9632"] value 3653554698
    inst [4 or "/containers.slice/c3b5a3e452f4662ada9e51b70a17fc06420ad7f4d214fd10cccd28141f723218"] value 3940084417366
    inst [5 or "/containers.slice/6e9091b498a9b30a40c2c4b9a2cd52011f0bb6b74be7eb7e9c4c34fa13f83c04"] value 627030911
    inst [6 or "/containers.slice/9c8249cd6efa7356795177e83078e84eb18cc582996d2f42b061fbfe6916923b"] value 26546926583800
    inst [7 or "/containers.slice/384aaf35bda9360c2d020de6e4cbfb492c694a2575e6f9205f71d0092b06cf45"] value 188972410329
    inst [8 or "/containers.slice/cacd7ae56f9735710723cb1dcac18a5d4ad30308f3af1f0c7a3e5036a1aa6d23"] value 28755269869387
    inst [9 or "/containers.slice/5e99b94082c42263026b7324e8848d1a547cf0681ce7907b8a8dfe7ca007bd1f"] value 3591267944
    inst [10 or "/containers.slice/ebde8ad364baa72729943991c4c06177d71ba89e41f6d32e58b78983334ecb14"] value 3689032371
    inst [11 or "/containers.slice/a25b703d7f62584ca3f264c69a713b5183cecb3081bef370ba198616a5859f78"] value 114204317
    inst [12 or "/containers.slice/ef0aa6ca78f5ef63b08d7be04cc53966183c90a2bc145f6df197c389e9d3fb1e"] value 854105776
    inst [13 or "/containers.slice/a425347a2f8127531543bc28f53bcd6dc8ecf096c529477a8b97735d1e7b206d"] value 5981282187709

The container id is there, but since the format can change, the feature can break again in the future. I'm also not a big fan of parsing strings and trying to match them to something.

natoscott commented 6 years ago

@spiermar yep, that's the idea - no string parsing anymore - the translation from cgroup name to container name is done in PCP and we could export it via cgroup.id.container

In your example above, yes those are not process names - they're the cgroup names exported from the kernel, and always will be for that metric. However, we can externalize the mapping that is being done in PCP for your convenience - it would look something like...

$ pminfo --fetch cgroup.cpuacct.usage

cgroup.cpuacct.usage
    inst [0 or "/"] value 397791829816031
    inst [1 or "/containers.slice"] value 315276039880779
    inst [2 or "/containers.slice/f0c5f7e08857622beee10d890451975fab6fc7fcc5201b50be69aaaead58175e"] value 116292154363507
    [...]

$ pminfo --fetch cgroup.id.container
    inst [0 or "/"] value ""
    inst [1 or "/containers.slice"] value ""
    inst [2 or "/containers.slice/f0c5f7e08857622beee10d890451975fab6fc7fcc5201b50be69aaaead58175e"] value "f0c5f7e08857622beee10d890451975fab6fc7fcc5201b50be69aaaead58175e"
    [...]

$ pminfo --fetch containers.name
    inst [0 or "f0c5f7e08857622beee10d890451975fab6fc7fcc5201b50be69aaaead58175e"] value "gracious_lamport"
    [...]

(output for cgroup.cpuacct.usage remains the same, as in your example above)
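With a mapping metric like the one proposed above, a client could join the three result sets by key instead of parsing cgroup names. The following is a minimal sketch in Python, assuming each metric has already been fetched into a dictionary keyed by instance name; the function and variable names are hypothetical, and the sample values are the ones from the example output.

```python
CID = "f0c5f7e08857622beee10d890451975fab6fc7fcc5201b50be69aaaead58175e"

# cgroup.cpuacct.usage: cgroup iname -> cumulative CPU usage
usage = {
    "/": 397791829816031,
    "/containers.slice": 315276039880779,
    "/containers.slice/" + CID: 116292154363507,
}
# proposed mapping metric: cgroup iname -> container id ("" if none)
cgroup_to_container = {
    "/": "",
    "/containers.slice": "",
    "/containers.slice/" + CID: CID,
}
# containers.name: container id -> container name
container_names = {CID: "gracious_lamport"}


def usage_by_container(usage, cgroup_to_container, container_names):
    """Return {container name: usage}, skipping cgroups with no container."""
    result = {}
    for iname, value in usage.items():
        cid = cgroup_to_container.get(iname, "")
        if not cid:
            continue  # cgroup is not a container (for example "/")
        # Fall back to the raw id if no name is known for it.
        result[container_names.get(cid, cid)] = value
    return result


per_container = usage_by_container(usage, cgroup_to_container, container_names)
```

No pattern matching on the iname is involved, so a change in the cgroup naming scheme would not affect the join as long as PCP keeps the mapping up to date.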

spiermar commented 6 years ago

That should solve the issue. I'll add a note to check this again once pcp-4.0.0 is out. Thanks @natoscott

natoscott commented 6 years ago

@spiermar this is implemented now in pcp master branch.

It turned out one thing I said earlier was not quite correct. The metrics need to be specific to the various cgroup types - IOW, "cgroup.id.container" from my earlier post turns into:

$ pminfo cgroup | grep container
cgroup.cpuset.id.container
cgroup.cpuacct.id.container
cgroup.cpusched.id.container
cgroup.memory.id.container
cgroup.netclass.id.container
cgroup.blkio.id.container
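Given the per-subsystem names listed above, a client could derive the mapping metric that belongs to a given cgroup metric. The following is a small Python sketch of that lookup; the helper name is hypothetical, and it assumes the mapping metric always sits at `cgroup.<subsystem>.id.container`, which the thread confirms only for the six subsystems shown.

```python
# Subsystems for which an id.container metric is listed in the thread.
SUBSYSTEMS = {"cpuset", "cpuacct", "cpusched", "memory", "netclass", "blkio"}


def id_container_metric(metric):
    """Map e.g. 'cgroup.cpuacct.usage' to 'cgroup.cpuacct.id.container'."""
    parts = metric.split(".")
    if len(parts) < 3 or parts[0] != "cgroup" or parts[1] not in SUBSYSTEMS:
        raise ValueError("not a known cgroup subsystem metric: " + metric)
    return ".".join(parts[:2] + ["id", "container"])
```

For the metric discussed in this issue, `id_container_metric("cgroup.cpuacct.usage")` gives `cgroup.cpuacct.id.container`.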