HewlettPackard / lustre_exporter

Prometheus exporter for use with the Lustre parallel filesystem
Apache License 2.0

Getting messages "was collected before with the same name and label values" #135

Open Zubrania opened 6 years ago

Zubrania commented 6 years ago

Hello

I am getting messages like "was collected before with the same name and label values" in /var/log/messages, and the file is constantly growing.

[root]# lustre_exporter --version
lustre_exporter, version 2.0.0 (branch: HEAD, revision: 61775378252b23c794e6725445e2bf0a620d9027)
  build user: prometheus@rc-lustre-oss-2.dev.net
  build date: 20171205-22:27:17
  go version: go1.8.3

joehandzik commented 6 years ago

@Zubrania Do you have any information about your cluster configuration? Is this happening on all nodes, or just some of them? Also, what version of Lustre are you using? Older versions are pretty tied to the 1.0.0 release that we have.

Zubrania commented 6 years ago

@joehandzik The cluster consists of 2 MGS/MDS servers and 4 OSS servers. The Lustre version is 2.10.3, on a RHEL 7.4 based system.

erijpkema commented 6 years ago

I'm also seeing this with Lustre version 2.10.4 on RHEL 7.5. We're running 3 filesystems with 2 MGS/MDS and 4 OSS servers each. Example from an OST:

>  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"destroy" > label:<name:"target" value:"dh2-OST0017" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"create" > label:<name:"target" value:"dh2-OST0017" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"get_info" > label:<name:"target" value:"dh2-OST0017" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"set_info" > label:<name:"target" value:"dh2-OST0017" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"quotactl" > label:<name:"target" value:"dh2-OST0017" > counter:<value:0 >  was collected before with the same name and label values

And from an MDT:

0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"getxattr" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"setxattr" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"statfs" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"sync" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"samedir_rename" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"crossdir_rename" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values

I'm running the exporter with the following command:

/usr/local/prometheus/lustre_exporter --collector.ost=core --collector.mdt=core --collector.mgs=extended --collector.generic=core

It was built today from the git source in the golang:1.9-stretch Docker image. (Docker was only used for building.)

wutz commented 5 years ago

I have noticed that the metric label is jobid=.0, which causes the reported error; the process name is missing.
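
To illustrate the observation above: the Go Prometheus client reports this error when a collector emits two samples with the same metric name and identical label values. The following is a minimal sketch, assuming (hypothetically) that two distinct job_stats entries both end up labelled jobid=".0" because their process name was lost. The sample data and the helper names `series_key` and `find_duplicate_series` are illustrative and are not taken from the exporter.

```python
from collections import Counter

def series_key(name, labels):
    """Build a hashable identity for a sample: metric name plus sorted label pairs."""
    return (name, tuple(sorted(labels.items())))

def find_duplicate_series(samples):
    """Return the series identities that occur more than once in one scrape."""
    counts = Counter(series_key(name, labels) for name, labels, _value in samples)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical scrape: two different jobs whose process name was dropped,
# so both are labelled jobid=".0" (illustrative data, not real exporter output).
samples = [
    ("lustre_job_stats_total",
     {"component": "ost", "jobid": ".0", "operation": "destroy", "target": "dh2-OST0017"}, 0),
    ("lustre_job_stats_total",
     {"component": "ost", "jobid": ".0", "operation": "destroy", "target": "dh2-OST0017"}, 0),
    ("lustre_job_stats_total",
     {"component": "ost", "jobid": ".0", "operation": "create", "target": "dh2-OST0017"}, 0),
]

duplicates = find_duplicate_series(samples)
for name, labels in duplicates:
    print(name, dict(labels))
```

Only the "destroy" series is reported here, because it is the only one emitted twice with the same label set; the "create" series appears once and is fine.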

ldd91 commented 5 years ago

I'm also seeing this with Lustre version 2.12.0 on RHEL 7.5, but only on the MDS node; the OSS node is normal.

ldd91 commented 5 years ago

@wutzx Have you solved this problem?

wutz commented 5 years ago

@ldd91 You can clone https://github.com/wutzx/lustre_exporter, which includes my PR, and build it.

ldd91 commented 5 years ago

@wutzx I cloned your PR and built it, but hit an error:

[root@k8sv2node1 lustre_exporter]# make
formatting code
linting code
WARNING: Linters are now vendored by default, --update ignored. The original behaviour can be re-enabled with --no-vendored-linters.

To request an update for a vendored linter file an issue at: https://github.com/alecthomas/gometalinter/issues/new

WARNING: deadline exceeded by linter vetshadow (try increasing --deadline)
WARNING: deadline exceeded by linter varcheck (try increasing --deadline)
WARNING: deadline exceeded by linter interfacer (try increasing --deadline)
make: *** [gometalinter] Error 2
[root@k8sv2node1 lustre_exporter]# ll
total 388
-rw-r--r-- 1 root root 18526 Mar 21 17:51 CHANGELOG.md
-rw-r--r-- 1 root root 2428 Mar 21 17:51 Gopkg.lock
-rw-r--r-- 1 root root 731 Mar 21 17:51 Gopkg.toml
-rw-r--r-- 1 root root 11357 Mar 21 17:51 LICENSE
-rw-r--r-- 1 root root 6488 Mar 21 17:51 lustre_exporter.go
-rw-r--r-- 1 root root 312818 Mar 21 17:51 lustre_exporter_test.go
-rw-r--r-- 1 root root 2051 Mar 21 17:51 Makefile
drwxr-xr-x 4 root root 4096 Mar 21 17:51 proc
-rw-r--r-- 1 root root 2896 Mar 21 17:51 README.md
drwxr-xr-x 2 root root 4096 Mar 21 17:51 sources
drwxr-xr-x 3 root root 4096 Mar 21 17:51 sys
drwxr-xr-x 2 root root 4096 Mar 21 17:51 systemd
drwxr-xr-x 5 root root 4096 Mar 21 17:51 vendor
-rw-r--r-- 1 root root 6 Mar 21 17:51 VERSION

wutz commented 5 years ago

You can execute make build to skip the linter check.

ldd91 commented 5 years ago

@wutzx Thank you, I tried it and it works.

lszentannai commented 5 years ago

Hi,

I still have this problem using commit 61775378252b23c794e6725445e2bf0a620d9027, running Lustre 2.10.5 on CentOS 7.5. It happens regardless of whether I use procname_uid or SLURM_JOB_ID.

Is there any fix for this problem?

Thanks, Lorand Szentannai

wutz commented 5 years ago

@lszentannai You can try my fork https://github.com/wutz/lustre_exporter

lszentannai commented 5 years ago

@wutz Thanks for the quick reply. I did try your fork too, with the same result.

wutz commented 5 years ago

You can run grep job_id /proc/fs/lustre/obdfilter/*/job_stats to list all job IDs, and check whether they match this regexp:

https://github.com/HewlettPackard/lustre_exporter/pull/137/files#diff-fde95e813ded08bf1be0acad8e83c4cfR665
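
The check described above can also be scripted. This is a minimal sketch, assuming the job_stats text carries one "job_id:" line per entry (as the grep command implies). The pattern `CANDIDATE_JOBID` is a hypothetical placeholder, not the regular expression from the linked PR, and should be replaced with the real one. The sample text and the helper names are illustrative.

```python
import re
from collections import Counter

# Hypothetical placeholder pattern; replace it with the expression used by the exporter.
CANDIDATE_JOBID = re.compile(r"[A-Za-z0-9_]+\.\d+")

def extract_job_ids(job_stats_text):
    """Collect the value of every 'job_id:' line, in file order."""
    ids = []
    for line in job_stats_text.splitlines():
        # Drop indentation and the leading '-' list marker, if present.
        stripped = line.strip().lstrip("-").strip()
        if stripped.startswith("job_id:"):
            ids.append(stripped[len("job_id:"):].strip())
    return ids

def check_job_ids(job_ids, pattern=CANDIDATE_JOBID):
    """Return (ids that do not fully match the pattern, ids that appear more than once)."""
    unmatched = [j for j in job_ids if not pattern.fullmatch(j)]
    repeated = [j for j, n in Counter(job_ids).items() if n > 1]
    return unmatched, repeated

# Illustrative input; on a server this would be the contents of a job_stats file.
sample = """job_stats:
- job_id:          bash.0
  snapshot_time:   1556622106
- job_id:          kworker/0:1.0
  snapshot_time:   1556622107
- job_id:          293395
  snapshot_time:   1556622108
"""

job_ids = extract_job_ids(sample)
unmatched, repeated = check_job_ids(job_ids)
print("job ids:", job_ids)
print("not matching the pattern:", unmatched)
print("repeated:", repeated)
```

Any job ID listed as not matching would be a candidate for the kind of truncated or empty jobid label discussed earlier in this thread; a job ID listed as repeated within one file would by itself produce identical label sets.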

lszentannai commented 5 years ago

It looks like it's not matching the last jobid. Changing the regex to (?ms:job_id:.*?(-|\\z|$)) makes it match, but that doesn't help.
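
One property of the modified pattern may be relevant here: with the m flag, $ matches at the end of every line, so the lazy .*? stops at the end of the "job_id:" line and the rest of the entry is no longer part of the match. The sketch below reproduces that behaviour in Python, where \Z plays the role of Go's \z. The job_stats excerpt is illustrative, and treating the Python rendering as equivalent to the Go pattern is an assumption; this is not the exporter's code.

```python
import re

# Python rendering of the pattern quoted above; \Z stands in for Go's \z.
PATTERN = re.compile(r"job_id:.*?(-|\Z|$)", re.MULTILINE | re.DOTALL)

# Illustrative job_stats excerpt with two entries.
sample = (
    "- job_id:          293395\n"
    "  snapshot_time:   1556622106\n"
    "  punch:           { samples: 0, unit: reqs }\n"
    "- job_id:          289305\n"
    "  snapshot_time:   1556622107\n"
    "  punch:           { samples: 3, unit: reqs }\n"
)

# Each match is the full text consumed by the pattern for one entry.
matches = [m.group(0) for m in PATTERN.finditer(sample)]
for text in matches:
    print(repr(text))
```

Both entries are found, including the last one, but each match contains only the "job_id:" line itself and none of the counters that follow it. Whether this is the reason the change "won't help" in the exporter is not confirmed by the thread.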

I get the same messages again, like:

Apr 30 13:01:46 oss-1 lustre_exporter[23544]: ter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"293395\" > label:<name:\"operation\" value:\"punch\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"293395\" > label:<name:\"operation\" value:\"destroy\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"293395\" > label:<name:\"operation\" value:\"create\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"293395\" > label:<name:\"operation\" value:\"get_info\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"293395\" > label:<name:\"operation\" value:\"set_info\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"293395\" > label:<name:\"operation\" value:\"quotactl\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"289305\" > label:<name:\"operation\" 
value:\"getattr\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"289305\" > label:<name:\"operation\" value:\"setattr\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"289305\" > label:<name:\"operation\" value:\"statfs\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"289305\" > label:<name:\"operation\" value:\"sync\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"289305\" > label:<name:\"operation\" value:\"punch\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"289305\" > label:<name:\"operation\" value:\"destroy\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"289305\" > label:<name:\"operation\" value:\"create\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"289305\" > 
label:<name:\"operation\" value:\"get_info\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"289305\" > label:<name:\"operation\" value:\"set_info\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"289305\" > label:<name:\"operation\" value:\"quotactl\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"290119\" > label:<name:\"operation\" value:\"getattr\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"290119\" > label:<name:\"operation\" value:\"setattr\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"290119\" > label:<name:\"operation\" value:\"statfs\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"290119\" > label:<name:\"operation\" value:\"sync\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" 
value:\"290119\" > label:<name:\"operation\" value:\"punch\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"290119\" > label:<name:\"operation\" value:\"destroy\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"290119\" > label:<name:\"operation\" value:\"create\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"290119\" > label:<name:\"operation\" value:\"get_info\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"290119\" > label:<name:\"operation\" value:\"set_info\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"290119\" > label:<name:\"operation\" value:\"quotactl\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:\"component\" value:\"ost\" > label:<name:\"jobid\" value:\"290074\" > label:<name:\"operation\" value:\"getattr\" > label:<name:\"target\" value:\"scratch-OST0008\" > counter: was collected before with the same name

gabrieleiannetti commented 2 years ago

The situation has improved for jobstats that do not have any UID set:

...
lustre_job_read_samples_total{component="ost",jobid="loop4",target="hebe-OST0263"} 575
lustre_job_read_samples_total{component="ost",jobid="loop4..0",target="hebe-OST0263"} 1395
lustre_job_read_samples_total{component="ost",jobid="loop4.0",target="hebe-OST0263"} 2.989077e+06
lustre_job_read_samples_total{component="ost",jobid="loop4.00",target="hebe-OST0263"} 408
lustre_job_read_samples_total{component="ost",jobid="loop40",target="hebe-OST0263"} 546
lustre_job_read_samples_total{component="ost",jobid="loop40.",target="hebe-OST0263"} 136
lustre_job_read_samples_total{component="ost",jobid="loop40.0",target="hebe-OST0263"} 3.263271e+06
lustre_job_read_samples_total{component="ost",jobid="loop400",target="hebe-OST0263"} 157
lustre_job_read_samples_total{component="ost",jobid="loop4000",target="hebe-OST0263"} 15
...

https://github.com/GSI-HPC/lustre_exporter/commit/6adb8e00676a5b05d86ed9824c03ebda71fb5e62