carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License

arm-exporter: error reporting GPU temp #73

Closed pascal71 closed 4 years ago

pascal71 commented 4 years ago

All arm-exporter pods in the DaemonSet are up and running; however, the logs show that the exporter is unable to access the GPU. The resulting log entries look like this:

time="2020-07-03T11:21:44Z" level=error msg="gpu collector failed after 0.003312s: exit status 255" source="collector.go:142"

Running strace on the rpi_exporter binary shows that it tries to access /dev/vchiq, which is not present in /dev in the arm-exporter container of this pod.

Changing the securityContext to privileged for this pod (e.g. by editing arm-exporter-daemonset.yaml) fixes the problem.

Kind regards,

Pascal van Dam

carlosedp commented 4 years ago

Would you submit a PR? What changes did you make to the manifests?

carlosedp commented 4 years ago

Can you check whether mounting /dev from the host into the pod fixes the problem?
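
For reference, a host-device mount of the kind asked about here could be sketched roughly as follows. This is only an illustration and is not taken from the repository's manifests: the volume name `vchiq` is hypothetical, and mounting the single device node /dev/vchiq instead of the whole of /dev is an assumption. The container name and image are the ones that appear elsewhere in this thread.

```yaml
# Hypothetical DaemonSet fragment (not from the repository's manifests).
spec:
  template:
    spec:
      containers:
      - name: arm-exporter
        image: carlosedp/arm_exporter:latest
        volumeMounts:
        - name: vchiq              # assumed volume name
          mountPath: /dev/vchiq    # device that rpi_exporter tries to open
      volumes:
      - name: vchiq
        hostPath:
          path: /dev/vchiq
          type: CharDevice         # the path must already exist on the host as a character device
```

As reported further down in this thread, a mount like this was not expected to be sufficient on its own; the privileged security context was still needed.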

pascal71 commented 4 years ago

Will also do that :)

Will report back this evening, ok?

pascal71 commented 4 years ago

Good evening Carlos,

I am afraid that to mount /dev in a container you will still need the 'privileged' security context. And if you use that context, there is no need to mount /dev, as it already works then.

E.g.:

          containers:
          - command:
            - /bin/rpi_exporter
            - --web.listen-address=127.0.0.1:9243
            image: carlosedp/arm_exporter:latest
            name: arm-exporter
            securityContext:
              privileged: true
            resources:
              limits:
                cpu: 100m
                memory: 100Mi
              requests:
                cpu: 50m
                memory: 50Mi

Kind regards,

  Pascal van Dam

carlosedp commented 4 years ago

Yes, I think it's necessary since the utility that reads the temperature requires the device. I'd welcome a PR! Thanks

pascal71 commented 4 years ago

Good afternoon,

Thank you for your reply.

I will post one this evening.

One other small question:

I also have an ARM64 cluster (Raspberry Pi, 8 GB) running; that one DOES report the CPU temperature (in privileged mode) but not the GPU temperature.

Any idea?

carlosedp commented 4 years ago

No idea, I don't have the newer ones. I believe the vcgencmd utility used by rpi_exporter doesn't support the new SoC in the RPi4. Maybe Lukas from the rpi_exporter project can help: https://github.com/lukasmalkmus/rpi_exporter

pascal71 commented 4 years ago

Good afternoon Carlos,

The arm7hf (32-bit) build does work for both CPU and GPU on the RPi4. On ARM64 it works only for CPU.

I will contact Lukas. :)

Many thanks for your support.

Kind regards,

  Pascal van Dam

carlosedp commented 4 years ago

Closing this as it's not a monitoring stack issue.

pascal71 commented 3 years ago

I will submit a PR for this. The solution is to add:

container.mixin.securityContext.withPrivileged(true)

to

arm_exporter.jsonnet

carlosedp commented 3 years ago

Great, thanks for the find!