canonical / prometheus-hardware-exporter

Prometheus Hardware Exporter is an exporter for Hardware Observer
GNU General Public License v3.0
7 stars 9 forks source link

Invalid session resource errors cause CollectorFailed alerts #76

Open peterctl opened 3 weeks ago

peterctl commented 3 weeks ago

Using the hardware-observer charm, the CollectorFailed alerts fire multiple times a day, generating a lot of noise for the alerts on-call. The alerts seem to match with these errors showing up in the systemd service logs:

ubuntu@hostname:~$ journalctl -u hardware-exporter.service | grep ERROR | tail -n 10
Jun 25 16:54:35 hostname python3[$PID]: 2024-06-25 16:54:35 ERROR Invalid session resource: /redfish/v1/SessionService/Sessions/102390, return code: 401
aieri commented 1 week ago

we should look into having the collector try to refresh the session a few times before logging the error - i.e. those log lines should probably not make it to syslog at all

peterctl commented 21 hours ago

Here are some more logs around the error message for extra context.

# journalctl -u hardware-exporter.service | grep -B30 ERROR
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Login returned code 201: {"@Message.ExtendedInfo":[{"Message":"The resource has been created successfully.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"Base.1.12.Created","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"None.","Severity":"OK"},{"Message":"A new resource is successfully created.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"IDRAC.2.8.SYS414","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"No response action is required.","Severity":"Informational"}],"@odata.context":"/redfish/v1/$metadata#Session.Session","@odata.id":"/redfish/v1/SessionService/Sessions/132562","@odata.type":"#Session.v1_6_0.Session","ClientOriginIPAddress":"IP_FOR_BOND_M","CreatedTime":"2024-07-16T13:59:10-05:00","Description":"User Session","Id":"132562","Name":"User Session","Password":null,"SessionType":"Redfish","UserName":"root"}
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting redfish sensor info...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/Systems/System.Embedded.1/Memory/DIMM.Socket.A4: 0.1320378198288381 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting smart storage health data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.23425848316401243 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting processor data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.01082204794511199 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting storage controller data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.010600307956337929 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting network adapter data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.0182824837975204 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting chassis data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.4267375646159053 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting storage drive data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.03342375298961997 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting memory dimm data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.01948211621493101 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting smart storage health data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.09171967394649982 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/SessionService/Sessions/132562
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for DELETE to /redfish/v1/SessionService/Sessions/132562: 0.045607013162225485 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 ERROR Invalid session resource: /redfish/v1/SessionService/Sessions/132562, return code: 401
gabrielcocenza commented 14 hours ago

Apparently, redfish is failing to delete the session during the logout with 401 (UNAUTHORIZED). It's very strange because some milliseconds before it was able to fetch information without problem. I'm afraid we'll need to have a try except during the logout and handle this situation.