Open peterctl opened 3 weeks ago
we should look into having the collector try to refresh the session a few times before logging the error - i.e. those log lines should probably not make it to syslog at all
Here are some more logs around the error message for extra context.
# journalctl -u hardware-exporter.service | grep -B30 ERROR
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Login returned code 201: {"@Message.ExtendedInfo":[{"Message":"The resource has been created successfully.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"Base.1.12.Created","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"None.","Severity":"OK"},{"Message":"A new resource is successfully created.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"IDRAC.2.8.SYS414","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"No response action is required.","Severity":"Informational"}],"@odata.context":"/redfish/v1/$metadata#Session.Session","@odata.id":"/redfish/v1/SessionService/Sessions/132562","@odata.type":"#Session.v1_6_0.Session","ClientOriginIPAddress":"IP_FOR_BOND_M","CreatedTime":"2024-07-16T13:59:10-05:00","Description":"User Session","Id":"132562","Name":"User Session","Password":null,"SessionType":"Redfish","UserName":"root"}
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting redfish sensor info...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/Systems/System.Embedded.1/Memory/DIMM.Socket.A4: 0.1320378198288381 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting smart storage health data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.23425848316401243 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting processor data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.01082204794511199 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting storage controller data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.010600307956337929 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting network adapter data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.0182824837975204 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting chassis data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.4267375646159053 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting storage drive data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.03342375298961997 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting memory dimm data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.01948211621493101 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Getting smart storage health data...
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for GET to /redfish/v1/: 0.09171967394649982 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Attempt 1 of /redfish/v1/SessionService/Sessions/132562
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 INFO Response Time for DELETE to /redfish/v1/SessionService/Sessions/132562: 0.045607013162225485 seconds.
Jul 16 13:59:30 hostname python3[3681]: 2024-07-16 13:59:30 ERROR Invalid session resource: /redfish/v1/SessionService/Sessions/132562, return code: 401
Apparently, redfish is failing to delete the session during the logout with 401 (UNAUTHORIZED). It's very strange because some milliseconds before it was able to fetch information without problem. I'm afraid we'll need to have a try except during the logout and handle this situation.
Using the
hardware-observer
charm, theCollectorFailed
alerts fire multiple times a day, generating a lot of noise for the alerts on-call. The alerts seem to match with these errors showing up in the systemd service logs: