door7302 / openjts

The Open Juniper Telemetry Stack Project
BSD 3-Clause "New" or "Revised" License

Push hardware information to OpenJTS #39

Open nguyenduchoa37 opened 2 months ago

nguyenduchoa37 commented 2 months ago

Hi.

I installed OpenJTS to monitor an MX960 with the Health Monitoring Profile. As a test, I shut down an FPC on the MX960 (using the command request chassis fpc slot X offline), but I did not see any alarm or warning in the Grafana web GUI, even after 4-5 minutes. Is there a way to detect hardware errors quickly with OpenJTS?

door7302 commented 2 months ago

Hello,

Could you please share the Junos version and the card model?

David

nguyenduchoa37 commented 2 months ago

Hi.

I am testing on an MX960 running Junos 20.4R3-S8.1 with an MPC10E. If this error does appear in Grafana, what kind of log entry is it? And how many seconds after the card goes down on the box will this log show up?

door7302 commented 2 months ago

I believe manually shutting down an MPC is not considered an error. If you want, I can provide a command to simulate a hardware error in your lab.

nguyenduchoa37 commented 2 months ago

Yes, please share that command. Also, would unplugging the FPC be a valid test?

door7302 commented 2 months ago

FOR LAB ONLY

1/ start shell pfe network fpcX.0 <<< X = slot number

2/ show cmerror module <<<< Identify the module ID for “Storage device” - in my case this is 5

3/ show cmerror module 5

Error-id  PFE  Level  Threshold  Count  Occured  Cleared  Last-occurred(ms ago)  Name
0x2c0002  0    Major  1          0      0        0        0                      CPU_CMERROR_STORAGE_MSATA_DISABLED
0x2c0001  0    Minor  1          0      0        0        0                      CPU_CMERROR_STORAGE_SMARTD_ERROR
0x2c0003  0    Minor  1          0      0        0        0                      CPU_CMERROR_STORAGE_ACCESS_ERROR

Pick the hexadecimal Error-id of a Major error, together with its name, and simulate the error:

4/ test cmerror trigger-error 0x2c0002 0 CPU_CMERROR_STORAGE_MSATA_DISABLED 5

5/ exit

Now you should see a MAJOR ALARM

6/ regress@rtme-mx-25> show chassis alarms
3 alarms currently active
Alarm time               Class  Description
2024-06-28 06:30:54 PDT  Major  FPC 2 Major Errors

On openJTS you should see:

image

To clear the alarm you need to reboot
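
For anyone scripting steps 3 and 4 above in a lab, here is a minimal Python sketch that parses the `show cmerror module 5` table and builds the matching `test cmerror trigger-error` command string. It only builds the string and does not connect to any device. The argument order (error ID, PFE, name, module ID) is an assumption inferred from the single example in step 4, and the names `CmError`, `parse_cmerror_rows` and `trigger_commands` are hypothetical helpers, not part of OpenJTS or Junos.

```python
import re
from typing import List, NamedTuple


class CmError(NamedTuple):
    error_id: str  # hexadecimal ID, e.g. "0x2c0002"
    pfe: int
    level: str     # "Major" or "Minor" in the sample output
    name: str


# One table row: hex error ID, PFE, level, five numeric columns
# (Threshold, Count, Occured, Cleared, Last-occurred), then the name.
ROW = re.compile(r"^(0x[0-9a-fA-F]+)\s+(\d+)\s+(\w+)\s+(?:\d+\s+){5}(\S+)$")


def parse_cmerror_rows(output: str) -> List[CmError]:
    """Return the data rows of the table, skipping the header and other text."""
    rows = []
    for line in output.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append(CmError(m.group(1), int(m.group(2)), m.group(3), m.group(4)))
    return rows


def trigger_commands(output: str, module_id: int, level: str = "Major") -> List[str]:
    """Build one trigger command per error of the given level.

    Assumed argument order (taken from step 4): error ID, PFE, name, module ID.
    """
    return [
        f"test cmerror trigger-error {e.error_id} {e.pfe} {e.name} {module_id}"
        for e in parse_cmerror_rows(output)
        if e.level == level
    ]


# Sample copied from step 3 of this thread.
SAMPLE = """\
Error-id  PFE  Level  Threshold  Count  Occured  Cleared  Last-occurred(ms ago)  Name
0x2c0002  0    Major  1          0      0        0        0                      CPU_CMERROR_STORAGE_MSATA_DISABLED
0x2c0001  0    Minor  1          0      0        0        0                      CPU_CMERROR_STORAGE_SMARTD_ERROR
0x2c0003  0    Minor  1          0      0        0        0                      CPU_CMERROR_STORAGE_ACCESS_ERROR
"""

if __name__ == "__main__":
    for cmd in trigger_commands(SAMPLE, module_id=5):
        print(cmd)
```

With the sample from step 3 and module ID 5, this yields the same command as step 4. As with the manual procedure, this is for lab use only.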

nguyenduchoa37 commented 1 month ago

Sorry for the late reply due to missing your email.

I will test and update you soon.

door7302 commented 4 weeks ago

Any updates?

nguyenduchoa37 commented 4 weeks ago

The result is as expected.

@.**_AGG02_TEST_SRT_ZTE> show chassis alarms
3 alarms currently active
Alarm time               Class  Description
2024-08-08 09:39:27 +07  Major  FPC 2 Major Errors
2024-08-07 13:57:16 +07  Minor  CB 0 Removed
2024-08-05 14:19:29 +07  Minor  Backup RE Active

But may I know which mechanism Grafana uses to show this error? Is it still streamed via gRPC? I ask because the notification does not appear immediately; it still needs some time to refresh.

Thanks.
Regards,
Nguyen Duc Hoa (Mr)
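
To put a number on that refresh delay, one option is to parse the `show chassis alarms` output above and compare the alarm time with the time the notification was seen in Grafana. The Python sketch below assumes the output layout shown in this thread. The "seen at" timestamp has to be noted by hand and entered in the router's local time, and the helper names are hypothetical. It says nothing about how OpenJTS itself collects the alarm.

```python
import re
from datetime import datetime
from typing import List, NamedTuple


class Alarm(NamedTuple):
    raised_at: datetime  # naive local time, exactly as printed by the router
    tz: str              # time zone label as printed, e.g. "+07" or "PDT"
    severity: str        # "Major" or "Minor"
    description: str


# Data line layout: date, time, time zone label, class, free-text description.
ALARM_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\S+)\s+(Major|Minor)\s+(.+)$"
)


def parse_chassis_alarms(output: str) -> List[Alarm]:
    """Return one Alarm per data line; header and summary lines are skipped."""
    alarms = []
    for line in output.splitlines():
        m = ALARM_LINE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            alarms.append(Alarm(ts, m.group(2), m.group(3), m.group(4).strip()))
    return alarms


def seconds_until_seen(alarm: Alarm, seen_at: datetime) -> float:
    """Delay between the alarm time and a manually noted local 'seen' time."""
    return (seen_at - alarm.raised_at).total_seconds()


# Sample copied from the output in this thread.
SAMPLE = """\
3 alarms currently active
Alarm time               Class  Description
2024-08-08 09:39:27 +07  Major  FPC 2 Major Errors
2024-08-07 13:57:16 +07  Minor  CB 0 Removed
2024-08-05 14:19:29 +07  Minor  Backup RE Active
"""

if __name__ == "__main__":
    major = [a for a in parse_chassis_alarms(SAMPLE) if a.severity == "Major"]
    # Hypothetical observation time, for illustration only.
    seen = datetime(2024, 8, 8, 9, 40, 27)
    print(major[0].description, seconds_until_seen(major[0], seen))
```

The time zone label is kept as plain text rather than converted, since abbreviations such as PDT cannot be parsed reliably; both timestamps are simply treated as local times in the same zone.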
