chexma / checkmk_plugins


After update E-Series NetApp the check failed #10

Open RobSwoss opened 1 month ago

RobSwoss commented 1 month ago

After updating our NetApps to SANtricity OS 11.80.1R2, the check fails with the following error:

[special_netappeseries] requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Traceback (most recent call last):
  File "/omd/sites/mon/lib/python3.12/site-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mon/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mon/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mon/lib/python3.12/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/omd/sites/mon/lib/python3/cmk/special_agents/v0_unstable/agent_common.py", line 149, in _special_agent_main_core
    return main_fn(args)
           ^^^^^^^^^^^^^
  File "/omd/sites/mon/local/lib/python3/cmk/plugins/netapp_eseries/special_agents/agent_netappeseries.py", line 514, in agent_netapp_eseries_main
    fetch_storage_data(session, sections, args, base_url, controller_ids)
  File "/omd/sites/mon/local/lib/python3/cmk/plugins/netapp_eseries/special_agents/agent_netappeseries.py", line 160, in fetch_storage_data
    ).json()
      ^^^^^^
  File "/omd/sites/mon/lib/python3.12/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Thanks for your help. Robin

chexma commented 1 month ago

Hi Rob,

Based on the given output, I can't see what the cause is. As I have no systems on that version yet, I will have to download the simulator and see whether I can reproduce the error. That will take some time.

chexma commented 1 month ago

Hi Rob,

Unfortunately, the E-Series simulator is not available in the new version, so I can't test it that way. You can send me the output of the special agent with the --debug flag added, but I can't promise that I will find the error that way.
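
Until debug output is available, it may help to look at what the array returned before `.json()` is called on it. The following is a minimal sketch and not the plugin's code: `parse_json_or_explain` is a hypothetical helper that takes the status code, `Content-Type` header and body text (the values a `requests.Response` exposes as `status_code`, `headers` and `text`). The 404 and `text/html` in the example are made-up values for illustration, not what the array actually sends.

```python
import json


def parse_json_or_explain(status_code: int, content_type: str, body: str):
    """Parse a JSON body, or raise a ValueError that describes the response.

    Hypothetical helper for illustration; not part of the plugin.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # Show a short preview so an HTML error page or an empty body is obvious.
        preview = body[:200] if body else "<empty body>"
        raise ValueError(
            f"Response is not JSON (HTTP {status_code}, "
            f"Content-Type {content_type!r}): {preview}"
        ) from exc


# An empty body triggers the same "Expecting value" decode failure as in the
# traceback, but the raised error now carries status code and content type.
# Status code and content type below are illustrative assumptions.
try:
    parse_json_or_explain(404, "text/html", "")
except ValueError as err:
    message = str(err)
    print(message)
```

Wrapping the failing call this way would not fix the check, but the agent output would then show whether the array answered with an error page, an empty body or something else.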

exhaustivesolving commented 1 month ago

Hello chexma,

We have about 4x E-Series in production and this started popping up right after we upgraded to 11.80.1R2. I fired up a lab site, installed the extension and pointed it at the monitoring user on one of the e-series systems to get you info. Here is the debug output - please let me know if there is additional info I could provide that might help.

OMD[LAB220]:~$ cmk -nvvp --debug eseries_test
Checkmk version 2.2.0p35
+ FETCHING DATA
  Source: SourceInfo(hostname='eseries_test', ipaddress='192.168.1.10', ident='special_netappeseries', fetcher_type=<FetcherType.SPECIAL_AGENT: 6>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7fb84a6133d0]
Read from cache: AgentFileCache(eseries_test, path_template=/omd/sites/LAB220/tmp/check_mk/data_source_cache/special_netappeseries/{hostname}, max_age=MaxAge(checking=0, discovery=90.0, inventory=90.0), simulation=False, use_only_cache=False, file_cache_mode=6)
Not using cache (does not exist)
[ProgramFetcher] Execute data source
Calling: /omd/sites/LAB220/local/share/check_mk/agents/special/agent_netappeseries -u monitor -s '<REDACTED>' --sections batteries,controllers,drawers,drives,esms,fans,interfaces,pools,powerSupplies,system,thermalSensors,trays,volumes 192.168.1.10
[cpu_tracking] Stop [7fb84a6133d0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.28, children_system=0.02, elapsed=0.41999999806284904))]
  Source: SourceInfo(hostname='eseries_test', ipaddress='192.168.1.10', ident='piggyback', fetcher_type=<FetcherType.PIGGYBACK: 4>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7fb84a2fb490]
Read from cache: NoCache(eseries_test, path_template=/dev/null, max_age=MaxAge(checking=0.0, discovery=0.0, inventory=0.0), simulation=False, use_only_cache=False, file_cache_mode=1)
[PiggybackFetcher] Execute data source
No piggyback files for 'eseries_test'. Skip processing.
No piggyback files for '192.168.1.10'. Skip processing.
[cpu_tracking] Stop [7fb84a2fb490 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
+ PARSE FETCHER RESULTS
  HostKey(hostname='eseries_test', source_type=<SourceType.HOST: 1>)  -> Not adding sections: Agent exited with code 1: Traceback (most recent call last):
  File "/omd/sites/LAB220/lib/python3.11/site-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/LAB220/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/LAB220/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/LAB220/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/omd/sites/LAB220/local/share/check_mk/agents/special/agent_netappeseries", line 501, in <module>
    main()
  File "/omd/sites/LAB220/local/share/check_mk/agents/special/agent_netappeseries", line 497, in main
    fetch_storage_data(session, sections, args, base_url, controller_ids)
  File "/omd/sites/LAB220/local/share/check_mk/agents/special/agent_netappeseries", line 154, in fetch_storage_data
    verify=args.verify_ssl).json()
                            ^^^^^^
  File "/omd/sites/LAB220/lib/python3.11/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
  HostKey(hostname='eseries_test', source_type=<SourceType.HOST: 1>)  -> Add sections: []
Received no piggyback data
[cpu_tracking] Start [7fb849b242d0]
value store: synchronizing
Trying to acquire lock on /omd/sites/LAB220/tmp/check_mk/counters/eseries_test
Got lock on /omd/sites/LAB220/tmp/check_mk/counters/eseries_test
value store: loading from disk
Releasing lock on /omd/sites/LAB220/tmp/check_mk/counters/eseries_test
Released lock on /omd/sites/LAB220/tmp/check_mk/counters/eseries_test
No piggyback files for 'eseries_test'. Skip processing.
No piggyback files for '192.168.1.10'. Skip processing.
[cpu_tracking] Stop [7fb849b242d0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.010000001639127731))]
[special_netappeseries] requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)(!!), [piggyback] Success (but no data found for this host), execution time 0.4 sec | execution_time=0.430 user_time=0.000 system_time=0.000 children_user_time=0.280 children_system_time=0.020 cmk_time_ds=0.120 cmk_time_agent=0.000
Agent exited with code 1: Traceback (most recent call last):
  File "/omd/sites/LAB220/lib/python3.11/site-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/LAB220/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/LAB220/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/LAB220/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/omd/sites/LAB220/local/share/check_mk/agents/special/agent_netappeseries", line 501, in <module>
    main()
  File "/omd/sites/LAB220/local/share/check_mk/agents/special/agent_netappeseries", line 497, in main
    fetch_storage_data(session, sections, args, base_url, controller_ids)
  File "/omd/sites/LAB220/local/share/check_mk/agents/special/agent_netappeseries", line 154, in fetch_storage_data
    verify=args.verify_ssl).json()
                            ^^^^^^
  File "/omd/sites/LAB220/lib/python3.11/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)(!!)

exhaustivesolving commented 1 month ago

Update:

We've isolated the problem to the "controllers" section. As a workaround, unselecting the "controllers" section allows the extension to function and monitor as normal in check_mk.
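
For illustration, this is roughly what the workaround does to the `--sections` argument seen in the agent call in the debug output. It is only a sketch: the section names are copied from that call, while `sections_argument` is a hypothetical helper and not something the plugin provides.

```python
# Section names as passed to the special agent in the debug output.
ALL_SECTIONS = [
    "batteries", "controllers", "drawers", "drives", "esms", "fans",
    "interfaces", "pools", "powerSupplies", "system", "thermalSensors",
    "trays", "volumes",
]


def sections_argument(excluded):
    """Build the comma-separated --sections value without the excluded names."""
    excluded = set(excluded)
    return ",".join(name for name in ALL_SECTIONS if name not in excluded)


# Everything except the failing "controllers" section.
workaround = sections_argument({"controllers"})
print(workaround)
```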

Comparing:

            name="controllers",
            uri="/controllers",
            perfdata_uri="/analysed-controller-statistics",
            perfdata_identifier="controllerId",

with the web API on the device, I can't find these URIs.

The API methods:

/storage-systems/{system-id}/analyzed/controller-statistics
/storage-systems/{system-id}/controller-statistics/{idlist} (deprecated)

seem to be related, but perhaps the API endpoints changed on the NetApp side. I am opening a ticket with NetApp to get details on the changes, since 11.80.1R2 either doesn't have release info or I'm failing to find it.
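
To make the difference concrete, here is a small sketch that builds the performance-data URL from the configured `perfdata_uri` and from the path listed in the device's web API. How the plugin really joins these parts is not shown in this thread, so `perfdata_url` is a hypothetical helper and `BASE_URL` is a placeholder.

```python
def perfdata_url(base_url: str, system_id: str, perfdata_uri: str) -> str:
    """Join base URL, storage-system id and perfdata path into one URL.

    Hypothetical helper; the plugin may assemble its URLs differently.
    """
    return "/".join(
        [base_url.rstrip("/"), "storage-systems", system_id, perfdata_uri.lstrip("/")]
    )


BASE_URL = "https://array.example/api"  # placeholder, not a real endpoint

# Path currently configured for the "controllers" section.
old_url = perfdata_url(BASE_URL, "1", "/analysed-controller-statistics")
# Path listed in the web API on the device.
new_url = perfdata_url(BASE_URL, "1", "/analyzed/controller-statistics")

print(old_url)
print(new_url)
```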

chexma commented 1 month ago

@exhaustivesolving Wow, thanks for the analysis! Yes, the output of the NetApp API seems to have changed, which should not happen with a versioned API in a minor upgrade.

chexma commented 1 month ago

As a side note:

https://kb.netapp.com/Support_Bulletins/Customer_Bulletins/SU570

Affected models
• E-Series Systems: E2800, E5700, EF280, EF570, EF300, EF600
• StorageGRID Appliances: SGF6024, SG6060 and SG6160

Workaround
For systems running any of the affected releases:
• Do not upgrade drive firmware until a fix is available in SANtricity OS.
OR
• Perform offline drive firmware upgrade
  o For StorageGRID appliances, please visit this page for detailed instructions: <removed>

chexma commented 1 month ago

        name="controllers",
        uri="/controllers",
        perfdata_uri="/analysed-controller-statistics",
        perfdata_identifier="controllerId",

Did they rename analysed-controller-statistics to analyzed-controller-statistics? Maybe you can try to rewrite the perfdata_uri to the new path.

AlexanderGabrielBruchsal commented 4 weeks ago

Maybe you can try to rewrite the perfdata_uri to the new path.

We are affected by this issue, too. When I disable the controller check, the error goes away. I'd like to test rewriting the URL, but I don't get it... rewrite to what?

It looks like this:

        name="controllers",
        uri="/controllers",
        perfdata_uri="/analysed-controller-statistics",
        perfdata_identifier="controllerId",

chexma commented 4 weeks ago

Hi,

Unfortunately, I have no system running on that firmware yet. You can try to change perfdata_uri to /analyzed/controller-statistics.

chexma commented 4 weeks ago

If someone has the chance to fetch the analyzed controller statistics data from the API, e.g. with Postman, I could try to fix the problem without direct access.

AlexanderGabrielBruchsal commented 4 weeks ago

Hi,

I changed the URL, but it did not work. But I found the API doc :) Executing this:

curl -X GET "https://HOSTNAME/devmgr/v2/storage-systems/1/analyzed/controller-statistics?statisticsFetchTime=60" -H "accept: application/json"

returns this:

{
  "statistics": [
    {
      "observedTime": "2024-10-31T07:56:04.000+00:00",
      "observedTimeInMS": "1730361364000",
      "sourceController": "CONTROLLERID",
      "readIOps": 37.24333333333333,
      "writeIOps": 20.243333333333332,
      "otherIOps": 0,
      "combinedIOps": 57.486666666666665,
      "readThroughput": 3.9289347330729165,
      "writeThroughput": 0.12590726216634116,
      "combinedThroughput": 4.054841995239258,
      "readResponseTime": 13.0910230516201,
      "readResponseTimeStdDev": 165.25260443364118,
      "writeResponseTime": 0.06098923608202972,
      "writeResponseTimeStdDev": 0.6451068423419701,
      "combinedResponseTime": 6.680828394941147,
      "combinedResponseTimeStdDev": 156.00596177799304,
      "averageReadOpSize": 110618.09719860378,
      "averageWriteOpSize": 6521.81788243043,
      "readOps": 11173,
      "writeOps": 6073,
      "readPhysicalIOps": 37.769999999999996,
      "writePhysicalIOps": 19.930000000000007,
      "controllerId": "CONTROLLERID",
      "cacheHitBytesPercent": 1.5127675037219444,
      "randomIosPercent": 35.5233400985793,
      "mirrorBytesPercent": 0,
      "fullStripeWritesBytesPercent": 0,
      "maxCpuUtilization": 38,
      "maxCpuUtilizationPerCore": [
        38
      ],
      "cpuAvgUtilization": 37.18333333333333,
      "cpuAvgUtilizationPerCore": [
        37.18333333333333
      ],
      "cpuAvgUtilizationPerCoreStdDev": [
        0.3869395588750369
      ],
      "raid0BytesPercent": 0,
      "raid1BytesPercent": 0,
      "raid5BytesPercent": 0,
      "raid6BytesPercent": 0,
      "ddpBytesPercent": 3.1051089614383836,
      "readHitResponseTime": 0.0025785714285714283,
      "readHitResponseTimeStdDev": 0.002450188273240292,
      "writeHitResponseTime": 0.06098923608202972,
      "writeHitResponseTimeStdDev": 0.06098923608202972,
      "combinedHitResponseTime": 0.06048610153885449,
      "combinedHitResponseTimeStdDev": 0.0604858827594107,
      "maxPossibleBpsUnderCurrentLoad": 4847683500,
      "maxPossibleIopsUnderCurrentLoad": 216905
    }
  ],
  "tokenId": null
}

Maybe the structure of the response or the parameters changed? statisticsFetchTime is a required field: "The number of seconds of historical statistics data to retrieve. After the initial query has started (a token has been provided), this value is ignored."
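
If the structure is what changed, a parser could be written so that it tolerates both shapes. The sketch below is based only on the response pasted above, trimmed to a few fields. It assumes, without confirmation, that older firmware returned a bare list of per-controller entries. `controller_stats_by_id` is a hypothetical helper, not the plugin's code.

```python
import json

# Trimmed copy of the response above; values taken from that output.
SAMPLE = """
{
  "statistics": [
    {
      "controllerId": "CONTROLLERID",
      "readIOps": 37.24333333333333,
      "writeIOps": 20.243333333333332,
      "cpuAvgUtilization": 37.18333333333333
    }
  ],
  "tokenId": null
}
"""


def controller_stats_by_id(payload, identifier="controllerId"):
    """Index per-controller statistics by their identifier field.

    Accepts the wrapped form shown above ({"statistics": [...]}) and,
    as an unconfirmed assumption about older firmware, a bare list.
    """
    if isinstance(payload, dict):
        entries = payload.get("statistics", [])
    else:
        entries = payload
    return {entry[identifier]: entry for entry in entries}


stats = controller_stats_by_id(json.loads(SAMPLE))
print(sorted(stats))
print(stats["CONTROLLERID"]["cpuAvgUtilization"])
```

The `identifier` default matches the `perfdata_identifier="controllerId"` setting quoted earlier in this thread, and the field is still present in each entry of the new response.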