Icinga / icinga-powershell-plugins

A collection of Windows check plugins for the Icinga PowerShell Framework
GNU General Public License v2.0

Add DiskHealth support for cluster systems #90

Closed ErwinE closed 3 years ago

ErwinE commented 4 years ago

Hello,

the "IcingaCheckDiskHealth" check reports warnings on our virtualization clusters. This is because there are always disks which are "offline" or "readonly" and which only come online when a specific node uses them. We can't exclude these disks/partitions, because the usage of each disk changes from time to time.

[screenshot]

We need an option that tells the check not to report warnings for disks that are "Offline" or "ReadOnly".

Thanks!

Best Regards, Erwin

LordHepipud commented 4 years ago

Hello and thank you for the issue. Would you be able to test the plugin changes in the linked PR?

I added two new flags to the plugin which tell it to return Ok instead of Warning if a disk is in one of these specific states.

ErwinE commented 4 years ago

Hello LordHepipud,

these options work for us now. Thanks!

There is a second thing which we noticed on our cluster systems. On some systems, every version of the script (new and old) produces an error. The check seems to work and the result is "OK", but the error still appears in the output. Do you have a guess where this error comes from?

Exception calling "Join" with "2" argument(s): "Value cannot be null. Parameter name: values"
At C:\Program Files\WindowsPowerShell\Modules\icinga-powershell-plugins\plugins\Invoke-IcingaCheckDiskHealth.psm1:201 char:17

Best Regards, Erwin

LordHepipud commented 4 years ago

Thank you very much - I will look into this issue. I assume this issue does not occur if you run the plugin with a local admin user? If so, I would assume it is caused by an uncaught permission issue.
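For readers following along: the error message above is what a join call reports when the collection it receives is null. Line 201 of the plugin is not shown in this thread, so the following is only a minimal sketch of a generic guard, not the plugin's actual code. The function name `Join-StatusValues` and the fallback text `Unknown` are hypothetical.

```powershell
# Hypothetical helper (not part of the Icinga plugins): join status values
# without failing when the provider returned nothing, e.g. due to
# missing permissions.
function Join-StatusValues {
    param (
        [object[]]$Values  = @(),
        [string]$Separator = ', '
    );

    # Report a missing or empty collection explicitly instead of
    # handing it to a join call
    if ($null -eq $Values -or $Values.Count -eq 0) {
        return 'Unknown';
    }

    return ($Values -join $Separator);
}

Join-StatusValues -Values @('OK', 'Degraded');
Join-StatusValues -Values $null;
```

The first call joins the two values, the second returns the fallback text instead of raising an exception.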

ErwinE commented 3 years ago

Yes, we tested it with admin privileges and the error disappeared. Thanks for your efforts!

LordHepipud commented 3 years ago

After some digging I realised that I'm not sure whether this issue is still a bug. We have already made some changes to the disk provider to ensure that the operational status should, in general, not be empty.

Are you running the latest master of the PowerShell Plugins and the PowerShell Framework? If not, would you be able to test this again?

In case the bug still occurs, we will have to dig a little deeper.

Thank you in advance!

ErwinE commented 3 years ago

Hello,

we only tested the 1.2.0 version and added the code for ignoring offline/readonly disks. I wasn't aware that the code in master had changed regarding this problem.

We have now tested it with the master version and the error disappeared. I am closing this issue now.

Thank you very much!

ErwinE commented 3 years ago

Unfortunately it's not working directly with Icinga2 and IcingaWeb2. The error disappeared but there are now new critical alerts:

Output in Icingaweb2 with 1.2.0 version: [two screenshots]

Output in Icingaweb2 with master version (no PowerShell error): [screenshot]

Output in PowerShell with admin and master version (no PowerShell error): [screenshot]

LordHepipud commented 3 years ago

Interesting, so the operational status seems to be empty. Can you please check which user is running the check on the Icinga Agent? Is it the default one, NT Authority\NetworkService, or a different one? Are the permissions set properly for the required WMI trees? In general, this case should be covered, except when not all permissions are set.

Can you please modify the plugin .psm1 to run the following code after the param() call:

Write-IcingaConsoleNotice '{0}' -Objects (Get-IcingaWindowsInformation MSFT_PhysicalDisk -Namespace 'root\Microsoft\Windows\Storage' | Out-String);

This will print the entire output directly into the plugin output. There is a member OperationalStatus which should contain a value. If the value is empty, it might be that the user does not have full access to the WMI tree root\Microsoft\Windows\Storage.

In this case it is required to add permissions to this object including all sub-objects. We also added Knowledge Base Article IWKB000001 for this.

Please note: The WMI management commands will be shipped with 1.3.0 and are currently only present in master.
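As a rough illustration of the check described above, the sketch below lists the disks whose OperationalStatus came back empty. It is an assumption-laden example, not framework code: the function name `Get-DiskWithoutOperationalStatus` is hypothetical, and the commented-out usage relies on the standard `Get-CimInstance` cmdlet rather than on `Get-IcingaWindowsInformation`.

```powershell
# Hypothetical helper (not part of the Icinga framework): returns every
# disk object whose OperationalStatus is $null or an empty collection.
function Get-DiskWithoutOperationalStatus {
    param (
        [object[]]$Disks = @()
    );

    foreach ($disk in $Disks) {
        $status = $disk.OperationalStatus;

        # $null must be tested separately, because @($null) still
        # counts as one element
        if ($null -eq $status -or @($status).Count -eq 0) {
            $disk;
        }
    }
}

# Usage on a Windows host, run as the same user as the Icinga Agent service:
# $disks = Get-CimInstance -Namespace 'root\Microsoft\Windows\Storage' -ClassName 'MSFT_PhysicalDisk';
# Get-DiskWithoutOperationalStatus -Disks $disks | Select-Object DeviceId, FriendlyName;
```

If the same query returns a status when run as an administrator but not when run as the service user, missing WMI permissions are a likely cause.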

LordHepipud commented 3 years ago

I also just made a PR into the Icinga PowerShell Framework Master for future, easier troubleshooting: #155.

With this you can now simply enable the debug mode of the Framework and check the event log for all data fetched over WMI/CIM.

ErwinE commented 3 years ago

DiskHealth_Output.txt

I uploaded the output of the "Write-IcingaConsoleNotice" command. I only censored the hostname. For security reasons, we change the user that runs Icinga2 to LocalService by default during the Icinga installation, and until now we have never needed the permissions of NetworkService. We have only a few systems where LocalSystem is needed. Our virtualization admin also told me that the critical alert about disk 5 now appears in the admin PowerShell console as well. Either the new line changed something, or the last time we checked, the PowerShell console had somehow cached the 1.2.0 PowerShell Plugins and used them instead of the master files.

LordHepipud commented 3 years ago

As far as I can tell, the problem must be located somewhere in between. From what I see in the output, disk 5 is present in the check output, but the MSFT_PhysicalDisk class does not seem to know about it. The output contains disks with device IDs 0-4, which is 5 disks in total.

Right now I'm wondering whether the mapping of the devices is correct, but this is hard to test without access to a system showing the issue.

Do you have a list of disks available for a system with such an issue? I would like to analyse the behaviour more.

Could you please also post the content for

Get-IcingaWindowsInformation Win32_DiskDrive

and

Get-IcingaWindowsInformation Win32_LogicalDisk -Filter 'DriveType = 3';

Thank you in advance!

ErwinE commented 3 years ago

Hi,

here is the information you requested:

PS C:\Windows\system32> Get-IcingaWindowsInformation Win32_DiskDrive

DeviceID           Caption                             Partitions Size          Model                              
--------           -------                             ---------- ----          -----                              
\\.\PHYSICALDRIVE4 DELL MD34xx  Multi-Path Disk Device 2          2064545280    DELL MD34xx  Multi-Path Disk Device
\\.\PHYSICALDRIVE3 DELL MD34xx  Multi-Path Disk Device 2          7613993640960 DELL MD34xx  Multi-Path Disk Device
\\.\PHYSICALDRIVE0 DELL PERC H330 Adp SCSI Disk Device 3          399427822080  DELL PERC H330 Adp SCSI Disk Device
\\.\PHYSICALDRIVE1 DELL MD34xx  Multi-Path Disk Device 2          3828950092800 DELL MD34xx  Multi-Path Disk Device
\\.\PHYSICALDRIVE2 DELL MD34xx  Multi-Path Disk Device 2          7586628134400 DELL MD34xx  Multi-Path Disk Device

PS C:\Windows\system32> Get-IcingaWindowsInformation Win32_LogicalDisk -Filter 'DriveType = 3';

DeviceID DriveType ProviderName VolumeName Size         FreeSpace   
-------- --------- ------------ ---------- ----         ---------   
C:       3                                 398836363264 296345534464

LordHepipud commented 3 years ago

Thank you very much!

Honestly I'm not quite sure, but I believe the mapping of the different pieces of information is wrong. I have no idea why there is a Disk #5 on this system, while the range clearly goes from 0-4 only.

Could you please (sorry to bother you) run this command and share the output?

$testdata = Join-IcingaPhysicalDiskDataPerfCounter -DiskCounter @(
    '\PhysicalDisk(*)\disk read bytes/sec',
    '\PhysicalDisk(*)\disk write bytes/sec',
    '\PhysicalDisk(*)\disk reads/sec',
    '\PhysicalDisk(*)\disk writes/sec',
    '\PhysicalDisk(*)\avg. disk sec/read',
    '\PhysicalDisk(*)\avg. disk sec/write',
    '\PhysicalDisk(*)\avg. disk sec/transfer',
    '\PhysicalDisk(*)\current disk queue length',
    '\PhysicalDisk(*)\avg. disk queue length'
);

foreach ($entry in $testdata.Keys) {
     $Disk = $testdata[$entry];
     Write-Host '#### Disk' $entry;
     Write-Output ($Disk.Data | Out-String )
     Write-Output ($Disk.PerfCounter.values)
}

With this I can better understand which data is collected and how it is combined later on, based on the previously shared information.

Sorry for the trouble!

ErwinE commented 3 years ago

No problem and thank you for your help!

I put the long output in a text file and uploaded it: Output.txt

Best Regards, Erwin

LordHepipud commented 3 years ago

Thank you very much for the data!

I'm just wondering why the plugin is outputting a Disk #5 while all the data only contains disks up to #4. Is there anything I'm missing here?
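The mismatch discussed here can be made explicit by comparing the disk IDs reported by the check with the device indexes from Win32_DiskDrive. The sketch below is only an illustration: `Get-UnknownDiskId` is a hypothetical helper, and the list of check IDs is typed in by hand, since the check output is only available as a screenshot in this thread.

```powershell
# Hypothetical helper (not part of the Icinga plugins): returns the disk
# IDs from the check output that have no matching Win32_DiskDrive device.
function Get-UnknownDiskId {
    param (
        [string[]]$CheckDiskIds    = @(),
        [string[]]$DiskDriveDevice = @()
    );

    # Extract the trailing index, e.g. '\\.\PHYSICALDRIVE4' -> '4'
    $known = @();
    foreach ($device in $DiskDriveDevice) {
        if ($device -match 'PHYSICALDRIVE(\d+)$') {
            $known += $Matches[1];
        }
    }

    foreach ($id in $CheckDiskIds) {
        if ($known -notcontains $id) {
            $id;
        }
    }
}

# DeviceID values as posted earlier in this thread
$devices = @(
    '\\.\PHYSICALDRIVE4',
    '\\.\PHYSICALDRIVE3',
    '\\.\PHYSICALDRIVE0',
    '\\.\PHYSICALDRIVE1',
    '\\.\PHYSICALDRIVE2'
);

# Assumed list of IDs from the check output (0-5)
Get-UnknownDiskId -CheckDiskIds @('0', '1', '2', '3', '4', '5') -DiskDriveDevice $devices;
```

With the device list from this thread and the assumed check IDs, the only ID without a matching device is 5.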

ErwinE commented 3 years ago

I have no idea either. My only suggestion would be that, before we waste more time, we schedule a half-hour meeting together (me, our virtualization admin and you) and have a look at the system. We could do that in Teams and in German.

Just let me know if that would be an option for you and, if so, which dates would suit you. I would then send a private invitation to your e-mail address.

Thank you very much!

LordHepipud commented 3 years ago

Yes we can do that. I would be available today - just send me an invite please.

ErwinE commented 3 years ago

Okay, thanks!

I sent the invite to the contact address which I found on your website.

LordHepipud commented 3 years ago

Session done. The problem is actually a disk not providing its status, which we agreed is fine in this case. Excluding these specific disks resolves the issue.

Thanks for the testing and the feedback!