NagiosEnterprises / ncpa

Nagios Cross-Platform Agent

Wrong disk readings after resizing AIX filesystem #797

Open aljazlipi opened 3 years ago

aljazlipi commented 3 years ago

Hi.

With Nagios XI (version 5.8.2) we are monitoring an AIX server (version 7200-05-02-2114). After extending a filesystem on this AIX server from 15 TB to 16 TB, we are getting wrong filesystem usage for /OraData. For all other filesystems, disk usage shows correct values. The filesystem is JFS2.

Here is the output on AIX server:

server@user /home/user# df -g
Filesystem                           GB blocks    Free %Used   Iused %Iused Mounted on
/dev/hd4                                  0,88    0,38   57%   18194    17% /
/dev/hd2                                  6,12    1,09   83%   77560    22% /usr
/dev/hd9var                               2,50    0,91   64%   15507     7% /var
/dev/hd3                                  4,00    3,90    3%    2563     1% /tmp
/dev/hd1                                  2,12    1,20   44%    2024     1% /home
/proc                                        -       -     -       -      - /proc
/dev/hd10opt                              3,12    1,75   45%   24108     6% /opt
/dev/livedump                             0,25    0,25    1%       4     1% /var/adm/ras/livedump
/dev/nmonlv                               5,00    2,31   54%     800     1% /nmon
/dev/orahomelv                           75,00   36,29   52%  286527     4% /OraBase1
/dev/oradatalv                        16529,00  905,62   95%     654     1% /OraData
/dev/oraarchloglv                       765,00  590,33   23%     731     1% /OraArchlog
/dev/oradiaglv                           50,00   49,59    1%    3578     1% /OraDiag
/dev/oraflashlv                         150,00   92,52   39%     322     1% /OraFlash
/dev/hd11admin                            0,12    0,12    1%      11     1% /admin
/dev/RedoLog1lv                          25,00    4,87   81%      14     1% /RedoLog1
/dev/RedoLog2lv                          25,00    4,87   81%      14     1% /RedoLog2
jarovit:/home/razvoj01/prmzav/data      217,31   29,97   87%  271840     4% /home/razvoj01/prmzav/data
jarovit:/home/razvoj01/ppm0/data        217,31   29,97   87%  271840     4% /home/razvoj01/ppm0/data
jarovit:/home/razvoj01/sifranti/data    217,31   29,97   87%  271840     4% /home/razvoj01/sifranti/data
jarovit:/home/razvoj01/skode/data       217,31   29,97   87%  271840     4% /home/razvoj01/skode/data
jarovit:/home/razvoj01/docarch/data     217,31   29,97   87%  271840     4% /home/razvoj01/docarch/data

server@root /# /opt/freeware/bin/du -h -d 0 /OraData -c | grep -Ei 'total'
16T total

Best regards, Aljaž

MrPippin66 commented 3 years ago

If you've purchased the commercial version of Nagios, you should be using the commercial support site.

ericloyd commented 3 years ago

So yes, this is the wrong place for it, but I'm not seeing much difference between 16529G and 16T:

server@user /home/user# df -g | grep OraData
 /dev/oradatalv 16529,00 905,62 95% 654 1% /OraData

server@root /# /opt/freeware/bin/du -h -d 0 /OraData -c | grep -Ei 'total'
16T total

aljazlipi commented 3 years ago

Thank you both for your answers. I was given the URL to this GitHub repository on the Nagios forum, where I first created this issue. Could you be so kind as to share a link to the commercial support site?

ericloyd, as for "not much difference": those figures are from the monitored AIX server. The Nagios XI server shows this for /OraData: CRITICAL: Used disk space was -316.60 % (Used: -459.01 GiB, Free: 603.98 GiB, Total: 144.97 GiB)

All other (smaller) disks are OK.

Thank you and sorry for the trouble.

Best regards, Aljaž

ericloyd commented 3 years ago

You did not provide the Nagios output in your original post. The reason someone suggested coming here is to file a bug report with NCPA. But it may not be a bug in NCPA; it may be a configuration issue within Nagios XI. So you should start there. Check out https://support.nagios.com/forum

MrPippin66 commented 3 years ago

Support for Nagios XI should be in the customer support forum for 'Nagios XI' at:

https://support.nagios.com/forum/

Was that where you originally posted?

aljazlipi commented 3 years ago

Yes, that is where I originally posted, and the people there told me to open a case for the developers here on GitHub.

Sorry, maybe I should have given you more details, but this is my first time here :)

So, here is the thread from Nagios support forum: https://support.nagios.com/forum/viewtopic.php?f=16&t=63008

And here is the summary in case you can not see it there:

Wrong disk readings after resizing disk

Post by alipoglavsek » Thu Jul 08, 2021 7:52 am

Hi.

I have the following situation. We have a Nagios XI server (version 5.8.2) installed on VMware.

The NCPA agent (version 2.2.1) is installed on an AIX server (7200-05-02-2114), and the readings for disk 'OraData' had been correct for half a year. After extending the disk from 15 TB to 16 TB (adding 1 TB), the readings are incorrect.

NCPA Output: CRITICAL: Used disk space was -531.50 % (Used: -770.44 GiB, Free: 915.40 GiB, Total: 144.97 GiB)

Is there some limitation for disk or is this something else?

We did not change the command, it is still the same: check_xi_ncpa!-t '$USER10$' -P 5693 -M 'disk/logical/|OraData' -w '98' -c '99'

Thanks in advance for any information.

Best regards, Aljaž

Re: Wrong disk readings after resizing disk

Post by pbroste » Thu Jul 08, 2021 5:32 pm

Hello Aljaž,

Thanks for reaching out about the disk size issue.

It sounds like the disk partition was resized but the filesystem on the partition was not grown to match. Please check the man pages for resize2fs on this. Here is an article that we also provide.

Thanks, Perry

Re: Wrong disk readings after resizing disk

Post by alipoglavsek » Fri Jul 09, 2021 6:35 am

Hi, Perry.

It seems we have misunderstood each other.

With Nagios XI we are monitoring an AIX server (version 7200-05-02-2114). After extending a filesystem on this AIX server from 15 TB to 16 TB, we are getting wrong filesystem usage for /OraData. For all other filesystems, disk usage shows correct values.

Here is the output on AIX server:

lasko@sa.vk /home/sa.vk# df -g
Filesystem                           GB blocks    Free %Used   Iused %Iused Mounted on
/dev/hd4                                  0,88    0,38   57%   18194    17% /
/dev/hd2                                  6,12    1,09   83%   77560    22% /usr
/dev/hd9var                               2,50    0,91   64%   15507     7% /var
/dev/hd3                                  4,00    3,90    3%    2563     1% /tmp
/dev/hd1                                  2,12    1,20   44%    2024     1% /home
/proc                                        -       -     -       -      - /proc
/dev/hd10opt                              3,12    1,75   45%   24108     6% /opt
/dev/livedump                             0,25    0,25    1%       4     1% /var/adm/ras/livedump
/dev/nmonlv                               5,00    2,31   54%     800     1% /nmon
/dev/orahomelv                           75,00   36,29   52%  286527     4% /OraBase1
/dev/oradatalv                        16529,00  905,62   95%     654     1% /OraData
/dev/oraarchloglv                       765,00  590,33   23%     731     1% /OraArchlog
/dev/oradiaglv                           50,00   49,59    1%    3578     1% /OraDiag
/dev/oraflashlv                         150,00   92,52   39%     322     1% /OraFlash
/dev/hd11admin                            0,12    0,12    1%      11     1% /admin
/dev/RedoLog1lv                          25,00    4,87   81%      14     1% /RedoLog1
/dev/RedoLog2lv                          25,00    4,87   81%      14     1% /RedoLog2
jarovit:/home/razvoj01/prmzav/data      217,31   29,97   87%  271840     4% /home/razvoj01/prmzav/data
jarovit:/home/razvoj01/ppm0/data        217,31   29,97   87%  271840     4% /home/razvoj01/ppm0/data
jarovit:/home/razvoj01/sifranti/data    217,31   29,97   87%  271840     4% /home/razvoj01/sifranti/data
jarovit:/home/razvoj01/skode/data       217,31   29,97   87%  271840     4% /home/razvoj01/skode/data
jarovit:/home/razvoj01/docarch/data     217,31   29,97   87%  271840     4% /home/razvoj01/docarch/data
lasko@sa.vk /home/sa.vk#

Best regards, Aljaž

Re: Wrong disk readings after resizing disk

Post by pbroste » Fri Jul 09, 2021 6:43 pm

Hello Aljaž,

Thanks for following up with the details. We see that your '/dev/oradatalv 16529,00 905,62 95% 654 1% /OraData' line shows the correct total size.

We want to get some further data points from this mount point to determine what is going on.

Please run the following and provide the results:

du -h -d 0 /OraData -c | grep -Ei 'total'

Verbose output on the ncpa command:

/usr/local/nagios/libexec/check_ncpa.py -H [yourhostip_or_name] -t '[your_ncpa_token]' -P 5693 -M 'disk/logical/|OraData' -w '98' -c '99' -v

Thanks, Perry

Re: Wrong disk readings after resizing disk

Post by ssax » Fri Jul 09, 2021 6:53 pm

What type of filesystem is it? (JFS/JFS2/ext4/etc)

Did you restart the ncpa_listener service after the resize to see if that resolves it?

NCPA uses the psutil Python library to get the information. While researching this I saw some 16 TB limits on PPC for memory/JFS/JFS2, but I'm not sure how that translates into what psutil reads from the backend, or whether it's related at all.
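To see what the operating system itself hands to Python on the monitored host, a minimal sketch like the following could be run there. It uses only the standard library's os.statvfs and roughly mirrors how psutil derives disk usage on POSIX systems (total from f_blocks * f_frsize, used as total minus free); the function name statvfs_usage is made up for this illustration, and whether NCPA's AIX build goes through exactly this path is an assumption.

```python
import os


def statvfs_usage(path):
    """Return raw statvfs-based figures for the filesystem containing path."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize   # filesystem size in bytes
    free = st.f_bfree * st.f_frsize     # free bytes, including root-reserved space
    used = total - free                 # goes negative if f_blocks is too small
    return {
        "block_size": st.f_frsize,
        "blocks": st.f_blocks,
        "total": total,
        "used": used,
        "free": free,
    }


if __name__ == "__main__":
    # "/" is only a placeholder; on the AIX host this would be "/OraData".
    info = statvfs_usage("/")
    gib = 1024 ** 3
    print("block size:", info["block_size"], "blocks:", info["blocks"])
    print("total GiB: %.2f  used GiB: %.2f  free GiB: %.2f"
          % (info["total"] / gib, info["used"] / gib, info["free"] / gib))
```

If the total printed here already disagrees with df -g, the problem would sit below NCPA (in the Python build or the system call), not in the Nagios XI configuration.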

Re: Wrong disk readings after resizing disk

Post by alipoglavsek » Thu Jul 29, 2021 7:53 am

Perry, hi.

Sorry for the late reply, but I have been absent.

Here are the outputs you asked for:

dev-srv-devana@root /# /opt/freeware/bin/du -h -d 0 /OraData -c | grep -Ei 'total'
16T total

And the second one, for this command:

/usr/local/nagios/libexec/check_ncpa.py -H [yourhostip_or_name] -t '[your_ncpa_token]' -P 5693 -M 'disk/logical/|OraData' -w '98' -c '99' -v

File returned contained:
{
  "returncode": 2,
  "stdout": "CRITICAL: Used disk space was -448.30 % (Used: -649.84 GiB, Free: 794.80 GiB, Total: 144.97 GiB) | 'used'=-649.84GiB;142;144; 'free'=794.80GiB;142;144; 'total'=144.97GiB;142;144;"
}
CRITICAL: Used disk space was -448.30 % (Used: -649.84 GiB, Free: 794.80 GiB, Total: 144.97 GiB) | 'used'=-649.84GiB;142;144; 'free'=794.80GiB;142;144; 'total'=144.97GiB;142;144;
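The check output above is internally consistent (used + free adds up to total) even though it is physically impossible, which suggests the total is what is wrong. A small sketch of that consistency check follows; the helper names parse_perfdata and sanity_problems are hypothetical and not part of check_ncpa.py, and the regular expression only covers the simple 'label'=value[unit] form seen in this output.

```python
import re

# Matches entries such as 'used'=-649.84GiB (thresholds after ';' are ignored).
PERFDATA_RE = re.compile(
    r"'(?P<label>[^']+)'=(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[A-Za-z%]*)"
)


def parse_perfdata(plugin_output):
    """Return {label: (value, unit)} from the text after the '|' separator."""
    _, _, perf = plugin_output.partition("|")
    return {
        m["label"]: (float(m["value"]), m["unit"])
        for m in PERFDATA_RE.finditer(perf)
    }


def sanity_problems(metrics, tolerance=0.05):
    """List the ways in which used/free/total contradict each other."""
    used, free, total = (metrics[k][0] for k in ("used", "free", "total"))
    problems = []
    if used < 0:
        problems.append("used is negative")
    if free > total:
        problems.append("free exceeds total")
    if abs((used + free) - total) > tolerance:
        problems.append("used + free does not match total")
    return problems


output = (
    "CRITICAL: Used disk space was -448.30 % (Used: -649.84 GiB, "
    "Free: 794.80 GiB, Total: 144.97 GiB) | 'used'=-649.84GiB;142;144; "
    "'free'=794.80GiB;142;144; 'total'=144.97GiB;142;144;"
)
metrics = parse_perfdata(output)
print(sanity_problems(metrics))
```

For the output quoted in this post, the sketch flags a negative used value and free space larger than the total, but not a mismatch between used + free and total.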

BTW: filesystem is JFS2

Best regards, Aljaž

Re: Wrong disk readings after resizing disk

Post by ssax » Fri Jul 30, 2021 12:59 am

Please create a bug report for this here with your AIX system info/oslevel/etc so that the developers can investigate the issue:

https://github.com/NagiosEnterprises/ncpa/issues

You may need to use a plugin with NCPA as a workaround until they release a fix. It has to be related to psutil, because that's where the data is taken from.

See here:

https://exchange.nagios.org/directory/Plugins/Operating-Systems/AIX/AIX-5-2E3-2F6-2E1-2F7-2E1--2D-Check-Filesystems/details

And here:

https://support.nagios.com/kb/article/nagios-xi-using-scripts-plugins-with-ncpa-722.html

Re: Wrong disk readings after resizing disk

Post by alipoglavsek » Wed Aug 04, 2021 10:56 am

Bug issue created, thank you.

BR, Aljaž

MrPippin66 commented 3 years ago

Thanks for the details.

I'll have to let one of the active developers comment, but my initial guess is that this is caused by a boundary issue with the total size of the filesystem, combined with the current AIX Python 2 implementation being 32-bit.

You're on the latest version of NCPA for AIX; they ran into problems compiling the next version with the current AIX Python.

They are holding off on updating the AIX agent until v3.0.0 of NCPA becomes available. That version will be built on Python 3, which will be 64-bit and likely won't have this issue.
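One way to make the 32-bit boundary hypothesis concrete: the sketch below assumes 4 KiB filesystem blocks and a block count truncated to 32 bits, which would cap the representable size at 2^32 * 4096 bytes = 16 TiB. Neither assumption is confirmed in this thread, and the constants and function name are illustrative only. Under those assumptions the 16529 GiB filesystem wraps to 145 GiB, close to the 144.97 GiB that NCPA reported, while a filesystem below 16 TiB is unaffected.

```python
GIB = 1024 ** 3
BLOCK_SIZE = 4096      # assumption: 4 KiB filesystem blocks
COUNTER_BITS = 32      # assumption: block count truncated to 32 bits


def wrapped_total_gib(true_total_gib):
    """Size in GiB that survives if the block count wraps at COUNTER_BITS."""
    true_blocks = int(true_total_gib * GIB) // BLOCK_SIZE
    seen_blocks = true_blocks % (2 ** COUNTER_BITS)
    return seen_blocks * BLOCK_SIZE / GIB


total = wrapped_total_gib(16529.00)    # size from df -g after the resize
free = 794.80                          # free GiB from the verbose check output
used = total - free                    # negative, as in the CRITICAL message

print("wrapped total: %.2f GiB" % total)
print("used:          %.2f GiB" % used)
print("15 TiB stays:  %.2f GiB" % wrapped_total_gib(15 * 1024))
```

The free value is much smaller than 16 TiB, so it would not wrap; that would explain why only the total (and the used value derived from it) looks wrong.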

aka17034 commented 2 years ago

Bumping this issue, as I suspect I am seeing a similar issue after resizing from 14 TB to 30 TB.

The NCPA logical disk check seems to simply return the last values from when the file system was 14 TB, prior to the resize.

Are there any active developers monitoring this issue who'd be able to confirm that this will be fixed in the next release of NCPA for AIX?

And any update on when that next release of NCPA will be available?

Thanks.
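The 14 TB to 30 TB observation would also fit a size that wraps at 16 TiB, since 30 - 16 = 14: the check would not be returning stale values, it would be unable to tell the two sizes apart. The sketch below illustrates that aliasing under the same unconfirmed assumptions (4 KiB blocks, a 32-bit block count), treating the reported sizes as exact TiB figures; reported_tib and aliases are hypothetical names.

```python
TIB = 1024 ** 4
BLOCK_SIZE = 4096                        # assumption: 4 KiB filesystem blocks
WRAP_BYTES = (2 ** 32) * BLOCK_SIZE      # assumption: 32-bit block count, 16 TiB


def reported_tib(true_tib):
    """Size in TiB that would be reported if the byte total wraps at WRAP_BYTES."""
    return (int(true_tib * TIB) % WRAP_BYTES) / TIB


def aliases(observed_tib, max_tib=64):
    """True sizes (whole TiB, up to max_tib) that would all report observed_tib."""
    step = WRAP_BYTES // TIB
    return list(range(observed_tib, max_tib + 1, step))


print(reported_tib(14))   # size before the resize
print(reported_tib(30))   # size after the resize, same reported value
print(aliases(14))        # every size in this list looks identical to the check
```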