Cacti / cacti

Cacti ™
http://www.cacti.net
GNU General Public License v2.0

Realtime not working on remote pollers for certain data query #993

Closed g1augusto closed 6 years ago

g1augusto commented 7 years ago

This is a new issue ticket I opened to continue from an old one unresolved:

https://github.com/Cacti/cacti/issues/711


I apologize for not being able to follow up actively on this, but I believe the issue is still there: realtime graphs work only on sources queried by the main poller.

Can we resume this topic? I think IP SLA realtime graphing is one of the main features of CACTI.

Along with that, I have a feeling that with a bigger cluster the way realtime graphing works can be a bit daunting. It may be better to open an HTTP connection directly to the remote poller rather than to the main one. It would create a dependency, but I believe it would simplify the traffic flow.

cigamit commented 7 years ago

Please set up a separate test environment with just this device on a remote poller. Set up cmd.php as the data collector and let me know whether cmd.php is working. Thanks. As I mentioned before, your primary poller is running spine and doing the collection properly, but Realtime uses an emulation of cmd.php. So, I would expect you to have issues if switching the poller to cmd.php.

Let us know what your findings are. Ball's in your court.

g1augusto commented 7 years ago

Sure, I am going to prepare a separate installation as a test environment.

Please have some patience if I don't reply immediately, as I am also overloaded at my job, but I really appreciate your help.

cigamit commented 7 years ago

No problem. We've got a lot of issues and features to keep us busy. I'm suspecting something either in the communication, or more likely the SNMP API.

g1augusto commented 6 years ago

Hi,

Just wanted to leave a comment on this topic:

The RRD files related to these specific template DS are quite big (on the order of 4 to 6 MB). Given that this issue is seen only on remote pollers, could it be that the realtime process shows no data because it cannot actually produce that data within 10 or 30 seconds?

My understanding is that the poller data is retrieved from the remote poller and sent to the main poller window to show the realtime data. Wouldn't it be better to produce the temporary realtime graph on the remote poller and, via an old-style pop-up, have a window pointing at the remote poller rather than the main poller?

Another alternative is to keep the realtime window on the main poller but not collect the poller data remotely; instead, the remote poller would produce the image and send it (which should be very little data in that case).

cigamit commented 6 years ago

You may have a point. If there are a large number of local_data_ids, say for an aggregate graph, or if you build your graphs from many local data IDs, it might be much faster to have the realtime data stored on the remote server and just request the graph data at the end.

Why don't you take a crack at creating a pull request to achieve this?

g1augusto commented 6 years ago

I am afraid I don't have the necessary knowledge to achieve this by myself.

I would be glad if you could put this into your roadmap of removing dependencies on the main poller.

cigamit commented 6 years ago

Okay, it's on the 76 issue heap until time permits. Right now, it's not a real high priority.

g1augusto commented 6 years ago

Hi and happy new year,

I was looking again into this problem with the realtime graph for a specific template (the IP SLA graph).

I noticed that the issue happens with realtime on remote pollers (even if the remote poller is hosted on the same site and same subnet as the main poller): the poller_realtime table is populated, but all values gathered are 0.

NOTE: This template has 9 items that are gathered (as can be seen from the graph screenshot). I have tried templates that have 1, 2, or 4 items, and realtime worked fine (a bit slow, but OK). So I am thinking this may also be related to the number of items that make up a graph, which ALSO means that more than 1 realtime graph may be too much for a system :(

image image

By contrast, for devices polled directly from the main poller there is no such issue (screenshot below). PS: Is it normal that all of them have the same poller_id value, even for different pollers?

image

netniV commented 6 years ago

Is the IP SLA a template that you have designed? I'm just wondering if I can test anything here to see a similar issue but I can't seem to find that template.

g1augusto commented 6 years ago

Hi netniV,

No, I didn't design it. We retrieved it from the community and used it in 0.8.x versions of CACTI; I reimported it and adapted it to 1.x versions. You can find it by opening the original thread:

https://github.com/Cacti/cacti/issues/711

Thank you for taking some time to look into this, let me know if you need me to provide more info.

netniV commented 6 years ago

Can you at least provide a link to the original template or export yours? I've looked at the plugins section on the website (https://docs.cacti.net/plugins) and this repo but can't seem to see it.

p.s. Yeah read through all that previous log too.

g1augusto commented 6 years ago

If I am not mistaken, it should be this one:

https://forums.cacti.net/viewtopic.php?t=19542

netniV commented 6 years ago

OK, I'll probably add that into my own repo so I can track any changes I make. Is there any chance you can export your template and add it here? Just to make sure that I'm working with the same template, as that template may need to be updated to work with my 1.1.29 version, and the changes I make may not be the ones you made.

g1augusto commented 6 years ago

This is what is currently used by my CACTI 1.1.28 (rename to .xml)

cacti_data_querycisco-_ip_slastatistics-_5_minutes.txt

cisco_saa.txt

netniV commented 6 years ago

Thanks for that. I'm looking into it and will come back to you.

netniV commented 6 years ago

Unfortunately, whilst we do have quite a few Cisco devices to query, none of them are actually using IP SLAs, so I can't test it against them. I will have a manual scan through the templates though and see if I can uncover anything.

g1augusto commented 6 years ago

Understood, and thanks.

If it helps, you could try setting up 2 routers in GNS3, if you are familiar with it.


cigamit commented 6 years ago

@jhonnyx82, did you remove php_snmp from your php on both sides? Just curious.

g1augusto commented 6 years ago

> @jhonnyx82, did you remove php_snmp from your php on both sides? Just curious.

On both sides, meaning on the main and remote pollers? No

This is from tech support - php info in CACTI

image

[PHP Modules] bz2 calendar Core ctype curl date dom ereg exif fileinfo filter ftp gd gettext gmp hash iconv json ldap libxml mbstring mhash mysql mysqli openssl pcntl pcre PDO pdo_mysql pdo_sqlite Phar posix readline Reflection session shmop SimpleXML snmp sockets SPL sqlite3 standard sysvmsg sysvsem sysvshm tokenizer wddx xml xmlreader xmlwriter xsl zip zlib

cigamit commented 6 years ago

Please disable it, and then see if there is any difference. You will have to restart apache on both sides.

g1augusto commented 6 years ago

I disabled the snmp module by commenting out the extension line in /etc/php.d/snmp.ini, as follows:

cat /etc/php.d/snmp.ini 
; Enable snmp extension module
;extension=snmp.so

Running php -m, I no longer see the snmp module listed:

php -m
[PHP Modules] bz2 calendar Core ctype curl date dom ereg exif fileinfo filter ftp gd gettext gmp hash iconv json ldap libxml mbstring mhash mysql mysqli openssl pcntl pcre PDO pdo_mysql pdo_sqlite Phar posix readline Reflection session shmop SimpleXML sockets SPL sqlite3 standard sysvmsg sysvsem sysvshm tokenizer wddx xml xmlreader xmlwriter xsl zip zlib

[Zend Modules]

I restarted the httpd service, and I did all of this on both pollers (main and remote), but the outcome is the same: the data is still 0.
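For anyone repeating this check on several pollers, here is a minimal sketch of how the `php -m` output could be compared automatically. The helper name `loaded_php_modules` and the shortened sample strings are assumptions made for illustration; it only parses text and does not run PHP itself.

```python
import re

def loaded_php_modules(php_m_output):
    """Return the lower-cased module names found in `php -m` output.

    Works whether the modules are printed one per line or flattened
    onto a single line, as in the listings above.
    """
    # Drop section headers such as "[PHP Modules]" and "[Zend Modules]".
    without_headers = re.sub(r"\[[^\]]*\]", " ", php_m_output)
    return {name.lower() for name in without_headers.split()}

# Shortened, illustrative samples of the output before and after the change.
before = "[PHP Modules] bz2 session snmp sockets zlib [Zend Modules]"
after = "[PHP Modules] bz2 session sockets zlib [Zend Modules]"

print("snmp" in loaded_php_modules(before))
print("snmp" in loaded_php_modules(after))
```

The same function can be fed the output captured on the main and the remote poller to confirm that both sides are in the same state.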

cigamit commented 6 years ago

Okay, one more thing: update lib/snmp.php from develop. Then capture a tcpdump on the remote server of its traffic to the device you are polling and, with Wireshark, see whether the correct SNMP data is being requested and returned from the device.

g1augusto commented 6 years ago

I tried as suggested, but unfortunately the outcome is always the same: data with value 0.

From Wireshark I can see that, for the related requests, data is returned to the remote poller, so the issue is probably in how the data is retrieved from the remote poller by the main poller to build the graph (sample below).

One more piece of information: I normally use a remote poller that is geographically in another location and connected through an MPLS VPN, but I also have a "remote" poller installed in the same location as the main poller. I tried using it to query the device and still got value 0 for each OID returned to the main poller (this rules out latency from the remote site).

The only way I can get this to work is to have the main poller query the device directly.

Request : image

Response : image

cigamit commented 6 years ago

That looks like the main data collector output. Put the realtime sample frequency at say every 5 seconds, and look for the individual snmpget requests. You should get one every 5 seconds or so, and it should include only the OID's for the graph.

g1augusto commented 6 years ago

My bad,

Now I made sure to run the test away from a polling cycle, and this is the result:

It seems the remote poller queries only a single OID out of all the ones it should. The rest somehow stay at value 0, from what I read in the realtime_poller_output database table, and I can see that they never got an SNMP get-request. Also, the timing of the queries does not seem to be in sync with the 5-second realtime interval I requested from the web interface.

PS: I can also see that each one is a single get-request. When reworking this, wouldn't it be more efficient to fetch all the necessary OIDs in requests sized by the max OID setting for that device?

image

cigamit commented 6 years ago

The poller realtime will only get what is required to poll the device. It is the most efficient. Are you saying that the device is responding with bad data now? Are these counters or gauges?

g1augusto commented 6 years ago

From the last 2 entries I wrote, I feel that the remote poller always requests the same OID instead of processing all 9. As it happens, that specific OID is usually at 0. My guess is that the remote poller always does the same snmpget and inserts the result for the variables of all the other OIDs (which are never retrieved).

Also, 9 snmpgets with 1 OID each instead of a single snmpget with 9 OIDs does not seem optimal, but of course that's just my opinion (sorry).

cigamit commented 6 years ago

No, that is good information. It'll be a few days before I get back to this though (likely). It's getting late.

cigamit commented 6 years ago

You figured it out. But it's a tricky solution. I have to figure out how to solve this one.

g1augusto commented 6 years ago

Thanks for looking into this.

Whenever you can will be fine; I know you also have other tasks and a life :)

g1augusto commented 6 years ago

Just to confirm our suspicion: I ran a test on a switch interface connecting one of our MPLS CEs, and I noticed that in realtime the upstream and downstream traffic are the same, so only one OID is queried (I can guarantee our traffic there is NOT symmetric):

image

This case also shows that with a smaller number of OIDs making up a graph, the data retrieval somehow works faster, which may support the suggestion of collecting all the necessary OIDs in one SNMP get-request instead of multiple SNMP get-requests.
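The symptom used here (two data sources that should differ showing exactly the same values) can be checked mechanically. Below is a rough sketch under assumptions: `identical_series` is a hypothetical helper, the data source names are only labels, and the sample values are made up. It is not part of Cacti.

```python
from itertools import combinations

def identical_series(samples):
    """Return pairs of data sources whose realtime series are identical.

    samples maps a data source name to the list of values collected for
    it, one value per realtime sample. Identical non-empty series for
    sources that should differ hint that one OID's value was reused.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(samples), 2)
        if samples[a] and samples[a] == samples[b]
    ]

# Made-up values: upstream and downstream match in every sample.
suspicious = {"traffic_in": [120, 340, 90], "traffic_out": [120, 340, 90]}
healthy = {"traffic_in": [120, 340, 90], "traffic_out": [15, 22, 11]}

print(identical_series(suspicious))
print(identical_series(healthy))
```

A match on a single sample can be coincidence, so the comparison is done over the whole series rather than per value.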

cigamit commented 6 years ago

I'm looking at passing the entire list of OIDs in a single remote_agent.php call per host, then returning an array of values. It won't make it into 1.1.34, but I'm allocating some time this weekend to this.
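A rough sketch of the pooling idea, not the actual remote_agent.php code: group the items to poll by host so that each host needs one call carrying its whole OID list. The function name `pool_by_host`, the item fields, the `rrd_name` values, and the `.1` OID index are assumptions for illustration; the host and data IDs are the ones that appear elsewhere in this thread.

```python
from collections import defaultdict

def pool_by_host(items):
    """Group poller items so that each host needs only one remote call.

    items is an iterable of dicts with the keys 'host_id',
    'local_data_id', 'rrd_name' and 'oid' (a hypothetical layout).
    Returns a dict mapping host_id to the list of items for that host.
    """
    pooled = defaultdict(list)
    for item in items:
        pooled[item["host_id"]].append(
            {
                "local_data_id": item["local_data_id"],
                "rrd_name": item["rrd_name"],
                "oid": item["oid"],
            }
        )
    return dict(pooled)

items = [
    {"host_id": 864, "local_data_id": 14188, "rrd_name": "mem_free",
     "oid": ".1.3.6.1.4.1.9.9.48.1.1.1.6.1"},
    {"host_id": 864, "local_data_id": 14188, "rrd_name": "mem_used",
     "oid": ".1.3.6.1.4.1.9.9.48.1.1.1.5.1"},
]

for host_id, host_items in pool_by_host(items).items():
    # One call per host, carrying every OID needed for the graph.
    print(host_id, [i["oid"] for i in host_items])
```

Keeping `rrd_name` next to each OID matters: the values that come back have to be matched to the right data source item, since both items here share one local_data_id.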

g1augusto commented 6 years ago

Thanks again, much appreciated!

cigamit commented 6 years ago

Please update remote_agent.php and cmd_realtime.php from develop and test again. All remote agent calls per host are now being pooled.

g1augusto commented 6 years ago

Sorry for not being able to test earlier. I will run some tests 12 hours from now (tomorrow morning for me, CET).

I saw that there was an issue fixed in #1315.

I will confirm then.

Thanks

cigamit commented 6 years ago

Should be all set now. You have 3 files to update from 1.1.34.

g1augusto commented 6 years ago

Hi,

I applied the 3 files from the develop branch. Devices queried from the main poller look OK, but on the remote ones I get the following log:

image

and the realtime popup window shows this :

image

In case you need them, I have the TCP dumps from the remote poller, but I would prefer to send them to you in private if possible.

cigamit commented 6 years ago

Okay, that is odd. I'm online now. Let me check on that error.

cigamit commented 6 years ago

Oh, sorry, update remote_agent.php and reply again.

cigamit commented 6 years ago

So, the update was cmd_realtime.php, remote_agent.php, realtime.js. I think that was all three.

g1augusto commented 6 years ago

I also updated remote_agent.php and ran a test.

On the remote poller the backtrace log has disappeared, but I can tell that only one of the OIDs is polled for a graph.

This is an example of a memory usage graph (used and free OIDs): only free memory is polled.

image

This DS is built on a query of at least 2 OIDs:

.1.3.6.1.4.1.9.9.48.1.1.1.6.x for Free memory
.1.3.6.1.4.1.9.9.48.1.1.1.5.x for Used memory

But I can see that only free memory is polled (this is a tcpdump file from the remote poller).

image

Let me know if you have any update; I will be able to run more tests only later in the day.

cigamit commented 6 years ago

That is unusual. And not expected. How about the more complicated one? What template is this?

cigamit commented 6 years ago

Run this query too, post the results:

SELECT DISTINCT dtr.local_data_id, dl.host_id
FROM graph_templates_item AS gti
LEFT JOIN data_template_rrd AS dtr ON gti.task_item_id=dtr.id
LEFT JOIN data_local AS dl ON dl.id=dtr.local_data_id
WHERE gti.local_graph_id = ?
AND dtr.local_data_id IS NOT NULL

Replace the ? with the graph id.
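To see what this query returns without touching a production database, here is a small self-contained sketch using SQLite in memory. The three tables are cut down to just the columns the query uses, which is an assumption about the schema and not the real Cacti table definitions; the graph ID, data ID, and host ID are the ones from this thread, and the row with task_item_id 0 stands in for a graph item that has no data source.

```python
import sqlite3

QUERY = """
SELECT DISTINCT dtr.local_data_id, dl.host_id
FROM graph_templates_item AS gti
LEFT JOIN data_template_rrd AS dtr ON gti.task_item_id = dtr.id
LEFT JOIN data_local AS dl ON dl.id = dtr.local_data_id
WHERE gti.local_graph_id = ?
AND dtr.local_data_id IS NOT NULL
"""

def data_sources_for_graph(conn, local_graph_id):
    """Return (local_data_id, host_id) rows for one graph."""
    # The graph id is bound to the ? placeholder instead of being
    # pasted into the SQL text.
    return conn.execute(QUERY, (local_graph_id,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_templates_item (
    id INTEGER PRIMARY KEY, local_graph_id INTEGER, task_item_id INTEGER);
CREATE TABLE data_template_rrd (id INTEGER PRIMARY KEY, local_data_id INTEGER);
CREATE TABLE data_local (id INTEGER PRIMARY KEY, host_id INTEGER);
INSERT INTO data_local VALUES (14188, 864);
INSERT INTO data_template_rrd VALUES (1, 14188), (2, 14188);
INSERT INTO graph_templates_item VALUES (1, 13841, 1), (2, 13841, 2), (3, 13841, 0);
""")

print(data_sources_for_graph(conn, 13841))  # prints [(14188, 864)]
```

In this toy data, two data source items share one local_data_id, so DISTINCT collapses them into a single row.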

cigamit commented 6 years ago

Lastly, on line 155 add the following: cacti_log($url);. Post the cacti.log for that one entry, removing the hostname from the URL.

g1augusto commented 6 years ago

> That is unusual. And not expected. How about the more complicated one? What template is this?

This is a Cisco Memory template. I also tried the SNMP interface traffic template shipped with CACTI and got the same outcome.

> Run this query too, post the results:

MariaDB [cacti]> SELECT DISTINCT dtr.local_data_id, dl.host_id FROM graph_templates_item AS gti LEFT JOIN data_template_rrd AS dtr ON gti.task_item_id=dtr.id LEFT JOIN data_local AS dl ON dl.id=dtr.local_data_id WHERE gti.local_graph_id = 13841 AND dtr.local_data_id IS NOT NULL;
+---------------+---------+
| local_data_id | host_id |
+---------------+---------+
|         14188 |     864 |
+---------------+---------+
1 row in set (0.01 sec)

> lastly, on line 155 add the following: cacti_log($url). Post the cacti.log for that one entry removing the hostname from the url.

I added it to remote_agent.php, but I do not see any logs when I run a realtime graph. Is that the right file? I assumed it was that one.

cigamit commented 6 years ago

No, cmd_realtime.php, but don't bother. Checking the remote agent code now.

cigamit commented 6 years ago

Found the bug. Fixing now.

cigamit commented 6 years ago

I think you will be happy with my last commit.

g1augusto commented 6 years ago

Hi,

Yes I am happy :) thank you, I tested with several remote pollers and now it's working fine.

NOTE: I captured a tcpdump during the realtime polling and noticed that it still queries each OID in its own SNMP request, rather than collecting them into a single SNMP request with multiple OIDs based on the max OID value in the device settings:

image

Let me know if this is an improvement that you wish to implement, if you think it's not worth it we can close this incident.
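To make the suggestion concrete, here is a minimal sketch of the batching step only: splitting a graph's OIDs into groups no larger than the device's max OID setting, so that each group could go into one get-request. `chunk_oids` and the placeholder OID names are hypothetical, and no SNMP traffic is sent; this is not how Cacti currently behaves.

```python
def chunk_oids(oids, max_oids):
    """Split a list of OIDs into batches of at most max_oids each."""
    if max_oids < 1:
        raise ValueError("max_oids must be at least 1")
    return [oids[i:i + max_oids] for i in range(0, len(oids), max_oids)]

# Nine placeholder OIDs, as for the 9-item IP SLA graph.
oids = ["oid_%d" % n for n in range(1, 10)]

for batch in chunk_oids(oids, 5):
    # Each batch would become one multi-OID get-request.
    print(len(batch), batch)
```

With 9 OIDs and a max of 5, this gives 2 requests instead of 9.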

netniV commented 6 years ago

I know @cigamit is handling this but, from what I know, all SNMP gets are done in a single request/response pattern at the moment. If the OIDs are consecutive, they could be converted to a walk, with a grep to filter, but since we don't know what OIDs a device can return, that may take longer than individual gets. Plus, you would then have to be careful when filtering the results, as it could be prone to errors.