Bar values do not correspond with searches

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?

Query something like a firewall log:

host:172.16.0.1 FIREWALL_CONNECTION_END.dstip=172.16.0.2 +"Connection timeout" limit:1000

Select Report On, All Classes, Hour

A bar graph will be generated with values; however, when you hover over the 
graph and click on a bar, the number of results returned does not match the 
value shown for that bar. This isn't caused by the limit on the number of 
results returned, as in my case there were few.

Sometimes clicking on a bar returns 0 results once or twice, but then the 
third time the data is returned. When this happens, the 0 result comes back 
almost immediately, as if it isn't really trying to search.

Below is an example of the bar graph value vs. the results value (when it did 
return results). I don't see a pattern, but maybe you do:

Beginning at 2012-06-14 18:00:00, descending:

Bar graph:result of clicking on that bar

46:47
2:13
16:9
3:10
47:20
42:45
60:56
47:49
35:47
19:24
27:20
23:28
21:23
21:18
31:29
17:22
24:22
17:28
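
To make the size of the mismatch easier to see, here is a minimal Python sketch that tallies the pairs listed above. It only restates the numbers from this report; the variable names are mine and nothing here comes from ELSA itself.

```python
# (bar value, result count after clicking that bar), copied from the list above
pairs = [
    (46, 47), (2, 13), (16, 9), (3, 10), (47, 20), (42, 45),
    (60, 56), (47, 49), (35, 47), (19, 24), (27, 20), (23, 28),
    (21, 23), (21, 18), (31, 29), (17, 22), (24, 22), (17, 28),
]

# Per-hour difference: clicked result count minus the bar value
diffs = [clicked - bar for bar, clicked in pairs]

bar_total = sum(bar for bar, _ in pairs)
clicked_total = sum(clicked for _, clicked in pairs)
exact_matches = sum(1 for d in diffs if d == 0)
largest_gap = max(diffs, key=abs)

print(bar_total, clicked_total, exact_matches, largest_gap)
# prints: 498 510 0 -27
```

None of the 18 hours match exactly, yet the two totals are only 12 apart. That might mean the same events are being counted in different hour buckets, but the sample is too small to say for sure.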

What is the expected output? What do you see instead?

I expect the values to match. I wonder which one is accurate.

What version of the product are you using? On what operating system?

Latest build as of 6/14; RHEL (actually Oracle Unbreakable (cough) Linux)

Please provide any additional information below.

Original issue reported on code.google.com by lib...@gmail.com on 15 Jun 2012 at 12:50

GoogleCodeExporter commented 8 years ago

I too have occasionally seen this inconsistency, but I haven't been able to 
reproduce it reliably enough to debug. Can you tell me a little about your 
setup: do you have multiple nodes, and are they fairly busy? When you get these 
results, is there anything in web.log indicating that a node couldn't be reached?

Original comment by mchol...@gmail.com on 15 Jun 2012 at 1:30

Changed state: Started

GoogleCodeExporter commented 8 years ago

Everything is on one box. It is fairly decent too--a 1U server class system 
with 8GB RAM and 2TB 15k RPM drives in a RAID 5. The OS is Oracle Unbreakable 
Linux (RHEL, really). The install is minimal. I'll check the web log when I 
have a chance at work.

Original comment by lib...@gmail.com on 16 Jun 2012 at 3:14

GoogleCodeExporter commented 8 years ago

Oh, and only maybe 100 EPS right now. I have two destinations in 
syslog-ng.conf, one for ELSA and one for the file system. The file system is 
ext4, and the database, etc. are on a dedicated LVM volume. I'm using the Chrome 
browser.

Original comment by lib...@gmail.com on 16 Jun 2012 at 3:17

GoogleCodeExporter commented 8 years ago

Another thing I have found that may be related: if I perform a search and get 
results, then go back to that same tab and hit 'Submit Query' again, I 
sometimes get a different number of results. For example, I had submitted a 
query that returned four results, then three for a few times, then four again.

Original comment by lib...@gmail.com on 24 Jun 2012 at 3:05

GoogleCodeExporter commented 8 years ago

Thanks for the additional info. I agree that this is related, and I think 
there are two issues. The first is a time format problem when using GMT on the 
backend for datetime math (which is why a report on "day" shows values at 
19:00). The second may be inconsistent results due to Sphinx swapping out 
indexes during consolidation, or because it does not have enough RAM. To help 
continue diagnosing this problem, can you tell me approximately how many logs 
per second this instance is processing and how much RAM it has?
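
As an illustration of the first issue, the following Python sketch shows how a "day" bucket computed in GMT can surface at 19:00 when displayed in local time. This is not ELSA's actual code: the function names are hypothetical, and the UTC-5 offset is an assumption chosen only because it reproduces the 19:00 symptom.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: a fixed UTC-5 local offset, picked to match the 19:00 symptom
LOCAL = timezone(timedelta(hours=-5))

def day_bucket_utc(ts):
    """Floor a timestamp to the start of its day using epoch (GMT) math."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % 86400, tz=timezone.utc)

def day_bucket_local(ts, tz):
    """Floor a timestamp to local midnight in the given time zone."""
    local = ts.astimezone(tz)
    return local.replace(hour=0, minute=0, second=0, microsecond=0)

event = datetime(2012, 6, 14, 18, 30, tzinfo=LOCAL)

# The GMT-based bucket starts at 00:00 GMT, which is 19:00 the previous day locally
print(day_bucket_utc(event).astimezone(LOCAL))
# The bucket a user would expect starts at local midnight
print(day_bucket_local(event, LOCAL))
```

If the chart groups events with one kind of bucket and the click-through search uses the other, the counts for a given bar would differ even though both are counting real events.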

Original comment by mchol...@gmail.com on 24 Jun 2012 at 6:05

GoogleCodeExporter commented 8 years ago

I have 16GB RAM and average maybe around 40 EPS right now, which has peaked at 
times to about 200 EPS.

Original comment by lib...@gmail.com on 24 Jun 2012 at 6:55

GoogleCodeExporter commented 8 years ago

OK, the relatively low event rate is probably the reason that you're seeing 
inconsistencies. ELSA is designed for high-volume processing. The good news is 
that I'm almost done bug testing the new code, which will handle low-volume log 
rates much more gracefully, so this should be fixed when I release it shortly.

Original comment by mchol...@gmail.com on 24 Jun 2012 at 7:03

GoogleCodeExporter commented 8 years ago

I am anticipating 400-500 EPS when ELSA is rolled out to production, with 
spikes up to 2000-3000. Is the new version a separate code base or is it simply 
one which will scale from low EPS to high EPS gracefully?

Original comment by lib...@gmail.com on 24 Jun 2012 at 7:11

GoogleCodeExporter commented 8 years ago

The new version adds a standard feature to "downshift" into a realtime mode 
whenever the events per second are below a certain threshold.  I'm testing it 
thoroughly because it will be the default mode.  I have observed that it will 
comfortably handle 1k events per second in realtime mode, so that should cover 
your circumstances.

Original comment by mchol...@gmail.com on 24 Jun 2012 at 7:26

GoogleCodeExporter commented 8 years ago

Fixed in rev 330.  Unfortunately, it was a bug on the backend, so while the 
code update is normal, to actually get the fix working, you need to blow away 
the /usr/local/etc/sphinx.conf file on the logging nodes and then run "perl 
elsa.pl -on" to regenerate the new sphinx.conf file.  Then restart searchd.  
New indexes will have the correct time values in the reports and charts (all 
logs already have correct time values in the standard console listing).

Original comment by mchol...@gmail.com on 2 Jul 2012 at 8:52

Changed state: Fixed

SafeAF / enterprise-log-search-and-archive

Bar values do not correspond with searches #33