Closed Marusyk closed 5 months ago
Add more benchmarks
I thought it would be a good idea if the package could change the logs of a specific structure, let's say JSON logs, to another structure like XML.
A limit on how hard grok.net tries to find a match. A limit on how long it takes or on how many times it backtracks.
Here's an example:
I'm writing a parser for the log produced by pihole. One of the lines has a security key on the end:
Jul 6 08:02:15 dnsmasq[24646]: reply 10dkim1._domainkey.email.grilld.com.au is Ef1gTSwbHsMQ/VBR7v58ALwSUL173mtLvgR9/jqJmjLlJn6NcgJsPWFOO2Kh
My template is:
%{LOGTIME:Timestamp} %{LOGPROG:Prog}: %{LOGQUERY:Action} %{LOGDOMAIN:Domain} %{LOGDIRECTION:Direction} %{LOGEOL:EndOfLine}
And my definitions are:
LOGTIME ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s{1,2}[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2} LOGPROG dnsmasq[\d{1,}] LOGQUERY config|regex blacklisted|exactly blacklisted|special domain|forwarded|reply|cached|gravity blocked|query[(A{1,4}|HTTPS|SOA|TXT|PTR|SRV|NAPTR|NS|type=\d{1,5})] LOGDIRECTION (from|is|to) LOGDOMAIN ((?:[A-Z0-9a-z-_~:\/?#[]@!\$&'()*+,:%=]).?) LOGIPV4ELEMENT [0-9]{1,3} LOGIPV6ELEMENT ([0-9]|[a-f]|[A-F]){0,4}:{1,2} LOGIPV4 %{LOGIPV4ELEMENT}.%{LOGIPV4ELEMENT}.%{LOGIPV4ELEMENT}.%{LOGIPV4ELEMENT} LOGIPV6 %{LOGIPV6ELEMENT}{1,8} LOGIP %{LOGIPV4}|%{LOGIPV6} LOGOTHER (\<HTTPS>|NXDOMAIN|NODATA|\<CNAME>|\<SRV>|\<NAPTR>|NODATA-IPv4) LOGRSA v=DKIM1.|; p=.|g=*.*|blocked during CNAME inspection LOGEOL %{LOGIP}|%{LOGOTHER}|%{LOGRSA} # LOGEOL refers to the log item's End Of Line
In my program this line stalls on parse. I'm not sure what it's doing but I let it run for 20 minutes without a result.
The page: grokconstructor.appspot.com also stalls and gives the following error:
Error message: net.stoerr.grokconstructor.TimeoutException: The request execution took too long. Sorry, we have to give up here. It's hard to give advice here: probably your regular expression leads to too much backtracking when it fails.
From this I assume that the algorithm is still running but has too many potential matches to return in a reasonable amount of time hence my suggestion.
BTW: Excellent library. Very useful.
@nickoppen thank you very much for your feedback. As soon as I have free time, I'll take a look (busy last few months) If you find out where the issue is, create a PR with a fix. I will really appreciate it
Here's what I propose...
Change Grok.Parse(string) to Grok.Parse(string, int maxIterations = MaxInt, long timerTicks = OneDay)
maxIterations would be the maximum number of iterations through the inner loop and timerTicks would limit the amount of time that parse could run.
It could be called with none, either or both depending on what the caller wanted.
I've written a short proof of concept here.
Hi @nickoppen Could you provide a repo with code to reproduce the behavior?
I've started putting together a test bed. I'll push it when it is ready.
Hi @Marusyk
I got the test bed running and ran the troublesome data through it. Tracing the code I realised that my initial thoughts how to do a time limit won't work. The code that stalls is the call to _compiledRegex.Matches(text) .
I've written another short solution that could be modified to use a timer to cancel the stalled routine if it responds to a CancellationToken. I don't know if it does so I'll try it out.
Cheers,
nick
Hi Roman,
Solution 3 did the trick and only in a couple of lines of code.
I didn't get the pull/push thing right so here's what I've done.
public GrokResult Parse(string text, long ticks = 0)
{
if (_compiledRegex == null)
{
ParseGrokString();
}
var grokItems = new List<GrokItem>();
Thread worker = new Thread(() => { var matches = _compiledRegex.Matches(text);
foreach (Match match in matches)
{
foreach (string groupName in _groupNames)
{
if (groupName != "0")
{
grokItems.Add(_typeMaps.ContainsKey(groupName)
? new GrokItem(groupName, MapType(_typeMaps[groupName], match.Groups[groupName].Value))
: new GrokItem(groupName, match.Groups[groupName].Value));
}
}
}
});
worker.Start();
worker.Join(new TimeSpan((ticks == 0) ? 864000000000 : ticks));
return new GrokResult(grokItems);
}
I've updated my grok test bed to make it more general. I'll do some more testing of the solution and let you know how it goes.
Sorry, I don't have time for this right now, however, I'm sure that specifying ticks or timer or iteration is not the best way. We can improve the work with regex, perhaps migrating to source generated regex
Hi.
No problem. If you can think of a better way I'm always interested in learning about different solutions.
I'm in no hurry either. My little mod has the characteristics I was after so I'll run with that for now.
nick
This project is a simple implementation of grok parser. It would be great to have some suggestions how it can be improved. Maybe some ideas for new features.