Marusyk / grok.net

.NET implementation of the grok 📝
MIT License
287 stars 55 forks source link

Tips, improvements, suggestions #34

Closed Marusyk closed 5 months ago

Marusyk commented 2 years ago

This project is a simple implementation of grok parser. It would be great to have some suggestions how it can be improved. Maybe some ideas for new features.

Marusyk commented 2 years ago

See https://github.com/Marusyk/grok.net/discussions/28

Marusyk commented 2 years ago

Add more benchmarks

Marusyk commented 2 years ago

See https://github.com/Marusyk/grok.net/pull/53

anjolaoluwaakindipe commented 1 year ago

I thought it would be a good idea if the package could change the logs of a specific structure, let's say JSON logs, to another structure like XML.

nickoppen commented 1 year ago

A limit on how hard grok.net tries to find a match. A limit on how long it takes or on how many times it backtracks.

Here's an example:

I'm writing a parser for the log produced by pihole. One of the lines has a security key on the end:

Jul 6 08:02:15 dnsmasq[24646]: reply 10dkim1._domainkey.email.grilld.com.au is Ef1gTSwbHsMQ/VBR7v58ALwSUL173mtLvgR9/jqJmjLlJn6NcgJsPWFOO2Kh

My template is:

%{LOGTIME:Timestamp} %{LOGPROG:Prog}: %{LOGQUERY:Action} %{LOGDOMAIN:Domain} %{LOGDIRECTION:Direction} %{LOGEOL:EndOfLine}

And my definitions are:

LOGTIME ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s{1,2}[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2} LOGPROG dnsmasq[\d{1,}] LOGQUERY config|regex blacklisted|exactly blacklisted|special domain|forwarded|reply|cached|gravity blocked|query[(A{1,4}|HTTPS|SOA|TXT|PTR|SRV|NAPTR|NS|type=\d{1,5})] LOGDIRECTION (from|is|to) LOGDOMAIN ((?:[A-Z0-9a-z-_~:\/?#[]@!\$&'()*+,:%=]).?) LOGIPV4ELEMENT [0-9]{1,3} LOGIPV6ELEMENT ([0-9]|[a-f]|[A-F]){0,4}:{1,2} LOGIPV4 %{LOGIPV4ELEMENT}.%{LOGIPV4ELEMENT}.%{LOGIPV4ELEMENT}.%{LOGIPV4ELEMENT} LOGIPV6 %{LOGIPV6ELEMENT}{1,8} LOGIP %{LOGIPV4}|%{LOGIPV6} LOGOTHER (\<HTTPS>|NXDOMAIN|NODATA|\<CNAME>|\<SRV>|\<NAPTR>|NODATA-IPv4) LOGRSA v=DKIM1.|; p=.|g=*.*|blocked during CNAME inspection LOGEOL %{LOGIP}|%{LOGOTHER}|%{LOGRSA} # LOGEOL refers to the log item's End Of Line

In my program this line stalls on parse. I'm not sure what it's doing but I let it run for 20 minutes without a result.

The page: grokconstructor.appspot.com also stalls and gives the following error:

Error message: net.stoerr.grokconstructor.TimeoutException: The request execution took too long. Sorry, we have to give up here. It's hard to give advice here: probably your regular expression leads to too much backtracking when it fails.

From this I assume that the algorithm is still running but has too many potential matches to return in a reasonable amount of time hence my suggestion.

BTW: Excellent library. Very useful.

Marusyk commented 1 year ago

@nickoppen thank you very much for your feedback. As soon as I have free time, I'll take a look (busy last few months) If you find out where the issue is, create a PR with a fix. I will really appreciate it

nickoppen commented 1 year ago

Here's what I propose...

Change Grok.Parse(string) to Grok.Parse(string, int maxIterations = MaxInt, long timerTicks = OneDay)

maxIterations would be the maximum number of iterations through the inner loop and timerTicks would limit the amount of time that parse could run.

It could be called with none, either or both depending on what the caller wanted.

I've written a short proof of concept here.

Marusyk commented 1 year ago

Hi @nickoppen Could you provide a repo with code to reproduce the behavior?

nickoppen commented 1 year ago

I've started putting together a test bed. I'll push it when it is ready.

nickoppen commented 1 year ago

Hi @Marusyk

I got the test bed running and ran the troublesome data through it. Tracing the code I realised that my initial thoughts how to do a time limit won't work. The code that stalls is the call to _compiledRegex.Matches(text) .

I've written another short solution that could be modified to use a timer to cancel the stalled routine if it responds to a CancellationToken. I don't know if it does so I'll try it out.

Cheers,

nick

nickoppen commented 1 year ago

Hi Roman,

Solution 3 did the trick and only in a couple of lines of code.

I didn't get the pull/push thing right so here's what I've done.

    public GrokResult Parse(string text, long ticks = 0)
    {
        if (_compiledRegex == null)
        {
            ParseGrokString();
        }

        var grokItems = new List<GrokItem>();

        Thread worker = new Thread(() => { var matches = _compiledRegex.Matches(text);
                                            foreach (Match match in matches)
                                            {
                                                foreach (string groupName in _groupNames)
                                                {
                                                    if (groupName != "0")
                                                    {
                                                        grokItems.Add(_typeMaps.ContainsKey(groupName)
                                                            ? new GrokItem(groupName, MapType(_typeMaps[groupName], match.Groups[groupName].Value))
                                                            : new GrokItem(groupName, match.Groups[groupName].Value));
                                                    }
                                                }
                                            }
                                        });
        worker.Start();
        worker.Join(new TimeSpan((ticks == 0) ? 864000000000 : ticks));

        return new GrokResult(grokItems);
    }
nickoppen commented 1 year ago

I've updated my grok test bed to make it more general. I'll do some more testing of the solution and let you know how it goes.

Marusyk commented 1 year ago

Sorry, I don't have time for this right now, however, I'm sure that specifying ticks or timer or iteration is not the best way. We can improve the work with regex, perhaps migrating to source generated regex

nickoppen commented 1 year ago

Hi.

No problem. If you can think of a better way I'm always interested in learning about different solutions.

I'm in no hurry either. My little mod has the characteristics I was after so I'll run with that for now.

nick