AdaptiveStep commented 1 month ago

How to use the parser?

This is more of a "conversation thread" rather than a bug-report.

I posted these questions on the "logpai" issue but will post them here again.

1: Drain3 has persistence with File/Redis/Kafka. How do I set up persistence with Brain? Is it possible?

2: Drain3 also has livestream inputs with stdin. I cannot find any documentation on this for Brain. Is it possible to do with Brain? According to the documentation, Brain only parses entire log files once. But how do I set up an actual pipeline so that JSON objects are sent to other APIs (for db, analysis, visualization, etc.)?

3: It is standard practice in the industry to use a collector such as Vector, Logstash/Fluent, or similar. How can Brain be used with these? Are we supposed to containerize Python/Brain in a deployment, or what?

I am not sure how to actually set this up so that things (log lines) go through the "Brain parsing processor", but I am sure that a really good way would be something similar to the way Drain3 is set up with Kafka (so that it can easily scale and be persistent).

TL;DR:

How to set up Brain in an existing infrastructure?

gaiusyu commented 1 month ago

Thank you for your interest in my work. Similar to Drain, Brain is a research project. Drain3 is more practical and tailored for industry. Therefore, Brain might lack some documentation to guide practitioners in deploying it in the industry, and we apologize for that. As mentioned in our paper, Brain supports BATCH PROCESSING rather than real-time parsing. Although we have some ideas for real-time parsing with Brain (the longest common pattern), they have not been validated yet.
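A minimal sketch of how a batch-only parser could still sit behind a line stream: collect lines into fixed-size batches, parse each batch independently, and emit one JSON object per line. The `parse_batch` callable is a hypothetical stand-in (it is not Brain's actual interface), and `fake_parse_batch` exists only to make the example run.

```python
import json
from typing import Callable, Iterable, Iterator, List


def batches(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group a stream of lines into lists of at most `size` lines."""
    batch: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly shorter, batch


def run_pipeline(
    lines: Iterable[str],
    parse_batch: Callable[[List[str]], List[str]],
    size: int = 1000,
) -> Iterator[str]:
    """Parse each batch on its own and yield one JSON string per input line."""
    for batch in batches(lines, size):
        for line, template in zip(batch, parse_batch(batch)):
            yield json.dumps({"message": line, "template": template})


def fake_parse_batch(batch: List[str]) -> List[str]:
    # Placeholder for a real batch parser (assumption): masks digit-only tokens
    return [
        " ".join("<*>" if tok.isdigit() else tok for tok in line.split())
        for line in batch
    ]
```

Passing `sys.stdin` as `lines` would let a collector pipe log lines into the process; each batch still gets its own model, so no state carries over between batches.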

AdaptiveStep commented 3 weeks ago

Ok, thank you for the reply! No need to apologize, friend. Your work is really impressive and I am extremely impressed by how far you have gotten so far, and I am looking forward to more auto-parsing tech! Now, please forgive me for the long comment below, but I think these architectural considerations might be useful for you when you develop Brain/Drain/Logparser features.

In my investigations I've managed to discover some things and have these ideas:

Let's assume that Drain3 is used. The first question is: how do you personally suggest that the template parsing is done (from an architectural point of view)? Do we re-deploy the entire "Drain3 stack" for every app, so that the templates won't get intermingled? Is this useful, or is it avoidable with your "longest pattern" strategy? In case it is useful to separate formats, it might be a good idea to implement some kind of "namespace" or "label" for each format.

Usually the formats follow an extremely consistent template within the app itself, so you only need to separate them per app (provided the template isn't ruined by multi-line logs). But let's ignore multi-line logs for now, since they are a separate issue. (By the way, there are more specific technologies for handling multi-line logs, such as auto-instrumentation and OTEL tracing, which we won't get into here.) Therefore, assume that every log is a nice one-liner.

idea1: Opensearch

I was thinking of saving the state in an index in OpenSearch (which is where all the parsed data will go as well), since this would simplify scaling, retention, generalization, anomaly detection, visualization, and so on. And then "sync with state every 10 minutes or so".

Setting up the "state location" has been a little difficult and complicated. I tried to create a new "OpensearchPersistence" but failed. A similar idea would perhaps be to set up an OpenSearch plugin on the OpenSearch Java application itself. This way a "Drain3 GUI" would integrate with the rest of the analytics for free! Building plugins for OpenSearch is, however, a monolithic design pattern which will depend heavily on the stability of OpenSearch.
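As far as I can tell from the Drain3 source, a persistence handler only has to provide two methods, `save_state(state)` and `load_state()`, where the state is a bytes snapshot. Below is a minimal sketch of that shape under stated assumptions: the `store` dict is a stand-in for an index client, the class name and document id are hypothetical, and no real OpenSearch calls are shown.

```python
import base64
from typing import Dict, Optional


class DocumentStorePersistence:
    """save_state/load_state pair backed by a key-value document store."""

    def __init__(self, store: Dict[str, dict], doc_id: str):
        self.store = store    # stand-in for an index client (assumption)
        self.doc_id = doc_id  # one document per parser state

    def save_state(self, state: bytes) -> None:
        # The snapshot is bytes; base64 keeps it safe inside a JSON document
        encoded = base64.b64encode(state).decode("ascii")
        self.store[self.doc_id] = {"state": encoded}

    def load_state(self) -> Optional[bytes]:
        doc = self.store.get(self.doc_id)
        if doc is None:
            return None  # no snapshot yet, so the miner starts empty
        return base64.b64decode(doc["state"])
```

In a real setup the class would presumably subclass Drain3's `PersistenceHandler` and replace the dict reads and writes with index and get requests against the cluster.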

idea2: Kafka

Since the documentation is using Kafka at the moment, I have a simpler question: is it possible to use a unique Kafka topic for each log source? As it seems, for now, I would need a completely new Kafka DB for every application in that case. Could that be a problem? Or is Redis the recommended way? Let's say I have logs coming from nginx (and logs coming from Windows, sshd, and some other apps). How to set this up? The demos only demonstrate how to handle a single format at a time (a single app). Handling multiple apps is an important piece of this puzzle.

Idea3: Cron

Maybe a cron job could be in order, to simply run Brain on specific log files every x minutes and let it tail certain log files? In this case Drain would need to remember the line position it tailed to for every log file.
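A minimal sketch of the "remember the position" part, assuming one small JSON offset file per log file (the file layout and function name are hypothetical). Each cron run reads only the complete lines appended since the previous run and then stores the new byte offset.

```python
import json
import os
from typing import List


def read_new_lines(log_path: str, offset_path: str) -> List[str]:
    """Return complete lines appended to log_path since the previous call."""
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = json.load(f)["offset"]

    if offset > os.path.getsize(log_path):
        offset = 0  # file shrank: assume it was rotated or truncated

    lines = []
    with open(log_path, "rb") as f:
        f.seek(offset)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # partial line still being written; pick it up next run
            lines.append(raw.decode("utf-8", errors="replace").rstrip("\r\n"))
            offset += len(raw)

    # Persist the position so the next run resumes where this one stopped
    with open(offset_path, "w") as f:
        json.dump({"offset": offset}, f)
    return lines
```

The returned list can then be handed to whichever parser runs in that job. The rotation check only notices a file that became shorter, so a rotated file that has already grown past the stored offset would need an extra inode or checksum check.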

idea4: Namespace

Consider this. In the documentation, Drain3 parses like this:

...
result = template_miner.add_log_message(line)
...

But what if we could simply add a namespace/label like this, and keep all the app formats in one state while separating them by namespace?

...
    # Proposed API: returns a function here, not a dictionary!
    # If namespace is not provided, it will use the "default" namespace.
    result = template_miner.add_log_message(line, namespace="nginx")

    # Try getting params
    template = result("template_mined", "nginx")
    template_miner.extract_parameters(template, line)
...
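Short of changing the library, a similar effect could be approximated with a thin wrapper that keeps one independent miner per namespace. This is only a sketch under stated assumptions: `NamespacedMiner` and `CountingMiner` are hypothetical names, and the wrapper returns whatever the underlying miner returns rather than the function-style result proposed above.

```python
from typing import Callable, Dict


class NamespacedMiner:
    """Keep one independent miner per namespace, created on first use."""

    def __init__(self, miner_factory: Callable[[str], object]):
        self._factory = miner_factory
        self._miners: Dict[str, object] = {}

    def miner_for(self, namespace: str = "default"):
        if namespace not in self._miners:
            self._miners[namespace] = self._factory(namespace)
        return self._miners[namespace]

    def add_log_message(self, line: str, namespace: str = "default"):
        # Delegate to the namespace's own miner, so templates never mix
        return self.miner_for(namespace).add_log_message(line)


class CountingMiner:
    """Stand-in miner for the example only; not part of Drain3."""

    def __init__(self, namespace: str):
        self.namespace = namespace
        self.seen = 0

    def add_log_message(self, line: str) -> dict:
        self.seen += 1
        return {"namespace": self.namespace, "seen": self.seen}


miners = NamespacedMiner(CountingMiner)
print(miners.add_log_message("GET /index.html", namespace="nginx"))
```

With Drain3 installed, the factory could be something like `lambda ns: TemplateMiner(FilePersistence(f"drain3_{ns}.bin"))`, which would give each namespace its own state file.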

Please let me know if you think these are bad ideas, because these are the paths and ideas I'm considering when moving forward with drain3! And forgive me for the long post!

Thank you!

gaiusyu commented 3 weeks ago

Apologies for my late response. I took some time to understand the issues you mentioned, because I really lack some practical experience and do not know much about Drain3. Here are my thoughts on the questions you raised; please feel free to point out any misunderstandings, as I am very willing to discuss this further with you:

All your concerns are about persistence, right? As I understand it, the background of this issue is about real-time parsers because they continuously update the established model, such as Drain's tree. If you can persistently store the tree, it will save the overhead of parsing each time and might improve accuracy. Are you facing the issue of mixed logs from different apps now? I think batch processing with Brain does not encounter this problem, because Brain builds an independent model for each batch and completes the parsing, so there is no correlation between batches. For real-time parsing, as you mentioned, I think the namespace is a good solution, provided the log formats of different apps are known. You can use regex to determine which log format the input log matches, and then fill in the corresponding parameters in parse(line, namespace="nginx") to get and update the corresponding model.
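A minimal sketch of the regex routing step, assuming two illustrative signatures. The patterns and the `detect_namespace` name are hypothetical and would need tuning for real log sources.

```python
import re

# Hypothetical signatures; real deployments would tune these per source
FORMAT_PATTERNS = [
    # Access-log style: host ident user [timestamp] "METHOD path proto" status
    ("nginx", re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[A-Z]+ \S+ [^"]*" \d{3} ')),
    # Syslog style line produced by the sshd daemon
    ("sshd", re.compile(r"\bsshd\[\d+\]: ")),
]


def detect_namespace(line: str, default: str = "default") -> str:
    """Return the name of the first pattern that matches, else `default`."""
    for name, pattern in FORMAT_PATTERNS:
        if pattern.search(line):
            return name
    return default


line = "Oct 10 13:55:36 host sshd[1234]: Accepted password for root"
print(detect_namespace(line))
```

The detected name can then be passed as the namespace argument, so each line reaches and updates only the model for its own format.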

After briefly reading the Drain3 documentation, I think Drain3 would need a completely new Kafka DB for every application in that case. I believe creating different models for logs from different apps is inevitable since the log templates of different apps are likely completely different. For Brain's "Longest pattern," it is also entirely different, and there is no need to intermingle them, as intermingling would increase search time and reduce efficiency. For the models of Drain and Drain3, it's also unnecessary to intermingle the models of different apps, as I think it would increase search time and reduce accuracy.

gaiusyu / Brain

How to use Brain in logging pipeline? #7