
= drain-java

image:https://github.com/bric3/drain-java/actions/workflows/gradle.yml/badge.svg[Java CI with Gradle,link=https://github.com/bric3/drain-java/actions/workflows/gradle.yml]

== Introduction

drain-java is a continuous log template miner: for each log message it extracts tokens and groups them into clusters of tokens. As new log messages are added, drain-java identifies similar tokens and either updates the matching cluster with the new template, or simply creates a new token cluster. Each time a cluster is matched, a counter is incremented.

These clusters are stored in a prefix tree, which is somewhat similar to a trie, except that the tree has a fixed depth in order to avoid long traversals. Avoiding deep trees also helps to keep the tree balanced.
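
To make this concrete, here is a minimal sketch of how two messages that differ by a single token end up in the same cluster. It reuses the builder API shown later in this README; the package name in the import and the assumption that `clusters()` returns an iterable collection are mine, not verified:

[source, java]
----
import io.github.bric3.drain.Drain; // assumed package, derived from the artifact coordinates

public class DrainIntroExample {
    public static void main(String[] args) {
        var drain = Drain.drainBuilder()
                         .additionalDelimiters("_")
                         .depth(4)
                         .build();

        // two messages differing only in one token...
        drain.parseLogMessage("Connection closed by 173.234.31.186 [preauth]");
        drain.parseLogMessage("Connection closed by 187.141.143.180 [preauth]");

        // ...should land in a single cluster whose template replaces the varying
        // token with the parameter marker: "Connection closed by <*> [preauth]",
        // with a match count of 2
        drain.clusters().forEach(System.out::println);
    }
}
----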

== Usage

First, https://foojay.io/almanac/jdk-11/[Java 11] is required to run drain-java.

=== As a dependency

You can consume drain-java as a dependency in your project via `io.github.bric3.drain:drain-java-core`. Currently only https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/bric3/drain/[snapshots] are available, by adding this repository:

[source, kotlin]
----
repositories {
    maven {
        url = uri("https://s01.oss.sonatype.org/content/repositories/snapshots/")
    }
}
----
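
Then declare the dependency on the core module. The snapshot version below is illustrative (it matches the jar version used elsewhere in this README) and may differ from the latest published snapshot:

[source, kotlin]
----
dependencies {
    // hypothetical snapshot version, check the snapshot repository for the latest
    implementation("io.github.bric3.drain:drain-java-core:0.1.0-SNAPSHOT")
}
----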

=== From command line

Since this tool is not yet released, it needs to be built locally. Also, the built jar is not yet very user-friendly; since it's not a finished product, anything could change.

.Example usage
[source, shell]
----
$ ./gradlew build
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -h

tail - drain
Usage: tail [-dfhV] [--verbose] [-n=NUM]
            [--parse-after-str=FIXED_STRING_SEPARATOR]
            [--parser-after-col=COLUMN] FILE
      FILE                  log file
  -d, --drain               use DRAIN to extract log patterns
  -f, --follow              output appended data as the file grows
  -h, --help                Show this help message and exit.
  -n, --lines=NUM           output the last NUM lines, instead of the last 10;
                            or use -n 0 to output starting from beginning
      --parse-after-str=FIXED_STRING_SEPARATOR
                            when using DRAIN remove the left part of a log line
                            up to after the FIXED_STRING_SEPARATOR
      --parser-after-col=COLUMN
                            when using DRAIN remove the left part of a log line
                            up to COLUMN
  -V, --version             Print version information and exit.
      --verbose             Verbose output, mostly for DRAIN or errors

$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --version
Versioned Command 1.0
Picocli 4.6.3
JVM: 19 (Amazon.com Inc. OpenJDK 64-Bit Server VM 19+36-FR)
OS: Mac OS X 12.6 x86_64
----

By default, the tool acts similarly to `tail`: it outputs the file to stdout. It can follow a file if the `--follow` option is passed. However, when run with `--drain`, the tool classifies log lines using DRAIN and outputs the identified clusters. Note that this tool doesn't handle multiline log messages (like logs that contain a stacktrace).

On the SSH log data set, we can use it this way:

[source, shell]
----
$ java -jar build/libs/drain-java-1.0-SNAPSHOT-all.jar \
    -d \ <1>
    -n 0 \ <2>
    --parse-after-str "]: " \ <3>
    build/resources/test/SSH.log <4>
----

<1> Identify patterns in the log
<2> Start from the beginning of the file (otherwise it starts from the last 10 lines)
<3> Remove the left part of each log line (`Dec 10 06:55:46 LabSZ sshd[24200]: `), effectively ignoring variable elements like the time
<4> The log file

.log pattern clusters and their occurrences
[source]
--------
---- Done processing file. Total of 655147 lines, done in 1.588 s, 51 clusters <1>
0010 (size 140768): Failed password for <*> from <*> port <*> ssh2 <2>
0009 (size 140701): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0007 (size 68958): Connection closed by <*> [preauth]
0008 (size 46642): Received disconnect from <*> 11: <*> <*> <*>
0014 (size 37963): PAM service(sshd) ignoring max retries; <*> > 3
0012 (size 37298): Disconnecting: Too many authentication failures for <*> [preauth]
0013 (size 37029): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0011 (size 36967): message repeated <*> times: [ Failed password for <*> from <*> port <*> ssh2]
0006 (size 20241): Failed <*> for invalid user <*> from <*> port <*> ssh2
0004 (size 19852): pam unix(sshd:auth): check pass; user unknown
0001 (size 18909): reverse mapping checking getaddrinfo for <*> <*> failed - POSSIBLE BREAK-IN ATTEMPT!
0002 (size 14551): Invalid user <*> from <*>
0003 (size 14551): input userauth request: invalid user <*> [preauth]
0005 (size 14356): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*>
0018 (size 1289): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*>
0024 (size 952): fatal: Read from socket failed: Connection reset by peer [preauth]
...
--------
<1> 51 _types_ of logs were identified from 655147 lines in 1.588 s
<2> There were `140768` similar log messages matching this pattern, with `3` positions where the token is identified as a parameter `<*>`.

On the same data set, this Java implementation performed roughly 10 times faster than Drain3. As this implementation does not yet have masking, the mask configuration was removed from the Drain3 implementation for the comparison.

=== From Java

This tool is not yet intended to be used as a library, but for the curious the DRAIN algorithm can be used this way:

.Minimal DRAIN example
[source, java]
----
var drain = Drain.drainBuilder()
                 .additionalDelimiters("_")
                 .depth(4)
                 .build();

Files.lines(Paths.get("build/resources/test/SSH.log"), StandardCharsets.UTF_8)
     .forEach(drain::parseLogMessage);

// do something with the clusters
drain.clusters();
----

== Status

Pieces of the puzzle are coming in no particular order. I first bootstrapped the code from a simple Java file, then wrote a Java implementation of Drain. Here's what I would like to do next.

.Todo
- [ ] More unit tests
- [x] Wire things together
- [ ] More documentation
- [x] Implement _tail follow_ mode (currently in drain mode the whole file is read and stops once finished)
- [ ] In follow drain mode, dump clusters on forced exit (e.g. when hitting `ctrl`+`c`)
- [x] Start reading from the last x lines (like `tail -n 30`)
- [ ] Implement log masking (e.g. when a log contains an email or an IP address that may be considered private data)

.For later
- [ ] JSON message field extraction
- [ ] How to handle prefixes: dates, log level, etc.; possibly using masking
- [ ] Investigate markers with specific behavior, e.g. log level severity
- [ ] Investigate logs with stacktraces (likely multiline)
- [ ] Improve handling of very long lines
- [ ] Logback appender with micrometer counter

== Motivation

I was inspired by a https://sayr.us/log-pattern-recognition/logmine/[blog article from one of my colleagues on LogMine]; many thanks to him for doing the initial research and explaining the concepts. We were both impressed by the log pattern extraction of https://docs.datadoghq.com/logs/explorer/patterns/[Datadog's Log explorer], and his blog post sparked my interest. After some discussion, we saw that Drain was a bit superior to LogMine. Googling for Drain in Java didn't yield any results (although I certainly didn't search exhaustively), which triggered the idea to implement this algorithm in Java.

== References

The Drain port is mostly a port of https://github.com/IBM/Drain3[Drain3] done by IBM folks (_David Ohana_, _Moshik Hershcovitch_). IBM's Drain3 is a fork of the https://github.com/logpai/logparser[original work] done by the LogPai team, based on the paper by _Pinjia He_, _Jieming Zhu_, _Zibin Zheng_, and _Michael R. Lyu_.

_I didn't follow up on other contributors of these projects; reach out if you think you have been omitted._

For reference, here are the links I looked at:

* https://logparser.readthedocs.io/
* https://github.com/logpai/logparser
* https://github.com/IBM/Drain3
* https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf (a copy of this publication is accessible link:doc/pjhe_icws2017.pdf[here])