= drain-java
image:https://github.com/bric3/drain-java/actions/workflows/gradle.yml/badge.svg[Java CI with Gradle,link=https://github.com/bric3/drain-java/actions/workflows/gradle.yml]
== Introduction
drain-java is a continuous log template miner: for each log message it extracts
tokens and groups them into clusters of tokens. As new log messages are added,
drain-java identifies similar tokens and updates the matching cluster with the new template,
or simply creates a new token cluster. Each time a cluster is matched, a counter is
incremented.
These clusters are stored in a prefix tree, which is somewhat similar to a trie, but
here the tree has a fixed depth in order to avoid long tree traversals.
Avoiding deep trees also helps keep the tree balanced.
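
To make the idea concrete, here is a minimal sketch of a fixed-depth template miner. It is not the drain-java code: the class `MiniDrain`, its constructor parameters, the "token contains a digit" heuristic, and the similarity threshold are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative fixed-depth template miner; NOT the drain-java implementation. */
public class MiniDrain {
    static final String WILDCARD = "<*>";

    /** A cluster holds the current template and how many messages matched it. */
    public static final class Cluster {
        public final String[] template;
        public int size = 1;

        Cluster(String[] tokens) { this.template = tokens.clone(); }

        @Override public String toString() {
            return "(size " + size + "): " + String.join(" ", template);
        }
    }

    /** Tree node: inner nodes use children, leaves use clusters. */
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        final List<Cluster> clusters = new ArrayList<>();
    }

    private final Node root = new Node();
    private final List<Cluster> allClusters = new ArrayList<>();
    private final int prefixTokens;   // leading tokens used as tree keys (bounds the depth)
    private final double threshold;   // minimum share of identical tokens to join a cluster

    public MiniDrain(int prefixTokens, double threshold) {
        this.prefixTokens = prefixTokens;
        this.threshold = threshold;
    }

    public List<Cluster> clusters() { return allClusters; }

    public Cluster parse(String message) {
        String[] tokens = message.trim().split("\\s+");
        // Level 1: messages with a different token count never share a cluster.
        Node node = root.children.computeIfAbsent("len=" + tokens.length, k -> new Node());
        // Next levels: descend on the leading tokens; tokens holding digits are
        // treated as likely parameters and routed through a wildcard branch.
        for (int i = 0; i < Math.min(prefixTokens, tokens.length); i++) {
            String key = tokens[i].chars().anyMatch(Character::isDigit) ? WILDCARD : tokens[i];
            node = node.children.computeIfAbsent(key, k -> new Node());
        }
        // Leaf: pick the most similar cluster, if any.
        Cluster best = null;
        double bestScore = -1;
        for (Cluster candidate : node.clusters) {
            double score = similarity(candidate.template, tokens);
            if (score > bestScore) { bestScore = score; best = candidate; }
        }
        if (best == null || bestScore < threshold) {
            best = new Cluster(tokens);
            node.clusters.add(best);
            allClusters.add(best);
            return best;
        }
        // Matched: positions that differ become wildcards, and the counter goes up.
        for (int i = 0; i < tokens.length; i++) {
            if (!best.template[i].equals(tokens[i])) best.template[i] = WILDCARD;
        }
        best.size++;
        return best;
    }

    /** Share of positions where the template and the message hold the same token. */
    private static double similarity(String[] template, String[] tokens) {
        int same = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (template[i].equals(tokens[i])) same++;
        }
        return (double) same / tokens.length;
    }

    public static void main(String[] args) {
        MiniDrain drain = new MiniDrain(2, 0.5);
        // Made-up messages shaped like SSH daemon logs
        drain.parse("Failed password for root from 192.0.2.1 port 22 ssh2");
        drain.parse("Failed password for admin from 192.0.2.7 port 2222 ssh2");
        drain.parse("Connection closed by 192.0.2.9 [preauth]");
        drain.clusters().forEach(System.out::println);
    }
}
```

Because the number of tree levels is set up front, a lookup costs a bounded number of map accesses regardless of message length. Only the final leaf requires comparing the message against candidate templates.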
== Usage
First, https://foojay.io/almanac/jdk-11/[Java 11] is required to run drain-java.
=== As a dependency
You can consume drain-java as a dependency in your project: `io.github.bric3.drain:drain-java-core`.
Currently only https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/bric3/drain/[snapshots]
are available by adding this repository.
[source, kotlin]
=== From command line
Since this tool is not yet released, it needs to be built locally.
Also, the built jar is not yet very user-friendly. Since it's not a finished
product, anything could change.
.Example usage
[source, shell]
----
$ ./gradlew build
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -h
tail - drain
Usage: tail [-dfhV] [--verbose] [-n=NUM]
            [--parse-after-str=FIXED_STRING_SEPARATOR]
            [--parser-after-col=COLUMN] FILE
...
      FILE          log file
  -d, --drain       use DRAIN to extract log patterns
  -f, --follow      output appended data as the file grows
  -h, --help        Show this help message and exit.
  -n, --lines=NUM   output the last NUM lines, instead of the last 10; or use
                      -n 0 to output starting from beginning
      --parse-after-str=FIXED_STRING_SEPARATOR
                    when using DRAIN remove the left part of a log line up to
                      after the FIXED_STRING_SEPARATOR
      --parser-after-col=COLUMN
                    when using DRAIN remove the left part of a log line up to
                      COLUMN
  -V, --version     Print version information and exit.
      --verbose     Verbose output, mostly for DRAIN or errors
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --version
Versioned Command 1.0
Picocli 4.6.3
JVM: 19 (Amazon.com Inc. OpenJDK 64-Bit Server VM 19+36-FR)
OS: Mac OS X 12.6 x86_64
----
By default, the tool acts similarly to `tail` and outputs the file to stdout.
The tool can follow a file if the `--follow` option is passed.
However, when run with `--drain`, this tool classifies log lines using DRAIN and
outputs the identified clusters.
Note that this tool doesn't handle multiline log messages (like logs that contain a stacktrace).
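
The `--parse-after-str` and `--parser-after-col` options listed in the help output drop the left part of each line before DRAIN sees it. The sketch below shows what such preprocessing could look like. The class `LinePrefix` and its methods are hypothetical. The handling of lines that lack the separator, and the use of the first occurrence of the separator, are assumptions rather than the tool's documented behavior.

```java
public class LinePrefix {
    /** Returns the part of the line after the first occurrence of separator. */
    public static String afterSeparator(String line, String separator) {
        int index = line.indexOf(separator);
        // Assumption: a line without the separator is kept unchanged.
        return index < 0 ? line : line.substring(index + separator.length());
    }

    /** Returns the line without its first `column` characters. */
    public static String afterColumn(String line, int column) {
        if (column <= 0) return line;
        // Assumption: a line shorter than the column becomes empty.
        return column >= line.length() ? "" : line.substring(column);
    }

    public static void main(String[] args) {
        // Made-up line shaped like the entries of the SSH data set
        String line = "Dec 10 06:55:46 LabSZ sshd[24200]: Invalid user test from 192.0.2.1";
        System.out.println(afterSeparator(line, "]: "));
        System.out.println(afterColumn(line, 16));
    }
}
```

Stripping the timestamp and process id this way keeps highly variable tokens out of the templates, so lines that differ only in their prefix can land in the same cluster.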
On the SSH log data set, it can be used this way:
[source, shell]
----
$ java -jar build/libs/drain-java-1.0-SNAPSHOT-all.jar \
    -d \ <1>
    -n 0 \ <2>
    --parse-after-str "]: " \ <3>
    build/resources/test/SSH.log <4>
----
<1> Identify patterns in the log
<2> Start from the beginning of the file (otherwise it starts from the last 10 lines)
<3> Remove the left part of the log line (`Dec 10 06:55:46 LabSZ sshd[24200]: `), i.e. effectively
ignore some variable elements like the time.
<4> The log file
.Log pattern clusters and their occurrences
[source]
--------
---- Done processing file. Total of 655147 lines, done in 1.588 s, 51 clusters <1>
0010 (size 140768): Failed password for <*> from <*> port <*> ssh2 <2>
0009 (size 140701): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0007 (size 68958): Connection closed by <*> [preauth]
0008 (size 46642): Received disconnect from <*> 11: <*> <*> <*>
0014 (size 37963): PAM service(sshd) ignoring max retries; <*> > 3
0012 (size 37298): Disconnecting: Too many authentication failures for <*> [preauth]
0013 (size 37029): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0011 (size 36967): message repeated <*> times: [ Failed password for <*> from <*> port <*> ssh2]
0006 (size 20241): Failed <*> for invalid user <*> from <*> port <*> ssh2
0004 (size 19852): pam unix(sshd:auth): check pass; user unknown
0001 (size 18909): reverse mapping checking getaddrinfo for <*> <*> failed - POSSIBLE BREAK-IN ATTEMPT!
0002 (size 14551): Invalid user <*> from <*>
0003 (size 14551): input userauth request: invalid user <*> [preauth]
0005 (size 14356): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*>
0018 (size 1289): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*>
0024 (size 952): fatal: Read from socket failed: Connection reset by peer [preauth]
...
--------
<1> 51 _types_ of logs were identified from 655147 lines in 1.588s
<2> There were `140768` similar log messages with this pattern, with `3` positions
where the token is identified as parameter `<*>`.
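
For post-processing such a report, the following sketch reads one cluster line back into its id, size, and template, then counts the parameter positions. The class `ClusterLine` is hypothetical, and the line format is inferred from the sample output above. Since the tool's output may change, treat the regular expression as an assumption.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClusterLine {
    // Inferred shape: "<id> (size <count>): <template>"
    private static final Pattern FORMAT =
            Pattern.compile("^(\\d+) \\(size (\\d+)\\): (.*)$");

    public final int id;
    public final int size;
    public final String template;

    private ClusterLine(int id, int size, String template) {
        this.id = id;
        this.size = size;
        this.template = template;
    }

    public static ClusterLine parse(String line) {
        Matcher matcher = FORMAT.matcher(line.trim());
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Not a cluster line: " + line);
        }
        return new ClusterLine(Integer.parseInt(matcher.group(1)),
                               Integer.parseInt(matcher.group(2)),
                               matcher.group(3));
    }

    /** Number of positions where the template holds the parameter marker. */
    public long wildcardCount() {
        return Arrays.stream(template.split(" ")).filter("<*>"::equals).count();
    }

    public static void main(String[] args) {
        ClusterLine line = ClusterLine.parse(
                "0010 (size 140768): Failed password for <*> from <*> port <*> ssh2");
        System.out.println(line.size + " messages, " + line.wildcardCount() + " parameters");
    }
}
```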
On the same dataset, the Java implementation ran roughly 10 times faster than Drain3.
As my implementation does not yet have masking, the mask configuration was removed from the
Drain3 setup for this comparison.
=== From Java
This tool is not yet intended to be used as a library, but for the curious
the DRAIN algorithm can be used this way:
.Minimal DRAIN example
[source, java]
----
var drain = Drain.drainBuilder()
                 .additionalDelimiters("_")
                 .depth(4)
                 .build();

// Files.lines keeps the file open, so close the stream once all lines are parsed
try (var lines = Files.lines(Paths.get("build/resources/test/SSH.log"),
                             StandardCharsets.UTF_8)) {
    lines.forEach(drain::parseLogMessage);
}

// do something with clusters
drain.clusters();
----
== Status
The pieces of the puzzle are coming in no particular order. I first bootstrapped the code from a simple Java
file, then wrote a Java implementation of Drain. Here is what I would like to do next.
.Todo
- [ ] More unit tests
- [x] Wire things together
- [ ] More documentation
- [x] Implement _tail follow_ mode (currently in drain mode the whole file is read and stops once finished)
- [ ] In follow drain mode, dump clusters on forced exit (e.g. when hitting `ctrl`+`c`)
- [x] Start reading from the last x lines (like `tail -n 30`)
- [ ] Implement log masking (e.g. logs containing an email or an IP address, which may be considered private data)
.For later
- [ ] Json message field extraction
- [ ] How to handle prefixes: dates, log level, etc.; possibly using masking
- [ ] Investigate marker with specific behavior, e.g. log level severity
- [ ] Investigate log with stacktraces (likely multiline)
- [ ] Improve handling of very long lines
- [ ] Logback appender with micrometer counter
== Motivation
I was inspired by a https://sayr.us/log-pattern-recognition/logmine/[blog article on LogMine from one of my colleagues]
(many thanks to him for doing the initial research and explaining the concepts). We were both impressed by the log
pattern extraction of https://docs.datadoghq.com/logs/explorer/patterns/[Datadog's Log explorer], and his blog post
sparked my interest.
After some discussion together, we saw that Drain was a bit superior to LogMine.
Googling Drain in Java didn't yield any results, although I certainly didn't search exhaustively;
regardless, this triggered the idea to implement this algorithm in Java.
== References
This Drain implementation is mostly a port of https://github.com/IBM/Drain3[Drain3],
which was done by IBM folks (_David Ohana_, _Moshik Hershcovitch_). IBM's Drain3 is a fork of the
https://github.com/logpai/logparser[original work] done by the LogPai team, based on the paper by
_Pinjia He_, _Jieming Zhu_, _Zibin Zheng_, and _Michael R. Lyu_.
_I didn't follow up on other contributors of these projects; reach out if you think you have been omitted._
For reference, here are the links I looked at:
* https://logparser.readthedocs.io/
* https://github.com/logpai/logparser
* https://github.com/IBM/Drain3
* https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf
(a copy of this publication is available link:doc/pjhe_icws2017.pdf[here])