Open Superskyyy opened 2 years ago
Some additional information on the log merger: we need to merge clusters that have very few log hits into similar clusters, based on a similarity threshold calculated by some equation. I don't know which one yet; I'm going to figure that out using some heuristics.
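A rough sketch of what such a merger could look like, purely for discussion. Everything here is an assumption: the `merge_small_clusters` helper, the `min_hits` and `merge_th` parameters, and the position-wise token similarity are placeholders, since the actual equation for the threshold is still undecided. It also only moves the hit counts and does not rewrite the target template.

```python
from typing import Dict, List


def token_similarity(a: List[str], b: List[str]) -> float:
    """Fraction of positions with identical tokens (0.0 if lengths differ)."""
    if len(a) != len(b):
        return 0.0
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def merge_small_clusters(
    clusters: Dict[str, int],
    min_hits: int = 2,      # hypothetical: below this a cluster counts as "small"
    merge_th: float = 0.5,  # hypothetical fixed threshold, stands in for the real equation
) -> Dict[str, int]:
    """clusters maps a template string to its hit count."""
    large = {t: n for t, n in clusters.items() if n >= min_hits}
    merged = dict(large)
    for tmpl, hits in clusters.items():
        if hits >= min_hits:
            continue
        # Find the most similar large cluster for this small one.
        best, best_sim = None, 0.0
        for target in large:
            sim = token_similarity(tmpl.split(), target.split())
            if sim > best_sim:
                best, best_sim = target, sim
        if best is not None and best_sim >= merge_th:
            merged[best] += hits   # fold the small cluster into the similar one
        else:
            merged[tmpl] = hits    # nothing similar enough, keep it as-is
    return merged
```

A dynamic threshold would replace the fixed `merge_th` once we settle on how to compute it.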
The algorithm enhancements have turned out to be effective. In light of issue #14, which requires further structural modification of the Drain3 code base, we have heavily modified the MIT-licensed Drain3 implementation and intend to host it in our repo (algorithm files will keep the original MIT license header).
Background: Drain log parsing works best when it ingests only the log content, meaning we trim the rest with some simple regex or rule. Slicing the content accurately from
Dec 10 07:28:08 LabSZ sshd[24247]: Received disconnect from 112.95.230.3: 11: Bye Bye [preauth]
to
Received disconnect from 112.95.230.3: 11: Bye Bye [preauth]
requires prior knowledge of the delimiter, which I am 99% sure users don't care to provide. So we need to adapt Drain to be more robust.
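For reference, this is roughly what the "slice with prior knowledge" step looks like for the sample line above. It is a minimal sketch that assumes a syslog-style header (`Mon DD HH:MM:SS host process[pid]: `); the `HEADER` pattern and the `slice_content` name are made up for illustration, and any other log format would need its own rule, which is exactly the problem.

```python
import re

# Assumed header layout: "Mon DD HH:MM:SS host process[pid]: "
HEADER = re.compile(r"^\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\S+\s\S+?:\s")


def slice_content(raw_line: str) -> str:
    """Strip the assumed syslog-style header; return the line unchanged if it does not match."""
    return HEADER.sub("", raw_line, count=1)


raw = "Dec 10 07:28:08 LabSZ sshd[24247]: Received disconnect from 112.95.230.3: 11: Bye Bye [preauth]"
print(slice_content(raw))
```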
I found a potentially(?) major enhancement to the algorithm for RAW log parsing. The current test shown below yields much better clustering than the original unreadable results (over-convergence), but it also requires a tiny adjustment to the global similarity threshold. So the idea is that every cluster should have its own standard for accepting new templates, rather than a global constraint. (This is mentioned in the updated version of the research paper; it is not my invention.)
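To make the per-cluster idea concrete, here is a simplified sketch. It is not the Drain3 API and not the exact method from the paper: the `Cluster` class, `sim_th` field, and `match` function are hypothetical names, there is no prefix tree, and the similarity is just the fraction of identical token positions.

```python
from dataclasses import dataclass
from typing import List

WILDCARD = "<*>"


@dataclass
class Cluster:
    template: List[str]
    sim_th: float  # each cluster carries its own acceptance threshold
    size: int = 1


def similarity(template: List[str], tokens: List[str]) -> float:
    """Fraction of positions with identical tokens (0.0 if lengths differ)."""
    if len(template) != len(tokens):
        return 0.0
    if not tokens:
        return 1.0
    return sum(t == tok for t, tok in zip(template, tokens)) / len(tokens)


def match(clusters: List[Cluster], tokens: List[str], default_th: float = 0.4) -> Cluster:
    """Add tokens to the best cluster whose own threshold accepts them, else create one."""
    best, best_sim = None, -1.0
    for c in clusters:
        sim = similarity(c.template, tokens)
        # The check uses the cluster's threshold, not a global one.
        if sim >= c.sim_th and sim > best_sim:
            best, best_sim = c, sim
    if best is None:
        best = Cluster(list(tokens), default_th)
        clusters.append(best)
    else:
        # Replace differing positions with a wildcard, as Drain does for templates.
        best.template = [t if t == tok else WILDCARD for t, tok in zip(best.template, tokens)]
        best.size += 1
    return best
```

With this shape, a strict cluster (say `sim_th=0.9`) can refuse a line that a looser cluster at `0.4` would absorb, which is what should prevent the over-convergence.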
I will attempt to submit a patch to the upstream IBM/Drain3 repo and see if it's accepted.
BUT! To yield the most accurate result, we still need to implement a dynamic threshold calculation for the similarity function and a cluster merger.
Threshold 0.4 (Default, not best)
Threshold 0.3 compared to below baseline result, looks almost perfect
Original version without my patch, but sliced with prior knowledge, threshold 0.4 default