SkyAPM / aiops-engine-for-skywalking

This is an incubating repository of the Apache SkyWalking AIOps Engine
https://github.com/apache/skywalking/discussions/8883
Apache License 2.0

[Algorithm] Drain output aggregate log cluster count over time #11

Open Superskyyy opened 2 years ago

Superskyyy commented 2 years ago

Once we have the log clusters learnt by Drain, we can enable anomaly detection without needing additional algorithms, simply by plotting the cluster counts over time.

The idea is simple: if the count for a type of log surges or suddenly drops at some point in time, it may indicate an anomaly given its content; it is then up to the human operator to decide further (it may just be a normal increase in access).

So we essentially generate one metric per log cluster and plot them in the SkyWalking UI.

See the image below for what I mean: [image]

It's just an idea for now; since most of the integration work is on the UI side, the metrics calculation should probably also be done on the SkyWalking side before visualization.

wu-sheng commented 2 years ago

Logs usually carry service information, and you need to define what the cluster concept means here. Also, from the statistics and anomaly-detection perspectives, visualization is the final step to show what is detected. Logs are queryable and service-oriented.

Aggregating logs over time seems a practical way to detect abnormal status, and this could be done simply through MAL + LAL + Alerting, rather than shipping logs to the AI core. We should define the boundaries of the AIEngine.

Superskyyy commented 2 years ago

> Logs usually carry service information, and you need to define what the cluster concept means here. Also, from the statistics and anomaly-detection perspectives, visualization is the final step to show what is detected. Logs are queryable and service-oriented.

By log cluster, I mean the lines below. Each line is a cluster with a template for this service, and every single log reported by the service will belong to one such category. The size parameter shows the number of logs belonging to the cluster as learnt by the algorithm; it's already there, so we just need a simple count keyed by the timestamp info (which is also already there in the OAP-side log index).

Taking the first line together with the log timestamps from the OAP log index, a counter function can return the full trend for this log cluster, e.g. [104, 208, 306, 407, 200, 900, ...]; then it's a plotting job (see the counting sketch after the listing below).

--- Done processing file in 25.76 sec. Total of 655147 lines, rate 25432.1 lines/sec, 51 clusters
ID=10    : size=140768    : Failed password for <*> from <IP> port <NUM> ssh2
ID=9     : size=140701    : pam unix(sshd auth) authentication failure; logname= uid=<NUM> euid=<NUM> tty=ssh ruser= <*> <*>
ID=7     : size=68958     : Connection closed by <IP> [preauth]
ID=8     : size=46642     : Received disconnect from <IP> <NUM> <*> <*> <*>
ID=14    : size=37963     : PAM service(sshd) ignoring max retries; <NUM> > <NUM>
ID=12    : size=37298     : Disconnecting Too many authentication failures for <*> [preauth]
ID=13    : size=37029     : PAM <NUM> more authentication <*> logname= uid=<NUM> euid=<NUM> tty=ssh ruser= <*> <*>
ID=11    : size=36967     : message repeated <NUM> times <Averylonglist[]>
ID=6     : size=20241     : Failed <*> for invalid user <*> from <IP> port <NUM> ssh2
ID=4     : size=19852     : pam unix(sshd auth) check pass; user unknown
ID=1     : size=18909     : reverse mapping checking getaddrinfo for <*> [<IP>] failed - POSSIBLE BREAK-IN ATTEMPT!
ID=2     : size=14551     : Invalid user <*> from <IP>
ID=3     : size=14551     : input userauth request invalid user <*> [preauth]
ID=5     : size=14356     : pam unix(sshd auth) authentication failure; logname= uid=<NUM> euid=<NUM> tty=ssh ruser= <*>
ID=18    : size=1289      : PAM <NUM> more authentication <*> logname= uid=<NUM> euid=<NUM> tty=ssh ruser= <*>
ID=24    : size=952       : fatal Read from socket failed Connection reset by peer [preauth]
ID=19    : size=932       : error Received disconnect from <IP> <NUM> <*> <*> <*> <*> <*> <*> [preauth]
ID=15    : size=838       : Did not receive identification string from <IP>
ID=17    : size=595       : Received disconnect from <IP> <NUM> <*> <*> <*> <*> <*> <*>
ID=31    : size=497       : Address <IP> maps to <*> but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
ID=20    : size=182       : Accepted password for <*> from <IP> port <NUM> ssh2
ID=21    : size=182       : pam unix(sshd session) session opened for user <*> by (uid=<NUM>)
ID=22    : size=182       : pam unix(sshd session) session closed for user <*>
ID=16    : size=177       : error Received disconnect from <IP> <NUM> com.jcraft.jsch.JSchException Auth <*> [preauth]
ID=32    : size=108       : Received disconnect from <IP> <NUM> [preauth]
ID=50    : size=92        : Received disconnect from <IP> <NUM> Normal Shutdown, Thank you for playing [preauth]
ID=42    : size=87        : Received disconnect from <IP> <NUM> <*> <*> <*> [preauth]
ID=46    : size=60        : Received disconnect from <IP> <NUM> disconnect [preauth]
ID=28    : size=30        : Invalid user <*> <*> from <IP>
ID=29    : size=30        : input userauth request invalid user <*> <*> [preauth]
ID=30    : size=30        : Failed password for invalid user <*> <*> from <IP> port <NUM> ssh2
ID=36    : size=13        : Invalid user from <IP>
ID=37    : size=13        : input userauth request invalid user [preauth]
ID=38    : size=13        : Failed <*> for invalid user from <IP> port <NUM> ssh2
ID=34    : size=7         : Bad protocol version identification <*> <*> from <IP> port <NUM>
ID=35    : size=7         : Bad protocol version identification 'GET <*> HTTP/<NUM>.<NUM>' from <IP> port <NUM>
ID=39    : size=7         : Bad protocol version identification <*> from <IP> port <NUM>
ID=33    : size=6         : fatal no hostkey alg [preauth]
ID=40    : size=6         : error Received disconnect from <IP> <NUM> <*> <*> <*> <*> [preauth]
ID=44    : size=6         : error connect to <IP> port <NUM> failed.
ID=23    : size=3         : fatal Write failed Connection reset by peer [preauth]
ID=43    : size=3         : error Received disconnect from <IP> <NUM> com.jcraft.jsch.JSchException timeout in waiting for rekeying process. [preauth]
ID=47    : size=3         : Server listening on <IP> port <NUM>.
ID=48    : size=3         : Server listening on port <NUM>.
ID=25    : size=2         : error Received disconnect from <IP> <NUM> User request [preauth]
ID=26    : size=1         : Bad packet length <NUM>. [preauth]
ID=27    : size=1         : Disconnecting Packet corrupt [preauth]
ID=41    : size=1         : Received disconnect from <IP> <NUM>
ID=45    : size=1         : Corrupted MAC on input. [preauth]
ID=49    : size=1         : Bad protocol version identification 'CONNECT xui.ptlogin2.qq.com <NUM> HTTP/<NUM>.<NUM>' from <IP> port <NUM>
ID=51    : size=1         : syslogin perform logout logout() returned an error
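
As a rough illustration of the counting step (a minimal sketch, not existing code in this repo; the `bucket_counts` name is made up), the per-cluster trend could be produced by bucketing the (timestamp, cluster_id) pairs that Drain already emits:

```python
from collections import defaultdict


def bucket_counts(parsed_logs, bucket_seconds=60):
    """parsed_logs: iterable of (epoch_seconds, cluster_id) pairs."""
    # cluster_id -> bucket_start -> count
    counts = defaultdict(lambda: defaultdict(int))
    for epoch, cluster_id in parsed_logs:
        bucket = int(epoch // bucket_seconds) * bucket_seconds
        counts[cluster_id][bucket] += 1
    return counts


if __name__ == "__main__":
    sample = [(1650000000, 10), (1650000030, 10), (1650000070, 9), (1650000075, 10)]
    trends = bucket_counts(sample)
    for cluster_id, series in trends.items():
        trend = [series[b] for b in sorted(series)]
        print(f"cluster {cluster_id}: {trend}")  # e.g. cluster 10: [2, 1]
```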

> Aggregating logs over time seems a practical way to detect abnormal status, and this could be done simply through MAL + LAL + Alerting, rather than shipping logs to the AI core. We should define the boundaries of the AIEngine.

Thank you for the discussion! @wu-sheng

As you said, aggregating the logs over time can already be computed on the OAP side with the existing analysis languages. But that relies on users having domain knowledge of the service they monitor; for a specific, known log pattern it is very useful and can even be more accurate than an AI solution, but it does not scale to unknown log types, especially when the operator is not familiar with the service.

The AIOps version can be switched on only when users decide to cluster logs from a service, because it's merely a byproduct of the log clustering algorithm, as explained above. I do not intend it to compromise or replace OAP-side functionality; the advantage is that the rule-based solution and the AI solution can easily work together.

Aggregating the logs over time via the AI Engine has one real charm that I find valuable: it is fully automatic and covers every single corner case of log cluster template that can occur in a service. It comes as a by-product of the log clustering that users already choose to enable when they use the AI engine (for example, the Drain algorithm can reduce 15 million logs to fewer than 80 log clusters; during clustering, we already know which cluster has N logs, so a time-based aggregation produces the metrics with very low computational overhead).

wu-sheng commented 2 years ago

My main question is: why is this related to the UI, and how? I am not following the logical relationship between this UI proposal and your explanation of the log anomaly detection mechanism.

wu-sheng commented 2 years ago

If you are going to do scanning reads, a.k.a. full-table scans, the performance would be low, and I am afraid you couldn't get all the data in time.

Superskyyy commented 2 years ago

> My main question is: why is this related to the UI, and how? I am not following the logical relationship between this UI proposal and your explanation of the log anomaly detection mechanism.

My bad; the core connection is in the final integration, where the SkyWalking UI will be adapted to provide a new log anomaly viewer showing the resulting metrics. I hope this diagram clarifies my design.

[image: design diagram]

> If you are going to do scanning reads, a.k.a. full-table scans, the performance would be low, and I am afraid you couldn't get all the data in time.

I see. It sounds better to delegate the counting job to the AI engine: each Drain clustering instance maintains counters, keyed by timestamp, as incoming log data is parsed, and then sends the aggregated meters to OAP. Does this sound feasible?
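
A minimal sketch of that counter idea, assuming the Drain wrapper keeps per-cluster counters keyed by time bucket and flushes them periodically; `send_meters_to_oap` is a hypothetical hook, not an existing OAP client API:

```python
import time
from collections import Counter


class ClusterCounter:
    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self.counts = Counter()  # (cluster_id, bucket_start) -> count

    def record(self, cluster_id, epoch_seconds=None):
        epoch_seconds = epoch_seconds if epoch_seconds is not None else time.time()
        bucket = int(epoch_seconds // self.bucket_seconds) * self.bucket_seconds
        self.counts[(cluster_id, bucket)] += 1

    def flush(self):
        # Convert the accumulated counters into meter-like records and reset.
        meters = [
            {"cluster_id": cid, "bucket": bucket, "count": count}
            for (cid, bucket), count in self.counts.items()
        ]
        self.counts.clear()
        return meters


def send_meters_to_oap(meters):
    # Placeholder: in the real integration these records would be reported
    # to OAP through its meter/metrics receiving protocol.
    print(f"would send {len(meters)} meter records to OAP")
```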

wu-sheng commented 2 years ago

The AIEngine could send metrics to OAP, and OAP could export (changes required) raw log data to the AIEngine. But what do you mean by log templates? And what do you mean by cluster and clusters? The current logs view is service- or instance(pod)-oriented; it is nearly impossible for a human to read logs across services. Could you explain how a user would use this new widget on the UI?

Superskyyy commented 2 years ago

> The AIEngine could send metrics to OAP, and OAP could export (changes required) raw log data to the AIEngine. But what do you mean by log templates? And what do you mean by cluster and clusters? The current logs view is service- or instance(pod)-oriented; it is nearly impossible for a human to read logs across services. Could you explain how a user would use this new widget on the UI?

I think I see the point of confusion! Let me clarify: the term "cluster" here collides with the cloud-computing cluster term.

TL;DR: a Cluster means a group of similar logs within a service/pod; a Template means that cluster's log format with the parameters stripped out.

The definition of clusters here is not related to service clusters. It's a general data-science term that refers to a group of entities (logs within a service/instance) that are very similar, most likely differing only in their parameters. Think of it as reversing 10,000,000 logs back to their base parts and sorting them into fewer than 100 groups (clusters), so a human can quickly and clearly see which types of logs might contribute to a service failure.

logger.warn('The user %s has input the wrong user name %s at %s for %d times', sky, superskyyy, localhost, 3);
-> 'The user sky has input the wrong user name superskyyy at localhost for 3 times'
-> Back to -> 'The user [:*:] has input the wrong user name [:*:] at [:IP:] for [:N:] times'

We strictly analyze logs service by service (or pod by pod).

For example, from the output of the Drain algorithm I picked 3 example clusters from the logs of a single service; they were learnt from 15 million logs of a single Hadoop Distributed File System namenode. You can see that about 7 million logs are basically the same thing with different params.

ID=1     : size=7719153   : <DateTime> <INFO> org.apache.hadoop.hdfs.StateChange DIR* completeFile <*> temporary/<NUM>/ temporary/attempt <NUM> <NUM> m <NUM> <NUM>/part-<NUM> is closed by DFSClient NONMAPREDUCE <NUM> <NUM>

ID=10    : size=50693     : <DateTime> <INFO> BlockStateChange BLOCK* BlockManager ask <IP> <NUM> to delete <Averylonglist[]>

ID=23    : size=27598     : <DateTime> <INFO> org.apache.hadoop.hdfs.server.namenode.FSEditLog Rolling edit logs

Say we have 4 logs like below.

  1. 07-19 WARN Child process type UserManager issues scanning for group Admin with candidate [foo, bar, whatever, zhangsan, lisi], will last for 30 minutes
  2. 07-19 CRITICAL Child process type ABC potential failure at 192.168.0.1, system critical lasting 30 seconds.
  3. 07-19 WARN Child process type ABC potential failure at 192.168.0.1, system memory is not sufficient, used 9000MB, Max allowed 1GB.
  4. 07-19 WARN Child process type UserspManager issues scanning for group Admin with candidate [Test] will last for 1 minute.

The result will be 3 different groups (clusters): logs 1 and 4 form Cluster 1 (size=2), while logs 2 and 3 each form their own cluster.
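
A minimal sketch of how such a grouping could be reproduced with the open-source drain3 package (assuming its TemplateMiner API); note the exact grouping depends on Drain's similarity threshold and tokenization, so with default settings the four lines may not split into exactly the three clusters described above:

```python
from drain3 import TemplateMiner

lines = [
    "07-19 WARN Child process type UserManager issues scanning for group Admin "
    "with candidate [foo, bar, whatever, zhangsan, lisi], will last for 30 minutes",
    "07-19 CRITICAL Child process type ABC potential failure at 192.168.0.1, "
    "system critical lasting 30 seconds.",
    "07-19 WARN Child process type ABC potential failure at 192.168.0.1, "
    "system memory is not sufficient, used 9000MB, Max allowed 1GB.",
    "07-19 WARN Child process type UserspManager issues scanning for group Admin "
    "with candidate [Test] will last for 1 minute.",
]

miner = TemplateMiner()
for line in lines:
    result = miner.add_log_message(line)
    # Each parsed line reports which cluster it fell into and the learnt template.
    print(result["cluster_id"], result["template_mined"])
```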

I hope this clarifies the questions; thanks for reading. @wu-sheng

wu-sheng commented 2 years ago

Yes, thanks for the explanation. The flow is clear to me now.

wu-sheng commented 2 years ago

According to the flow, there is some background you should know. Considering production-environment performance, there are not many logs, as debug/info logs are disabled. So, if you need to detect anomalies in time and also avoid false alarms, you should choose a proper period and relative thresholds. Also, in languages like Java, error logs usually carry stack traces, so you should expect multiple lines per logical log.

Superskyyy commented 2 years ago

> According to the flow, there is some background you should know. Considering production-environment performance, there are not many logs, as debug/info logs are disabled. So, if you need to detect anomalies in time and also avoid false alarms, you should choose a proper period and relative thresholds. Also, in languages like Java, error logs usually carry stack traces, so you should expect multiple lines per logical log.

Point understood, will bear this in mind when testing, especially for tracebacks.

Superskyyy commented 1 year ago

We will use the Redis HyperLogLog algorithm to do the counting job and directly send the counter metrics to OAP.
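
A hedged sketch of the HyperLogLog counting idea using redis-py's pfadd/pfcount; since HLL estimates the number of distinct elements, each log entry is added under a unique identifier, with one HLL key per service, cluster, and time bucket (the key scheme and function names are illustrative only):

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis instance


def record_log(service, cluster_id, log_id, bucket_start):
    # Add the unique log id to the HLL for this cluster/time-bucket.
    key = f"logcount:{service}:{cluster_id}:{bucket_start}"
    r.pfadd(key, log_id)


def approximate_count(service, cluster_id, bucket_start):
    # Approximate number of distinct logs seen for this cluster/bucket.
    key = f"logcount:{service}:{cluster_id}:{bucket_start}"
    return r.pfcount(key)
```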

Superskyyy commented 1 year ago

This will be accomplished by keeping track of the template - service - log mapping during the clustering process, publishing updates to a topic, and then doing the aggregation.
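
An illustrative sketch of that mapping/topic idea under stated assumptions: each clustered log updates a (service, cluster_id) mapping and an update event is published to a topic (an in-process queue here as a stand-in for a real message broker), and a downstream consumer does the time-based aggregation:

```python
import queue
from collections import defaultdict

update_topic = queue.Queue()  # stand-in for a real topic/stream
cluster_sizes = defaultdict(int)  # (service, cluster_id) -> running size


def on_log_clustered(service, cluster_id, timestamp):
    # Called whenever Drain assigns an incoming log to a cluster.
    cluster_sizes[(service, cluster_id)] += 1
    update_topic.put({"service": service, "cluster_id": cluster_id, "ts": timestamp})


def aggregate(bucket_seconds=60):
    # Drain the topic and aggregate events into per-cluster time buckets.
    counts = defaultdict(int)
    while not update_topic.empty():
        event = update_topic.get()
        bucket = int(event["ts"] // bucket_seconds) * bucket_seconds
        counts[(event["service"], event["cluster_id"], bucket)] += 1
    return counts
```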